Multiagent Planning with Factored MDPs
Carlos Guestrin
Stanford University
Collaborative Multiagent Planning
Search and rescue, factory management, supply chain, firefighting, network routing, air traffic control
Long-term goals + multiple agents + coordinated decisions = collaborative multiagent planning
Exploiting Structure
Real-world problems have:
- Hundreds of objects
- Googols of states
Real-world problems have structure!
Approach: exploit the structured representation to obtain an efficient approximate solution
Real-time Strategy Game
- Peasants collect resources and build
- Footmen attack enemies
- Buildings train peasants and footmen
Joint Decision Space
Markov Decision Process (MDP) representation:
- State space: joint state x of the entire system
- Action space: joint action a = {a1,…, an} for all agents
- Reward function: total reward R(x,a)
- Transition model: dynamics of the entire system P(x′|x,a)
Policy
Policy: π(x) = a. At state x, take action a for all agents.
π(x0) = both peasants get wood
π(x1) = one peasant gets gold, the other builds a barracks
π(x2) = peasants get gold, footmen attack
Value of Policy
Value Vπ(x): expected long-term reward starting from x.
Starting from x0:
Vπ(x0) = E[R(x0) + γ R(x1) + γ² R(x2) + γ³ R(x3) + γ⁴ R(x4) + ⋯]
Future rewards discounted by γ ∈ [0,1)
(Figure: a sample trajectory x0 → x1 → x2 → x3 → x4 under π, with rewards R(xt) and alternative outcomes x1′, x1″.)
Optimal Long-term Plan
Optimal policy: π*(x)
Optimal value function: V*(x)
Bellman equations:
Q*(x,a) = R(x,a) + γ Σ_{x′} P(x′|x,a) V*(x′)
V*(x) = max_a Q*(x,a)
Optimal policy: π*(x) = argmax_a Q*(x,a)
Solving an MDP
Solve the Bellman equations → optimal value V*(x) → optimal policy π*(x)
Many algorithms solve the Bellman equations:
- Policy iteration [Howard ’60, Bellman ‘57]
- Value iteration [Bellman ‘57]
- Linear programming [Manne ’60]
- …
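To make these solvers concrete, here is a minimal value-iteration sketch for an explicit (unfactored) MDP. The two-state, two-action model, rewards, and discount below are hypothetical, invented purely for illustration:

```python
import numpy as np

# Hypothetical explicit MDP: 2 states, 2 actions.
# P[a][x][x'] = P(x'|x,a); R[x][a] = immediate reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # action 0
              [[0.5, 0.5], [0.1, 0.9]]])  # action 1
R = np.array([[1.0, 0.0], [0.0, 2.0]])    # R[x, a]
gamma = 0.9

V = np.zeros(2)
while True:
    # Q(x,a) = R(x,a) + gamma * sum_x' P(x'|x,a) V(x')
    Q = R + gamma * np.einsum('axy,y->xa', P, V)
    V_new = Q.max(axis=1)                  # Bellman backup
    if np.abs(V_new - V).max() < 1e-8:
        break
    V = V_new

policy = Q.argmax(axis=1)                  # pi*(x) = argmax_a Q(x,a)
print(V, policy)
```

Policy iteration and the LP formulation below compute the same fixed point; value iteration is simply the shortest to sketch.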
LP Solution to MDP
Value computed by linear programming [Manne ’60]:
minimize: Σ_x V(x)
subject to: V(x) ≥ Q(x,a), for all x, a
- One variable V(x) for each state
- One constraint for each state x and action a
- Polynomial time solution
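A sketch of this LP on the same hypothetical two-state MDP, using scipy.optimize.linprog (assuming SciPy is available; the slide itself commits to no particular solver):

```python
import numpy as np
from scipy.optimize import linprog

# Same hypothetical 2-state, 2-action MDP as above.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0], [0.0, 2.0]])
gamma, n_states, n_actions = 0.9, 2, 2

# minimize sum_x V(x)  subject to  V(x) >= R(x,a) + gamma * P(.|x,a) . V
# Rewritten for linprog's A_ub @ v <= b_ub form:
#   (gamma * P(.|x,a) - e_x) . V <= -R(x,a)
A_ub, b_ub = [], []
for x in range(n_states):
    for a in range(n_actions):
        A_ub.append(gamma * P[a, x] - np.eye(n_states)[x])
        b_ub.append(-R[x, a])

res = linprog(c=np.ones(n_states), A_ub=A_ub, b_ub=b_ub,
              bounds=[(None, None)] * n_states)
print(res.x)  # optimal value function V*(x)
```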
Planning under Bellman’s “Curse”
Planning is polynomial in #states and #actions, but:
- #states is exponential in the number of variables
- #actions is exponential in the number of agents
Efficient approximation by exploiting structure!
Structure in Representation: Factored MDP
[Boutilier et al. ’95]
State, dynamics, decisions, and rewards are represented as a dynamic Bayesian network over time slices t and t+1: state variables Peasant, Footman, Enemy, Gold (P′, F′, E′, G′ at t+1), actions A_Peasant, A_Build, A_Footman, and reward R.
Example factor: P(F′ | F, G, A_Build, A_Footman)
Complexity of representation: exponential in #parents (worst case)
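A minimal sketch of one such factored transition component. The attribute values, action names, and probabilities are hypothetical; the point is that the factor's size depends only on its parents, not on the full joint state:

```python
import itertools

# A single DBN factor: P(F' | F, G, A_Build, A_Footman).
# Toy rule: the footman is more likely to be alive next step if gold is
# high and his action is 'defend'; a_build is a parent but is ignored here.
def p_footman_next(f_next, f, g, a_build, a_footman):
    p_alive = 0.5 + 0.2 * g + 0.2 * (a_footman == 'defend') - 0.3 * (f == 0)
    p_alive = min(max(p_alive, 0.0), 1.0)
    return p_alive if f_next == 1 else 1.0 - p_alive

# Sanity check: the factor is a proper conditional distribution for every
# assignment of its parents (exponential only in #parents).
for f, g, ab, af in itertools.product([0, 1], [0, 1],
                                      ['train', 'idle'],
                                      ['attack', 'defend']):
    total = sum(p_footman_next(fn, f, g, ab, af) for fn in [0, 1])
    assert abs(total - 1.0) < 1e-9
```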
Structured Value Function? Factored MDP ⇒ Structure in V*?
(Figure: the factored MDP unrolled over time slices t, t+1, t+2, t+3, with variables X, Y, Z and reward R in each slice; dependencies spread across slices, so V* is not exactly factored.)
Almost! A structured V yields a good approximate value function.
Structured Value Functions
Linear combination of restricted-domain functions [Bellman et al. ‘63] [Tsitsiklis & Van Roy ’96] [Koller & Parr ’99,’00] [Guestrin et al. ’01]:
Ṽ(x) = Σ_i w_i h_i(x)
Each h_i is the status of small part(s) of a complex system:
- State of footman and enemy
- Status of barracks
- Status of barracks and state of footman
Structured V ⇒ structured Q: Q̃ = Σ_i Q_i, where each Q_i depends on a small # of A_i’s and X_j’s
Must find w giving a good approximate value function
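A small sketch of a restricted-domain basis in code. The basis functions, state attributes, and weights are hypothetical stand-ins:

```python
# Linear value function over restricted-domain basis functions:
# each h_i looks only at a small part of the joint state x.
def h_footman(x):          # depends only on footman and enemy health
    return 1.0 if x['F.health'] > x['E.health'] else 0.0

def h_barracks(x):         # depends only on barracks status
    return 1.0 if x['barracks'] == 'built' else 0.0

basis = [h_footman, h_barracks]
w = [2.5, 1.0]             # weights, as would be found by the LP

def v_approx(x):
    # V~(x) = sum_i w_i h_i(x)
    return sum(wi * hi(x) for wi, hi in zip(w, basis))

x = {'F.health': 3, 'E.health': 1, 'barracks': 'built'}
print(v_approx(x))  # 3.5
```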
Approximate LP Solution
[Schweitzer and Seidmann ‘85]
minimize: Σ_x Σ_i w_i h_i(x)
subject to: Σ_i w_i h_i(x) ≥ Q(x,a), for all x, a
- One variable w_i for each basis function → polynomial number of LP variables
- One constraint for every state and action → exponentially many LP constraints
Representing Exponentially Many Constraints
[Guestrin, Koller, Parr ’01]
Σ_i w_i h_i(x) ≥ Q(x,a), for all x, a
⇔ 0 ≥ max_{x,a} [Q(x,a) − Σ_i w_i h_i(x)]
Exponentially many linear constraints = one nonlinear constraint, but now the maximization ranges over an exponential space.
Variable Elimination
Use variable elimination to maximize over the state space [Bertele & Brioschi ‘72]:
max_{A,B,C,D} f1(A,B) + f2(A,C) + f3(C,D) + f4(B,D)
= max_{A,B,C} f1(A,B) + f2(A,C) + max_D [f3(C,D) + f4(B,D)]
= max_{A,B,C} f1(A,B) + f2(A,C) + g1(B,C)
Here we need only 23, instead of 63, sum operations.
Maximization is only exponential in the largest factor; tree-width characterizes the complexity:
- Graph-theoretic measure of “connectedness”
- Arises in many settings: integer programming, Bayes nets, computational geometry, …
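The elimination above, written out for binary variables with hypothetical payoff tables; a brute-force check confirms that eliminating D first gives the same maximum:

```python
import itertools

# max over (A,B,C,D) of f1(A,B)+f2(A,C)+f3(C,D)+f4(B,D), variables binary.
f1 = {(a, b): a + 2 * b for a in (0, 1) for b in (0, 1)}
f2 = {(a, c): 3 * a * c for a in (0, 1) for c in (0, 1)}
f3 = {(c, d): c - d for c in (0, 1) for d in (0, 1)}
f4 = {(b, d): 2 * b * d for b in (0, 1) for d in (0, 1)}

# Eliminate D: g1(B,C) = max_D [f3(C,D) + f4(B,D)]
g1 = {(b, c): max(f3[c, d] + f4[b, d] for d in (0, 1))
      for b in (0, 1) for c in (0, 1)}

# Remaining maximization over A, B, C only.
best = max(f1[a, b] + f2[a, c] + g1[b, c]
           for a, b, c in itertools.product((0, 1), repeat=3))

# Brute force over all 16 joint assignments agrees.
brute = max(f1[a, b] + f2[a, c] + f3[c, d] + f4[b, d]
            for a, b, c, d in itertools.product((0, 1), repeat=4))
assert best == brute
print(best)
```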
Structured Value Function
max over A_1,…,A_m and X_1,…,X_n of Σ_i Q_i − Σ_i w_i h_i(x)
Each Q_i depends on a small # of A_i’s and X_j’s; each h_i depends on a small # of X_j’s, so variable elimination applies directly to this joint state-action maximization.
Representing the Constraints
Use variable elimination to represent the constraints:
0 ≥ max_{A,B,C,D} f1(A,B) + f2(A,C) + f3(C,D) + f4(B,D)
becomes, with a new LP variable g1(B,C):
g1(B,C) ≥ f3(C,D) + f4(B,D), for all B, C, D
0 ≥ max_{A,B,C} [f1(A,B) + f2(A,C) + g1(B,C)]
Number of constraints exponentially smaller!
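A back-of-the-envelope sketch of the resulting constraint counts, for a hypothetical chain of pairwise functions over binary variables, the simplest instance of the (n+1−k)·2^k versus 2^n scaling shown on the next slide:

```python
# Constraint counts for a chain f1(X1,X2) + f2(X2,X3) + ... + f_{n-1}(X_{n-1},X_n)
# over binary variables. Explicit LP: one constraint per joint assignment.
# Factored LP: eliminating X_n, ..., X_2 in order introduces one new
# function g_i(X_i) per step, with constraints
#   g_i(x_i) >= f_i(x_i, x_{i+1}) + g_{i+1}(x_{i+1})
# for each of the 4 parent assignments, plus 2 final constraints 0 >= g_1(x_1).
def explicit_constraints(n):
    return 2 ** n

def factored_constraints(n):
    return 4 * (n - 1) + 2

for n in (4, 8, 16, 32):
    print(n, explicit_constraints(n), factored_constraints(n))
# The explicit count explodes exponentially; the factored count is linear in n.
```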
Understanding Scaling Properties
Number of LP constraints: explicit LP has 2^n; factored LP has (n+1−k)·2^k, where k = tree-width.
(Chart: number of constraints vs. number of variables, 2–16. The explicit LP curve grows exponentially toward 40,000; the factored LP curves for k = 3, 5, 8, 10, 12 grow slowly.)
Network Management Problem
Topologies: ring, star, ring of rings, k-grid
- Computer status = {good, dead, faulty}
- Dead neighbors increase the probability of dying
- Each computer runs processes; reward for successful processes
- Each SysAdmin takes a local action = {reboot, not reboot}
A problem with n machines has 9^n states and 2^n actions.
Running Time
(Chart: running time in seconds vs. number of machines, up to 12. Series: Ring, exact solution; Ring, single basis, k = 4; Star, single basis, k = 4; 3-grid, single basis, k = 5; Star, pair basis, k = 4; Ring, pair basis, k = 8. k = tree-width. The exact solution is feasible only for a handful of machines; the factored LP scales much further.)
Summary of Algorithm
1. Pick local basis functions h_i
2. Factored LP computes the value function
3. Policy is argmax_a of Q
Large-scale Multiagent Coordination
Efficient algorithm computes V. The action at state x is: argmax_a Q(x,a)
But: the # of actions is exponential, and this assumes complete observability and full communication.
Distributed Q Function
[Guestrin, Koller, Parr ’02]
Q(A1,…,A4, X1,…,X4) ≈ Q1(A1,A4, X1,X4) + Q2(A1,A2, X1,X2) + Q3(A2,A3, X2,X3) + Q4(A3,A4, X3,X4)
Each agent maintains a part of the Q function: a distributed Q function.
Multiagent Action Selection
Distributed Q function: Q1(A1,A4, X1,X4), Q2(A1,A2, X1,X2), Q3(A2,A3, X2,X3), Q4(A3,A4, X3,X4)
→ Instantiate the current state x
→ Maximal action: argmax_a
Instantiate Current State x
Conditioning on the observed state turns each Qi into a function of actions only:
Q1(A1,A4), Q2(A1,A2), Q3(A2,A3), Q4(A3,A4)
Limited observability: agent i only observes the variables in Qi (e.g., agent 2 observes only X1 and X2).
Multiagent Action Selection
Distributed Q function → instantiate the current state x → maximal action:
argmax_a [Q1(A1,A4) + Q2(A1,A2) + Q3(A2,A3) + Q4(A3,A4)]
Coordination Graph
Use variable elimination for the maximization:
max_{A1,A2,A3,A4} Q1(A1,A4) + Q2(A1,A2) + Q3(A2,A3) + Q4(A3,A4)
= max_{A1,A2,A4} Q1(A1,A4) + Q2(A1,A2) + max_{A3} [Q3(A2,A3) + Q4(A3,A4)]
= max_{A1,A2,A4} Q1(A1,A4) + Q2(A1,A2) + g1(A2,A4)
Limited communication suffices for the optimal action choice: the communication bandwidth is the tree-width of the coordination graph.
Example conditional strategy for agent 3 (value of the optimal A3 action):

A2      A4      Value
Attack  Attack   5
Attack  Defend   6
Defend  Attack   8
Defend  Defend  12
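A sketch of how agent 3's conditional strategy and a table like the one above arise during elimination. The payoff tables here are hypothetical, not the ones behind the slide's numbers:

```python
# When agent 3 is eliminated, it conditions on its neighbors A2 and A4:
#   g1(a2, a4) = max_{a3} [Q3(a2, a3) + Q4(a3, a4)]
# and records the maximizing a3 as its conditional strategy.
ACTIONS = ('attack', 'defend')
Q3 = {(a2, a3): 3 * (a2 == a3)
      for a2 in ACTIONS for a3 in ACTIONS}
Q4 = {(a3, a4): 5 * (a3 == 'defend' and a4 == 'defend')
      for a3 in ACTIONS for a4 in ACTIONS}

g1, strategy = {}, {}
for a2 in ACTIONS:
    for a4 in ACTIONS:
        vals = {a3: Q3[a2, a3] + Q4[a3, a4] for a3 in ACTIONS}
        strategy[a2, a4] = max(vals, key=vals.get)  # agent 3's best response
        g1[a2, a4] = vals[strategy[a2, a4]]         # value passed up the graph

print(strategy['defend', 'defend'], g1['defend', 'defend'])
```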
Coordination Graph Example
(Figure: an 11-agent coordination graph over A1,…,A11.)
- Trees don’t increase communication requirements
- Cycles require graph triangulation
Unified View: Function Approximation ⇔ Multiagent Coordination
A pairwise basis, Q1(A1,A4, X1,X4) + Q2(A1,A2, X1,X2) + Q3(A2,A3, X2,X3) + Q4(A3,A4, X3,X4), induces a coordination graph with edges between neighboring agents.
A single-agent basis, Q1(A1,X1) + Q2(A2,X2) + Q3(A3,X3) + Q4(A4,X4), induces no coordination edges.
Factored MDP and value function representations induce communication and coordination.
Tradeoff Communication / Accuracy
How good are the policies?
SysAdmin problem
Power grid problem [Schneider et al. ‘99]
SysAdmin Ring – Quality of Policies
(Chart: value per machine vs. number of machines, up to 10. Series: utopic maximum value; exact solution; factored LP, single basis; constraint sampling, single basis; constraint sampling, pair basis.)
Power Grid – Factored Multiagent
[Guestrin, Lagoudakis, Parr ‘02]
Lower is better!
(Chart: cost on grids A, B, C, D. Methods: DR [Schneider+al ’99]; DVF [Schneider+al ’99]; Factored Multiagent, no communication; Factored Multiagent, pairwise communication.)
Summary of Algorithm
1. Pick local basis functions h_i
2. Factored LP computes the value function
3. Coordination graph computes argmax_a of Q
Planning Complex Environments
When faced with a complex problem, exploit structure:
- For planning
- For action selection
Given a new problem, we must replan from scratch: a different MDP is a new planning problem, and huge problems are intractable even with the factored LP.
Generalizing to New Problems
Solve Problem 1, Problem 2, …, Problem n → good solution to Problem n+1
MDPs are different! Different sets of states, actions, rewards, transitions, …
But many problems are “similar”.
Generalization with Relational MDPs
[Guestrin, Koller, Gearhart, Kanodia ’03]
“Similar” domains have similar “types” of objects.
Exploit similarities by computing generalizable value functions: Relational MDP → Generalization
- Avoid the need to replan
- Tackle larger problems
Relational Models and MDPs
- Classes: Peasant, Gold, Wood, Barracks, Footman, Enemy, …
- Relations: Collects, Builds, Trains, Attacks, …
- Instances: Peasant1, Peasant2, Footman1, Enemy1, …
Relational MDPs
Class-level transition probabilities depend on: attributes, actions, and the attributes of related objects.
Class-level reward function.
(Figure: DBN fragment for the Peasant class: P → P′ with action A_P, and the related Gold object G → G′, linked by the Collects relation.)
Very compact representation! Does not depend on the # of objects.
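A toy sketch of a class-level template: the transition model is written once per class and reused for every instance. The class, attribute, and action names are hypothetical:

```python
# Class-level transition template, defined once per class and shared by
# every instance, so the model size does not depend on the # of objects.
class PeasantClass:
    @staticmethod
    def p_next(attr, action, related):
        # P(P' | P, A_P, attributes of related objects), e.g. the gold mine.
        if action == 'collect' and related['gold'] > 0:
            return {'carrying': 0.9, 'idle': 0.1}
        return {'carrying': 0.0, 'idle': 1.0}

# Instantiating a world just binds objects to the same class template.
peasants = [('Peasant%d' % i, PeasantClass) for i in (1, 2)]
for name, cls in peasants:
    print(name, cls.p_next(attr='idle', action='collect',
                           related={'gold': 100}))
```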
Tactical Freecraft: Relational Schema
(Schema: Enemy class with Health → H′ and reward R; Footman class with Health → H′, action A_Footman, and a my_enemy link; a Count of attackers.)
- Enemy’s health depends on the # of footmen attacking
- Footman’s health depends on his enemy’s health
World is a Large Factored MDP
Instantiation (world): the # of instances of each class, plus the links between instances → a well-defined factored MDP.
Relational MDP + # of objects + links between objects ⇒ Factored MDP
World with 2 Footmen and 2 Enemies
(Figure: the induced DBN, with variables F1.Health → F1.H′, F2.Health → F2.H′, E1.Health → E1.H′, E2.Health → E2.H′, actions F1.A and F2.A, and rewards R1, R2, for the pairs Footman1/Enemy1 and Footman2/Enemy2.)
World is a Large Factored MDP
Instantiate the world → a well-defined factored MDP → use the factored LP for planning.
But we have gained nothing! Each new world is a new factored MDP that must be solved from scratch.
Class-level Value Functions
V(F1.H, E1.H, F2.H, E2.H) = V_F1(F1.H, E1.H) + V_E1(E1.H) + V_F2(F2.H, E2.H) + V_E2(E2.H)
Units are interchangeable! V_F1 = V_F2 = V_F, and V_E1 = V_E2 = V_E.
At state x, each footman still makes a different contribution to V.
Given the class-level components V_C, we can instantiate the value function for any world.
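A sketch of instantiating a value function for an arbitrary world from class-level components; the V_F and V_E tables below are hypothetical:

```python
# Given class-level components V_F(F.H, E.H) and V_E(E.H), the value
# function for ANY world is assembled by summing one term per object.
# Health values are in {0, 1, 2}; tables are hypothetical.
V_F = {(fh, eh): 2.0 * fh - eh for fh in range(3) for eh in range(3)}
V_E = {eh: -1.5 * eh for eh in range(3)}

def instantiate_value(footmen, enemies, x):
    """footmen: list of (footman, its enemy); x: health of each unit."""
    v = sum(V_F[x[f], x[e]] for f, e in footmen)   # one V_F per footman
    v += sum(V_E[x[e]] for e in enemies)           # one V_E per enemy
    return v

# Works unchanged for a 2-vs-2 world ...
x = {'F1': 2, 'F2': 1, 'E1': 0, 'E2': 2}
print(instantiate_value([('F1', 'E1'), ('F2', 'E2')], ['E1', 'E2'], x))
# ... and for 3-vs-3 or larger worlds, with no replanning.
```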
Computing Class-level V_C
minimize: Σ_x Σ_C Σ_{o∈C} V_C(x[o])
subject to: Σ_C Σ_{o∈C} V_C(x[o]) ≥ Σ_C Σ_{o∈C} Q_C(x[o], a[o]), for all worlds ω, states x, actions a
The constraints for each world are represented by a factored LP, but the number of worlds is exponential or infinite.
Sampling Worlds
Many worlds are similar → sample a set I of worlds and enforce the constraints only for ω ∈ I, x, a instead of all ω, x, a.
Sampling Theorem: despite exponentially (infinitely) many worlds, we do NOT need exponentially many samples. With enough sampled worlds (a number that depends on R_max, the maximum class reward, but not on the total number of worlds), the value function is within ε of the class-level solution optimized for all worlds, with probability at least 1−δ. Proof method related to [de Farias, Van Roy ‘02].
Learning Classes of Objects
Plan for sampled worlds separately → find regularities between worlds → objects with similar values belong to the same class.
(Charts: per-object value functions V1, V2 from sampled network topologies, bucketed by machine status Good / Faulty / Dead; objects with similar value bars are grouped.)
Used decision tree regression in experiments.
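A stand-in sketch for this class-discovery step: objects whose learned per-status values are close get grouped into one class. The experiments used decision-tree regression; this threshold grouping and the numbers are hypothetical simplifications:

```python
# Group objects whose learned values look alike across sampled worlds.
values = {
    'machine1': [50, 20, 0],   # value when Good / Faulty / Dead
    'machine2': [48, 22, 1],
    'machine3': [10, 4, 0],
}

def same_class(u, v, tol=5.0):
    # Objects belong together if all per-status values are within tol.
    return all(abs(a - b) <= tol for a, b in zip(values[u], values[v]))

classes = []
for obj in values:
    for cls in classes:
        if same_class(obj, cls[0]):
            cls.append(obj)
            break
    else:
        classes.append([obj])
print(classes)  # [['machine1', 'machine2'], ['machine3']]
```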
Summary of Algorithm
1. Model the domain as a Relational MDP
2. Sample a set of worlds
3. Factored LP computes a class-level value function for the sampled worlds
4. Reuse the class-level value function in a new world
5. Coordination graph computes argmax_a of Q
Experimental Results
SysAdmin problem
Generalizing to New Problems
(Chart: estimated policy value per agent for Ring, Star, and Three-legs topologies, roughly in the 3–4.6 range. Series: utopic maximum value; object-based value with complete replanning; class-based value function with no replanning.)
Learning Classes of Objects
(Chart: max-norm error of the value function for Ring, Star, and Three-legs topologies, roughly in the 0–1.4 range, comparing no class learning vs. learnt classes.)
Classes of Objects Discovered
Learned 3 classes: Server, Intermediate, Leaf.
(Figure: a network with one server, several intermediate nodes, and leaves.)
Strategic
World: 2 Peasants, 2 Footmen, 1 Enemy, Gold, Wood, Barracks. Reward for a dead enemy. About 1 million state/action pairs.
Algorithm: solve with the factored LP; coordination graph for action selection.
Strategic
World: 9 Peasants, 3 Footmen, 1 Enemy, Gold, Wood, Barracks. Reward for a dead enemy. About 3 trillion state/action pairs.
Algorithm: solve with the factored LP; coordination graph for action selection.
But the planning problem grows exponentially in the # of agents!
Strategic
World: 9 Peasants, 3 Footmen, 1 Enemy, Gold, Wood, Barracks. Reward for a dead enemy. About 3 trillion state/action pairs.
Algorithm: use the generalized class-based value function; coordination graph for action selection.
The instantiated Q-functions grow only polynomially in the # of agents.
Tactical
Planned in 3 Footmen versus 3 Enemies; generalized to 4 Footmen versus 4 Enemies (3 vs. 3 → 4 vs. 4).
Contributions
- Efficient planning with LP decomposition [Guestrin, Koller, Parr ’01]
- Multiagent action selection [Guestrin, Koller, Parr ’02]
- Generalization to new environments [Guestrin, Koller, Gearhart, Kanodia ’03]
- Variable coordination structure [Guestrin, Venkataraman, Koller ’02]
- Multiagent reinforcement learning [Guestrin, Lagoudakis, Parr ’02] [Guestrin, Patrascu, Schuurmans ’02]
- Hierarchical decomposition [Guestrin, Gordon ’02]
Open Issues
High tree-width problems
Basis function selection
Variable relational structure
Partial observability
Acknowledgments
Daphne Koller
Committee: Leslie Kaelbling, Yoav Shoham, Claire Tomlin, Ben Van Roy
Co-authors: M.S. Apaydin, D. Brutlag, F. Cozman, C. Gearhart, G. Gordon, D. Hsu, N. Kanodia, D. Koller, E. Krotkov, M. Lagoudakis, J.C. Latombe, D. Ormoneit, R. Parr, R. Patrascu, D. Schuurmans, C. Varma, S. Venkataraman
DAGS members, Kristina and friends, my family
Conclusions
Complex multiagent planning task. Exploit structure:
- In the planning problem – Factored LP
- In action selection – Coordination graph
- Between problems – Generalization
A formal framework for multiagent planning that scales to very large problems:
132207081948080663689045525975214436596542203275214816766492036822682859734670489954077831385060806196390977769687258235595095458210061891186534272525795367402762022519832080387801477422896484127439040011758861804112894781562309443806156617305408667449050617812548034440554705439703889581746536825491613622083026856377858229022846398307887896918556404084898937609373242171846359938695516765018940588109060426089671438864102814350385648747165832010614366132173102768902855220001 states
Multiagent Policy Quality
Comparing to the Distributed Reward (DR) and Distributed Value Function (DVF) algorithms [Schneider et al. ‘99].
(Chart: estimated value per agent vs. number of agents, 2–16. Series: utopic maximum value; distributed reward; distributed value; LP single basis; LP pair basis.)
Comparing to Apricodd [Boutilier et al.]
Apricodd exploits context-specific independence (CSI); the factored LP (rule-based) exploits CSI and linear independence.
(Chart: time in seconds vs. number of variables, 6–20, Apricodd vs. rule-based. Apricodd trend: y = 0.1473x³ − 0.8595x² + 2.5006x − 1.5964, R² = 0.9997; rule-based trend: y = 0.0254x² + 0.0363x + 0.0725, R² = 0.9983.)
(Chart: time in seconds vs. number of variables, 6–12, Apricodd vs. rule-based. Apricodd trend: y = 5.275x³ − 29.95x² + 53.915x − 28.83, R² = 1.)
(Chart: running time in minutes vs. number of machines, 0–12, rule-based LP vs. Apricodd; Apricodd follows an exponential trend, R² = 0.9999.)
(Chart: discounted value of policy, averaged over 50 runs of 100 steps, vs. number of machines, 0–12, rule-based LP vs. Apricodd.)
(Charts for the Ring and Star topologies: running time in minutes, and discounted value of policy averaged over 50 runs of 100 steps, vs. number of machines, 0–12, rule-based LP vs. Apricodd.)