Multiagent Planning with Factored MDPs
Carlos Guestrin
Stanford University
Collaborative Multiagent Planning
Search and rescue, factory management, supply chain, firefighting, network routing, air traffic control
Long-term goals + multiple agents + coordinated decisions = collaborative multiagent planning
Exploiting Structure
Real-world problems have:
- Hundreds of objects
- Googols of states
Real-world problems have structure!
Approach: exploit the structured representation to obtain an efficient approximate solution
Real-time Strategy Game
- Peasants collect resources and build
- Footmen attack enemies
- Buildings train peasants and footmen
Joint Decision Space
Markov Decision Process (MDP) representation:
- State space: joint state x of the entire system
- Action space: joint action a = {a1,…, an} for all agents
- Reward function: total reward R(x,a)
- Transition model: dynamics of the entire system P(x′|x,a)
Policy
Policy: π(x) = a. At state x, take action a for all agents.
π(x0) = both peasants get wood
π(x1) = one peasant gets gold, the other builds a barracks
π(x2) = peasants get gold, footmen attack
Value of Policy
Value Vπ(x): expected long-term reward starting from x.
Starting from x0:
Vπ(x0) = E[R(x0) + γ R(x1) + γ² R(x2) + γ³ R(x3) + γ⁴ R(x4) + ⋯]
Future rewards discounted by γ ∈ [0,1)
(Figure: a sample trajectory x0 → x1 → x2 → x3 → x4 under π, with rewards R(xt) and alternative outcomes x1′, x1″.)
Optimal Long-term Plan
Optimal policy: π*(x)
Optimal value function: V*(x)
Bellman equations:
Q*(x,a) = R(x,a) + γ Σ_{x′} P(x′|x,a) V*(x′)
V*(x) = max_a Q*(x,a)
Optimal policy: π*(x) = argmax_a Q*(x,a)
Solving an MDP
Solve the Bellman equations → optimal value V*(x) → optimal policy π*(x)
Many algorithms solve the Bellman equations:
- Policy iteration [Howard ’60, Bellman ‘57]
- Value iteration [Bellman ‘57]
- Linear programming [Manne ’60]
- …
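To make these solvers concrete, here is a minimal value-iteration sketch for an explicit (unfactored) MDP. The two-state, two-action model, rewards, and discount below are hypothetical, invented purely for illustration:

```python
import numpy as np

# Hypothetical explicit MDP: 2 states, 2 actions.
# P[a][x][x'] = P(x'|x,a); R[x][a] = immediate reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # action 0
              [[0.5, 0.5], [0.1, 0.9]]])  # action 1
R = np.array([[1.0, 0.0], [0.0, 2.0]])    # R[x, a]
gamma = 0.9

V = np.zeros(2)
while True:
    # Q(x,a) = R(x,a) + gamma * sum_x' P(x'|x,a) V(x')
    Q = R + gamma * np.einsum('axy,y->xa', P, V)
    V_new = Q.max(axis=1)                  # Bellman backup
    if np.abs(V_new - V).max() < 1e-8:
        break
    V = V_new

policy = Q.argmax(axis=1)                  # pi*(x) = argmax_a Q(x,a)
print(V, policy)
```

Policy iteration and the LP formulation below compute the same fixed point; value iteration is simply the shortest to sketch.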
LP Solution to MDP
Value computed by linear programming [Manne ’60]:
minimize: Σ_x V(x)
subject to: V(x) ≥ Q(x,a), for all x, a
- One variable V(x) for each state
- One constraint for each state x and action a
- Polynomial time solution
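A sketch of this LP on the same hypothetical two-state MDP, using scipy.optimize.linprog (assuming SciPy is available; the slide itself commits to no particular solver):

```python
import numpy as np
from scipy.optimize import linprog

# Same hypothetical 2-state, 2-action MDP as above.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0], [0.0, 2.0]])
gamma, n_states, n_actions = 0.9, 2, 2

# minimize sum_x V(x)  subject to  V(x) >= R(x,a) + gamma * P(.|x,a) . V
# Rewritten for linprog's A_ub @ v <= b_ub form:
#   (gamma * P(.|x,a) - e_x) . V <= -R(x,a)
A_ub, b_ub = [], []
for x in range(n_states):
    for a in range(n_actions):
        A_ub.append(gamma * P[a, x] - np.eye(n_states)[x])
        b_ub.append(-R[x, a])

res = linprog(c=np.ones(n_states), A_ub=A_ub, b_ub=b_ub,
              bounds=[(None, None)] * n_states)
print(res.x)  # optimal value function V*(x)
```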
Planning under Bellman’s “Curse”
Planning is polynomial in #states and #actions, but:
- #states is exponential in the number of variables
- #actions is exponential in the number of agents
Efficient approximation by exploiting structure!
Structure in Representation: Factored MDP
[Boutilier et al. ’95]
State, dynamics, decisions, and rewards are represented as a dynamic Bayesian network over time slices t and t+1: state variables Peasant, Footman, Enemy, Gold (P′, F′, E′, G′ at t+1), actions A_Peasant, A_Build, A_Footman, and reward R.
Example factor: P(F′ | F, G, A_Build, A_Footman)
Complexity of representation: exponential in #parents (worst case)
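A minimal sketch of one such factored transition component. The attribute values, action names, and probabilities are hypothetical; the point is that the factor's size depends only on its parents, not on the full joint state:

```python
import itertools

# A single DBN factor: P(F' | F, G, A_Build, A_Footman).
# Toy rule: the footman is more likely to be alive next step if gold is
# high and his action is 'defend'; a_build is a parent but is ignored here.
def p_footman_next(f_next, f, g, a_build, a_footman):
    p_alive = 0.5 + 0.2 * g + 0.2 * (a_footman == 'defend') - 0.3 * (f == 0)
    p_alive = min(max(p_alive, 0.0), 1.0)
    return p_alive if f_next == 1 else 1.0 - p_alive

# Sanity check: the factor is a proper conditional distribution for every
# assignment of its parents (exponential only in #parents).
for f, g, ab, af in itertools.product([0, 1], [0, 1],
                                      ['train', 'idle'],
                                      ['attack', 'defend']):
    total = sum(p_footman_next(fn, f, g, ab, af) for fn in [0, 1])
    assert abs(total - 1.0) < 1e-9
```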
Structured Value Function? Factored MDP ⇒ Structure in V*?
(Figure: the factored MDP unrolled over time slices t, t+1, t+2, t+3, with variables X, Y, Z and reward R in each slice; dependencies spread across slices, so V* is not exactly factored.)
Almost! A structured V yields a good approximate value function.
Structured Value Functions
Linear combination of restricted-domain functions [Bellman et al. ‘63] [Tsitsiklis & Van Roy ’96] [Koller & Parr ’99,’00] [Guestrin et al. ’01]:
Ṽ(x) = Σ_i w_i h_i(x)
Each h_i is the status of small part(s) of a complex system:
- State of footman and enemy
- Status of barracks
- Status of barracks and state of footman
Structured V ⇒ structured Q: Q̃ = Σ_i Q_i, where each Q_i depends on a small # of A_i’s and X_j’s
Must find w giving a good approximate value function
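A small sketch of a restricted-domain basis in code. The basis functions, state attributes, and weights are hypothetical stand-ins:

```python
# Linear value function over restricted-domain basis functions:
# each h_i looks only at a small part of the joint state x.
def h_footman(x):          # depends only on footman and enemy health
    return 1.0 if x['F.health'] > x['E.health'] else 0.0

def h_barracks(x):         # depends only on barracks status
    return 1.0 if x['barracks'] == 'built' else 0.0

basis = [h_footman, h_barracks]
w = [2.5, 1.0]             # weights, as would be found by the LP

def v_approx(x):
    # V~(x) = sum_i w_i h_i(x)
    return sum(wi * hi(x) for wi, hi in zip(w, basis))

x = {'F.health': 3, 'E.health': 1, 'barracks': 'built'}
print(v_approx(x))  # 3.5
```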
Approximate LP Solution
[Schweitzer and Seidmann ‘85]
minimize: Σ_x Σ_i w_i h_i(x)
subject to: Σ_i w_i h_i(x) ≥ Q(x,a), for all x, a
- One variable w_i for each basis function → polynomial number of LP variables
- One constraint for every state and action → exponentially many LP constraints
Representing Exponentially Many Constraints
[Guestrin, Koller, Parr ’01]
Σ_i w_i h_i(x) ≥ Q(x,a), for all x, a
⇔ 0 ≥ max_{x,a} [Q(x,a) − Σ_i w_i h_i(x)]
Exponentially many linear constraints = one nonlinear constraint, but now the maximization ranges over an exponential space.
Variable Elimination
Use variable elimination to maximize over the state space [Bertele & Brioschi ‘72]:
max_{A,B,C,D} f1(A,B) + f2(A,C) + f3(C,D) + f4(B,D)
= max_{A,B,C} f1(A,B) + f2(A,C) + max_D [f3(C,D) + f4(B,D)]
= max_{A,B,C} f1(A,B) + f2(A,C) + g1(B,C)
Here we need only 23, instead of 63, sum operations.
Maximization is only exponential in the largest factor; tree-width characterizes the complexity:
- Graph-theoretic measure of “connectedness”
- Arises in many settings: integer programming, Bayes nets, computational geometry, …
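The elimination above, written out for binary variables with hypothetical payoff tables; a brute-force check confirms that eliminating D first gives the same maximum:

```python
import itertools

# max over (A,B,C,D) of f1(A,B)+f2(A,C)+f3(C,D)+f4(B,D), variables binary.
f1 = {(a, b): a + 2 * b for a in (0, 1) for b in (0, 1)}
f2 = {(a, c): 3 * a * c for a in (0, 1) for c in (0, 1)}
f3 = {(c, d): c - d for c in (0, 1) for d in (0, 1)}
f4 = {(b, d): 2 * b * d for b in (0, 1) for d in (0, 1)}

# Eliminate D: g1(B,C) = max_D [f3(C,D) + f4(B,D)]
g1 = {(b, c): max(f3[c, d] + f4[b, d] for d in (0, 1))
      for b in (0, 1) for c in (0, 1)}

# Remaining maximization over A, B, C only.
best = max(f1[a, b] + f2[a, c] + g1[b, c]
           for a, b, c in itertools.product((0, 1), repeat=3))

# Brute force over all 16 joint assignments agrees.
brute = max(f1[a, b] + f2[a, c] + f3[c, d] + f4[b, d]
            for a, b, c, d in itertools.product((0, 1), repeat=4))
assert best == brute
print(best)
```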
Structured Value Function
max over A_1,…,A_m and X_1,…,X_n of Σ_i Q_i − Σ_i w_i h_i(x)
Each Q_i depends on a small # of A_i’s and X_j’s; each h_i depends on a small # of X_j’s, so variable elimination applies directly to this joint state-action maximization.
Representing the Constraints
Use variable elimination to represent the constraints:
0 ≥ max_{A,B,C,D} f1(A,B) + f2(A,C) + f3(C,D) + f4(B,D)
becomes, with a new LP variable g1(B,C):
g1(B,C) ≥ f3(C,D) + f4(B,D), for all B, C, D
0 ≥ max_{A,B,C} [f1(A,B) + f2(A,C) + g1(B,C)]
Number of constraints exponentially smaller!
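A back-of-the-envelope sketch of the resulting constraint counts, for a hypothetical chain of pairwise functions over binary variables, the simplest instance of the (n+1−k)·2^k versus 2^n scaling shown on the next slide:

```python
# Constraint counts for a chain f1(X1,X2) + f2(X2,X3) + ... + f_{n-1}(X_{n-1},X_n)
# over binary variables. Explicit LP: one constraint per joint assignment.
# Factored LP: eliminating X_n, ..., X_2 in order introduces one new
# function g_i(X_i) per step, with constraints
#   g_i(x_i) >= f_i(x_i, x_{i+1}) + g_{i+1}(x_{i+1})
# for each of the 4 parent assignments, plus 2 final constraints 0 >= g_1(x_1).
def explicit_constraints(n):
    return 2 ** n

def factored_constraints(n):
    return 4 * (n - 1) + 2

for n in (4, 8, 16, 32):
    print(n, explicit_constraints(n), factored_constraints(n))
# The explicit count explodes exponentially; the factored count is linear in n.
```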
Understanding Scaling Properties
Number of LP constraints: explicit LP has 2^n; factored LP has (n+1−k)·2^k, where k = tree-width.
(Chart: number of constraints vs. number of variables, 2–16. The explicit LP curve grows exponentially toward 40,000; the factored LP curves for k = 3, 5, 8, 10, 12 grow slowly.)
Network Management Problem
Topologies: ring, star, ring of rings, k-grid
- Computer status = {good, dead, faulty}
- Dead neighbors increase the probability of dying
- Each computer runs processes; reward for successful processes
- Each SysAdmin takes a local action = {reboot, not reboot}
A problem with n machines has 9^n states and 2^n actions.
Running Time
(Chart: running time in seconds vs. number of machines, up to 12. Series: Ring, exact solution; Ring, single basis, k = 4; Star, single basis, k = 4; 3-grid, single basis, k = 5; Star, pair basis, k = 4; Ring, pair basis, k = 8. k = tree-width. The exact solution is feasible only for a handful of machines; the factored LP scales much further.)
Summary of Algorithm
1. Pick local basis functions h_i
2. Factored LP computes the value function
3. Policy is argmax_a of Q
Large-scale Multiagent Coordination
Efficient algorithm computes V. The action at state x is: argmax_a Q(x,a)
But: the # of actions is exponential, and this assumes complete observability and full communication.
Distributed Q Function
[Guestrin, Koller, Parr ’02]
Q(A1,…,A4, X1,…,X4) ≈ Q1(A1,A4, X1,X4) + Q2(A1,A2, X1,X2) + Q3(A2,A3, X2,X3) + Q4(A3,A4, X3,X4)
Each agent maintains a part of the Q function: a distributed Q function.
Multiagent Action Selection
Distributed Q function: Q1(A1,A4, X1,X4), Q2(A1,A2, X1,X2), Q3(A2,A3, X2,X3), Q4(A3,A4, X3,X4)
→ Instantiate the current state x
→ Maximal action: argmax_a
Instantiate Current State x
Conditioning on the observed state turns each Qi into a function of actions only:
Q1(A1,A4), Q2(A1,A2), Q3(A2,A3), Q4(A3,A4)
Limited observability: agent i only observes the variables in Qi (e.g., agent 2 observes only X1 and X2).
Multiagent Action Selection
Distributed Q function → instantiate the current state x → maximal action:
argmax_a [Q1(A1,A4) + Q2(A1,A2) + Q3(A2,A3) + Q4(A3,A4)]
Coordination Graph
Use variable elimination for the maximization:
max_{A1,A2,A3,A4} Q1(A1,A4) + Q2(A1,A2) + Q3(A2,A3) + Q4(A3,A4)
= max_{A1,A2,A4} Q1(A1,A4) + Q2(A1,A2) + max_{A3} [Q3(A2,A3) + Q4(A3,A4)]
= max_{A1,A2,A4} Q1(A1,A4) + Q2(A1,A2) + g1(A2,A4)
Limited communication suffices for the optimal action choice: the communication bandwidth is the tree-width of the coordination graph.
Example conditional strategy for agent 3 (value of the optimal A3 action):

A2      A4      Value
Attack  Attack   5
Attack  Defend   6
Defend  Attack   8
Defend  Defend  12
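A sketch of how agent 3's conditional strategy and a table like the one above arise during elimination. The payoff tables here are hypothetical, not the ones behind the slide's numbers:

```python
# When agent 3 is eliminated, it conditions on its neighbors A2 and A4:
#   g1(a2, a4) = max_{a3} [Q3(a2, a3) + Q4(a3, a4)]
# and records the maximizing a3 as its conditional strategy.
ACTIONS = ('attack', 'defend')
Q3 = {(a2, a3): 3 * (a2 == a3)
      for a2 in ACTIONS for a3 in ACTIONS}
Q4 = {(a3, a4): 5 * (a3 == 'defend' and a4 == 'defend')
      for a3 in ACTIONS for a4 in ACTIONS}

g1, strategy = {}, {}
for a2 in ACTIONS:
    for a4 in ACTIONS:
        vals = {a3: Q3[a2, a3] + Q4[a3, a4] for a3 in ACTIONS}
        strategy[a2, a4] = max(vals, key=vals.get)  # agent 3's best response
        g1[a2, a4] = vals[strategy[a2, a4]]         # value passed up the graph

print(strategy['defend', 'defend'], g1['defend', 'defend'])
```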
Coordination Graph Example
(Figure: an 11-agent coordination graph over A1,…,A11.)
- Trees don’t increase communication requirements
- Cycles require graph triangulation
Unified View: Function Approximation ⇔ Multiagent Coordination
A pairwise basis, Q1(A1,A4, X1,X4) + Q2(A1,A2, X1,X2) + Q3(A2,A3, X2,X3) + Q4(A3,A4, X3,X4), induces a coordination graph with edges between neighboring agents.
A single-agent basis, Q1(A1,X1) + Q2(A2,X2) + Q3(A3,X3) + Q4(A4,X4), induces no coordination edges.
Factored MDP and value function representations induce communication and coordination.
Tradeoff Communication / Accuracy
How good are the policies?
SysAdmin problem
Power grid problem [Schneider et al. ‘99]
SysAdmin Ring – Quality of Policies
(Chart: value per machine vs. number of machines, up to 10. Series: utopic maximum value; exact solution; factored LP, single basis; constraint sampling, single basis; constraint sampling, pair basis.)
Power Grid – Factored Multiagent
[Guestrin, Lagoudakis, Parr ‘02]
Lower is better!
(Chart: cost on grids A, B, C, D. Methods: DR [Schneider+al ’99]; DVF [Schneider+al ’99]; Factored Multiagent, no communication; Factored Multiagent, pairwise communication.)
Summary of Algorithm
1. Pick local basis functions h_i
2. Factored LP computes the value function
3. Coordination graph computes argmax_a of Q
Planning Complex Environments
When faced with a complex problem, exploit structure:
- For planning
- For action selection
Given a new problem, we must replan from scratch: a different MDP is a new planning problem, and huge problems are intractable even with the factored LP.
Generalizing to New Problems
Solve Problem 1, Problem 2, …, Problem n → good solution to Problem n+1
MDPs are different! Different sets of states, actions, rewards, transitions, …
But many problems are “similar”.
Generalization with Relational MDPs
[Guestrin, Koller, Gearhart, Kanodia ’03]
“Similar” domains have similar “types” of objects.
Exploit similarities by computing generalizable value functions: Relational MDP → Generalization
- Avoid the need to replan
- Tackle larger problems
Relational Models and MDPs
- Classes: Peasant, Gold, Wood, Barracks, Footman, Enemy, …
- Relations: Collects, Builds, Trains, Attacks, …
- Instances: Peasant1, Peasant2, Footman1, Enemy1, …
Relational MDPs
Class-level transition probabilities depend on: attributes, actions, and the attributes of related objects.
Class-level reward function.
(Figure: DBN fragment for the Peasant class: P → P′ with action A_P, and the related Gold object G → G′, linked by the Collects relation.)
Very compact representation! Does not depend on the # of objects.
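A toy sketch of a class-level template: the transition model is written once per class and reused for every instance. The class, attribute, and action names are hypothetical:

```python
# Class-level transition template, defined once per class and shared by
# every instance, so the model size does not depend on the # of objects.
class PeasantClass:
    @staticmethod
    def p_next(attr, action, related):
        # P(P' | P, A_P, attributes of related objects), e.g. the gold mine.
        if action == 'collect' and related['gold'] > 0:
            return {'carrying': 0.9, 'idle': 0.1}
        return {'carrying': 0.0, 'idle': 1.0}

# Instantiating a world just binds objects to the same class template.
peasants = [('Peasant%d' % i, PeasantClass) for i in (1, 2)]
for name, cls in peasants:
    print(name, cls.p_next(attr='idle', action='collect',
                           related={'gold': 100}))
```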
Tactical Freecraft: Relational Schema
(Schema: Enemy class with Health → H′ and reward R; Footman class with Health → H′, action A_Footman, and a my_enemy link; a Count of attackers.)
- Enemy’s health depends on the # of footmen attacking
- Footman’s health depends on his enemy’s health
World is a Large Factored MDP
Instantiation (world): the # of instances of each class, plus the links between instances → a well-defined factored MDP.
Relational MDP + # of objects + links between objects ⇒ Factored MDP
World with 2 Footmen and 2 Enemies
(Figure: the induced DBN, with variables F1.Health → F1.H′, F2.Health → F2.H′, E1.Health → E1.H′, E2.Health → E2.H′, actions F1.A and F2.A, and rewards R1, R2, for the pairs Footman1/Enemy1 and Footman2/Enemy2.)
World is a Large Factored MDP
Instantiate the world → a well-defined factored MDP → use the factored LP for planning.
But we have gained nothing! Each new world is a new factored MDP that must be solved from scratch.
Class-level Value Functions
V(F1.H, E1.H, F2.H, E2.H) = V_F1(F1.H, E1.H) + V_E1(E1.H) + V_F2(F2.H, E2.H) + V_E2(E2.H)
Units are interchangeable! V_F1 = V_F2 = V_F, and V_E1 = V_E2 = V_E.
At state x, each footman still makes a different contribution to V.
Given the class-level components V_C, we can instantiate the value function for any world.
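A sketch of instantiating a value function for an arbitrary world from class-level components; the V_F and V_E tables below are hypothetical:

```python
# Given class-level components V_F(F.H, E.H) and V_E(E.H), the value
# function for ANY world is assembled by summing one term per object.
# Health values are in {0, 1, 2}; tables are hypothetical.
V_F = {(fh, eh): 2.0 * fh - eh for fh in range(3) for eh in range(3)}
V_E = {eh: -1.5 * eh for eh in range(3)}

def instantiate_value(footmen, enemies, x):
    """footmen: list of (footman, its enemy); x: health of each unit."""
    v = sum(V_F[x[f], x[e]] for f, e in footmen)   # one V_F per footman
    v += sum(V_E[x[e]] for e in enemies)           # one V_E per enemy
    return v

# Works unchanged for a 2-vs-2 world ...
x = {'F1': 2, 'F2': 1, 'E1': 0, 'E2': 2}
print(instantiate_value([('F1', 'E1'), ('F2', 'E2')], ['E1', 'E2'], x))
# ... and for 3-vs-3 or larger worlds, with no replanning.
```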
Computing Class-level V_C
minimize: Σ_x Σ_C Σ_{o∈C} V_C(x[o])
subject to: Σ_C Σ_{o∈C} V_C(x[o]) ≥ Σ_C Σ_{o∈C} Q_C(x[o], a[o]), for all worlds ω, states x, actions a
The constraints for each world are represented by a factored LP, but the number of worlds is exponential or infinite.
Sampling Worlds
Many worlds are similar → sample a set I of worlds and enforce the constraints only for ω ∈ I, x, a instead of all ω, x, a.
Sampling Theorem: despite exponentially (infinitely) many worlds, we do NOT need exponentially many samples. With enough sampled worlds (a number that depends on R_max, the maximum class reward, but not on the total number of worlds), the value function is within ε of the class-level solution optimized for all worlds, with probability at least 1−δ. Proof method related to [de Farias, Van Roy ‘02].
Learning Classes of Objects
Plan for sampled worlds separately → find regularities between worlds → objects with similar values belong to the same class.
(Charts: per-object value functions V1, V2 from sampled network topologies, bucketed by machine status Good / Faulty / Dead; objects with similar value bars are grouped.)
Used decision tree regression in experiments.
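A stand-in sketch for this class-discovery step: objects whose learned per-status values are close get grouped into one class. The experiments used decision-tree regression; this threshold grouping and the numbers are hypothetical simplifications:

```python
# Group objects whose learned values look alike across sampled worlds.
values = {
    'machine1': [50, 20, 0],   # value when Good / Faulty / Dead
    'machine2': [48, 22, 1],
    'machine3': [10, 4, 0],
}

def same_class(u, v, tol=5.0):
    # Objects belong together if all per-status values are within tol.
    return all(abs(a - b) <= tol for a, b in zip(values[u], values[v]))

classes = []
for obj in values:
    for cls in classes:
        if same_class(obj, cls[0]):
            cls.append(obj)
            break
    else:
        classes.append([obj])
print(classes)  # [['machine1', 'machine2'], ['machine3']]
```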
Summary of Algorithm
1. Model the domain as a Relational MDP
2. Sample a set of worlds
3. Factored LP computes a class-level value function for the sampled worlds
4. Reuse the class-level value function in a new world
5. Coordination graph computes argmax_a of Q
Experimental Results
SysAdmin problem
Generalizing to New Problems
(Chart: estimated policy value per agent for Ring, Star, and Three-legs topologies, roughly in the 3–4.6 range. Series: utopic maximum value; object-based value with complete replanning; class-based value function with no replanning.)
Learning Classes of Objects
(Chart: max-norm error of the value function for Ring, Star, and Three-legs topologies, roughly in the 0–1.4 range, comparing no class learning vs. learnt classes.)
Classes of Objects Discovered
Learned 3 classes: Server, Intermediate, Leaf.
(Figure: a network with one server, several intermediate nodes, and leaves.)
Strategic
World: 2 Peasants, 2 Footmen, 1 Enemy, Gold, Wood, Barracks. Reward for a dead enemy. About 1 million state/action pairs.
Algorithm: solve with the factored LP; coordination graph for action selection.
Strategic
World: 9 Peasants, 3 Footmen, 1 Enemy, Gold, Wood, Barracks. Reward for a dead enemy. About 3 trillion state/action pairs.
Algorithm: solve with the factored LP; coordination graph for action selection.
But the planning problem grows exponentially in the # of agents!
Strategic
World: 9 Peasants, 3 Footmen, 1 Enemy, Gold, Wood, Barracks. Reward for a dead enemy. About 3 trillion state/action pairs.
Algorithm: use the generalized class-based value function; coordination graph for action selection.
The instantiated Q-functions grow only polynomially in the # of agents.
Tactical
Planned in 3 Footmen versus 3 Enemies; generalized to 4 Footmen versus 4 Enemies (3 vs. 3 → 4 vs. 4).
Contributions
- Efficient planning with LP decomposition [Guestrin, Koller, Parr ’01]
- Multiagent action selection [Guestrin, Koller, Parr ’02]
- Generalization to new environments [Guestrin, Koller, Gearhart, Kanodia ’03]
- Variable coordination structure [Guestrin, Venkataraman, Koller ’02]
- Multiagent reinforcement learning [Guestrin, Lagoudakis, Parr ’02] [Guestrin, Patrascu, Schuurmans ’02]
- Hierarchical decomposition [Guestrin, Gordon ’02]
Open Issues
High tree-width problems
Basis function selection
Variable relational structure
Partial observability
Acknowledgments
Daphne Koller
Committee: Leslie Kaelbling, Yoav Shoham, Claire Tomlin, Ben Van Roy
Co-authors: M.S. Apaydin, D. Brutlag, F. Cozman, C. Gearhart, G. Gordon, D. Hsu, N. Kanodia, D. Koller, E. Krotkov, M. Lagoudakis, J.C. Latombe, D. Ormoneit, R. Parr, R. Patrascu, D. Schuurmans, C. Varma, S. Venkataraman
DAGS members, Kristina and friends, my family
Conclusions
Complex multiagent planning task. Exploit structure:
- In the planning problem – Factored LP
- In action selection – Coordination graph
- Between problems – Generalization
A formal framework for multiagent planning that scales to very large problems:
132207081948080663689045525975214436596542203275214816766492036822682859734670489954077831385060806196390977769687258235595095458210061891186534272525795367402762022519832080387801477422896484127439040011758861804112894781562309443806156617305408667449050617812548034440554705439703889581746536825491613622083026856377858229022846398307887896918556404084898937609373242171846359938695516765018940588109060426089671438864102814350385648747165832010614366132173102768902855220001 states
Multiagent Policy Quality
Comparing to the Distributed Reward (DR) and Distributed Value Function (DVF) algorithms [Schneider et al. ‘99].
(Chart: estimated value per agent vs. number of agents, 2–16. Series: utopic maximum value; distributed reward; distributed value; LP single basis; LP pair basis.)
Comparing to Apricodd [Boutilier et al.]
Apricodd exploits context-specific independence (CSI); the factored LP (rule-based) exploits CSI and linear independence.
(Chart: time in seconds vs. number of variables, 6–20, Apricodd vs. rule-based. Apricodd trend: y = 0.1473x³ − 0.8595x² + 2.5006x − 1.5964, R² = 0.9997; rule-based trend: y = 0.0254x² + 0.0363x + 0.0725, R² = 0.9983.)
(Chart: time in seconds vs. number of variables, 6–12, Apricodd vs. rule-based. Apricodd trend: y = 5.275x³ − 29.95x² + 53.915x − 28.83, R² = 1.)
(Chart: running time in minutes vs. number of machines, 0–12, rule-based LP vs. Apricodd; Apricodd follows an exponential trend, R² = 0.9999.)
(Chart: discounted value of policy, averaged over 50 runs of 100 steps, vs. number of machines, 0–12, rule-based LP vs. Apricodd.)
(Charts for the Ring and Star topologies: running time in minutes, and discounted value of policy averaged over 50 runs of 100 steps, vs. number of machines, 0–12, rule-based LP vs. Apricodd.)