Generalizing Plans to New Environments in Multiagent
Relational MDPs
Carlos Guestrin
Daphne Koller
Stanford University
Multiagent Coordination Examples
- Search and rescue
- Factory management
- Supply chain
- Firefighting
- Network routing
- Air traffic control

Common challenges:
- Multiple, simultaneous decisions
- Exponentially-large spaces
- Limited observability
- Limited communication
Real-time Strategy Game
- Peasants collect resources and build
- Footmen attack enemies
- Buildings train peasants and footmen
[Figure: Freecraft screenshot with a peasant, a footman, and a building labeled]
Scaling up by Generalization
- Exploit similarities between world elements
- Generalize plans: from a set of worlds to a new, unseen world
- Avoid the need to replan
- Tackle larger problems

Approach: formalize the notion of "similar" elements; compute generalizable plans.
Relational Models and MDPs
- Classes: Peasant, Gold, Wood, Barracks, Footman, Enemy…
- Relations: Collects, Builds, Trains, Attacks…
- Instances: Peasant1, Peasant2, Footman1, Enemy1…
- Value functions defined at the class level: objects of the same class make the same contribution to the value function
- Factored MDP equivalents of PRMs [Koller, Pfeffer '98]
Relational MDPs
- Class-level transition probabilities depend on: attributes, actions, and attributes of related objects
- Class-level reward function
- An instantiation (world) fixes the number of objects and their relations
- Each instantiation yields a well-defined MDP
[Figure: class-level DBN fragment for Peasant (P to P', with action AP) and Gold (G to G'), linked by the Collects relation]
Planning in a World
- Long-term planning by solving the MDP: both the number of states and the number of actions are exponential in the number of objects
- But an RMDP world is a factored MDP, so efficient approximation is possible by exploiting structure!
Roadmap to Generalization
Solve 1 world
Compute generalizable value function
Tackle a new world
World is a Factored MDP
[Figure: DBN for one world. State variables P (peasant), F (footman), E (enemy), G (gold), and health H transition to P', F', E', G'; the decisions are AP and AF; R is the reward node. Example factored transition: P(F' | F, G, H, AF).]
Long-term Utility = Value of MDP
Value computed by linear programming:
minimize: Σx V(x)
subject to: V(x) ≥ Q(x,a), ∀ x, a

- One variable V(x) for each state x
- One constraint for each state x and action a
- Number of states and actions exponential! [Manne '60]
Approximate Value Functions
- Linear combination of restricted-domain functions [Bellman et al. '63] [Tsitsiklis & Van Roy '96] [Koller & Parr '99,'00] [Guestrin et al. '01]:

V(x) ≈ Σo Vo(x)

- Each Vo depends only on the state of one object and its related objects (e.g., the state of a footman and the status of the barracks)
- Must find Vo giving a good approximate value function
Single LP Solution for Factored MDPs
Approximate LP [Schweitzer and Seidmann '85]:

minimize: Σx α(x) Σo Vo(x)
subject to: Σo Vo(x) ≥ Σo Qo(x,a), ∀ x, a

- Variables: the parameters of each Vo, one set per object, so only polynomially many LP variables
- One constraint for every state and action: exponentially many LP constraints
- But Vo and Qo depend on small sets of variables/actions, so structure can be exploited as in variable elimination [Guestrin, Koller, Parr '01]
Representing Exponentially Many Constraints
Exponentially many linear constraints

0 ≥ Σo Qo(x,a) − Σo Vo(x), ∀ x, a

are equivalent to a single nonlinear constraint:

0 ≥ max_{x,a} [Σo Qo(x,a) − Σo Vo(x)]
Can use variable elimination to maximize over state space: [Bertele & Brioschi ‘72]
Variable Elimination
[Figure: elimination graph over variables A, B, C, D with pairwise factors f1(A,B), f2(A,C), f3(C,D), f4(B,D)]

max_{A,B,C,D} [f1(A,B) + f2(A,C) + f3(C,D) + f4(B,D)]
= max_{A,B,C} [f1(A,B) + f2(A,C) + max_D [f3(C,D) + f4(B,D)]]
= max_{A,B,C} [f1(A,B) + f2(A,C) + g1(B,C)]

- As in Bayes nets, maximization is exponential in the tree-width
- Here we need only 23 sum operations instead of 63
Representing the Constraints
Functions are factored, use Variable Elimination to represent constraints:
0 ≥ max_{x,a} [Σo Qo(x,a) − Σo Vo(x)]

On the example, this is:

0 ≥ max_{A,B,C} [f1(A,B) + f2(A,C) + max_D [f3(C,D) + f4(B,D)]]

Introduce new LP variables g1(B,C) for the inner maximization:

0 ≥ f1(A,B) + f2(A,C) + g1(B,C), ∀ A, B, C
g1(B,C) ≥ f3(C,D) + f4(B,D), ∀ B, C, D

The number of constraints is exponentially smaller (see the sketch below).
Roadmap to Generalization
Solve 1 world
Compute generalizable value function
Tackle a new world
Generalization
- Sample a set of worlds
- Solve one linear program for all these worlds to obtain class-level value functions
- When faced with a new problem: use the class value function, no re-planning needed
Worlds and RMDPs
Meta-level MDP: nature picks a world ω according to P(ω); each world then evolves as its own MDP.

Meta-level LP:

minimize: Σω P(ω) Vω(x0)
subject to: Vω(x) ≥ Qω(x,a), ∀ ω, x, a
Class-level Value Functions
- Approximate solution to the meta-level MDP
- Linear approximation
- Value function defined at the class level
- All instances use the same local value function
Class-level LP
minimize: Σω P(ω) Σc Σ_{o∈C} Vc(x0[o])
subject to: Σc Σ_{o∈C} Vc(x[o]) ≥ Σc Σ_{o∈C} Qc(x[o],a), ∀ ω, x, a

- Constraints for each world are represented by a factored LP
- Number of worlds is exponential or infinite: sample worlds from P(ω)
Theorem
Exponentially (infinitely) many worlds! Do we need exponentially many samples? NO!

A number of sampled worlds polynomial in Rmax, 1/ε, and ln(1/δ) suffices: the resulting value function is within ε, with probability at least 1−δ.

- Rmax is the maximum class reward
- Proof method related to [de Farias, Van Roy '02]
LP with sampled worlds
minimize: Σ_{ω∈I} Σc Σ_{o∈C} Vc(x0[o])
subject to: Σc Σ_{o∈C} Vc(x[o]) ≥ Σc Σ_{o∈C} Qc(x[o],a), ∀ ω ∈ I, x, a

- Solve the LP for the sampled worlds I, using the factored LP for each world
- Obtain a class-level value function
- New world: instantiate the value function and act (see the sketch below)
Learning Classes of Objects
Which classes of objects have the same value function?
- Plan for sampled worlds individually
- Use the resulting value functions as "training data"
- Find objects with similar values; include features of the world
- Experiments used decision tree regression (sketch below)
Summary of Generalization Algorithm
1. Model domain as Relational MDPs
2. Pick local object value functions Vo
3. Learn classes by solving some instances
4. Sample set of worlds
5. Factored LP computes a class-level value function
A New World

- When faced with a new world ω, the value function is Vω(x) = Σc Σ_{o∈C} Vc(x[o]); the Q function decomposes the same way
- At each state, choose the action maximizing Q(x,a)
- The number of actions is exponential!
- But each Qc depends only on a few objects, for example:

Q(A1,…,A4, X1,…,X4) ≈ Q1(A1,A4,X1,X4) + Q2(A1,A2,X1,X2) + Q3(A2,A3,X2,X3) + Q4(A3,A4,X3,X4)
Local Q function Approximation
[Figure: four agents M1 through M4; the local term Q3 of Q(A1,…,A4, X1,…,X4) is associated with Agent 3]

- Each Qi is associated with one agent
- Limited observability: agent i only observes the variables in Qi (Agent 3 observes only X2 and X3)
- Must choose a joint action maximizing Σi Qi
Use variable elimination for maximization: [Bertele & Brioschi ‘72]
Maximizing Σi Qi: Coordination Graph

- Limited communication for optimal action choice
- Communication bandwidth = induced width of the coordination graph

[Figure: coordination graph over agent actions A1 through A4, with local payoffs Q1(A1,A2), Q2(A1,A3), Q3(A3,A4), Q4(A2,A4)]

max_{A1,A2,A3,A4} [Q1(A1,A2) + Q2(A1,A3) + Q3(A3,A4) + Q4(A2,A4)]
= max_{A1,A2,A3} [Q1(A1,A2) + Q2(A1,A3) + max_{A4} [Q3(A3,A4) + Q4(A2,A4)]]
= max_{A1,A2,A3} [Q1(A1,A2) + Q2(A1,A3) + g(A2,A3)]
Eliminating A4 produces the conditional-strategy table g(A2,A3), with entries like: "if A2 attacks and A3 defends, then A4 gets $10."
Summary of Algorithm
1. Model domain as Relational MDPs
2. Factored LP computes a class-level value function
3. Reuse class-level value function in new world
Experimental Results
- SysAdmin problem
[Figure: example network topologies: unidirectional ring, star, ring of rings; one machine is the server]
Generalizing to New Problems

[Chart: estimated policy value per agent (y-axis, roughly 3 to 4.6) on the Ring, Star, and Three legs topologies, comparing the class-based value function, the 'optimal' approximate value function, and the utopic maximum value]
Classes of Objects Discovered
Learned 3 classes: Server, Intermediate, and Leaf.
[Figure: example network with each machine labeled by its discovered class]
Learning Classes of Objects
[Chart: max-norm error of the value function (y-axis, 0 to 1.4) on the Ring, Star, and Three legs topologies, comparing "No class learning" with "Learnt classes"]
Results

- 2 Peasants, Gold, Wood, Barracks, 2 Footmen, Enemy
- Reward for dead enemy
- About 1 million state/action pairs
- Solved with the factored LP (some factors are exponential)
- Coordination graph for action selection
[with Gearhart and Kanodia]
Generalization
- 9 Peasants, Gold, Wood, Barracks, 3 Footmen, Enemy
- Reward for dead enemy
- About 3 trillion state/action pairs
- Instantiate the generalizable value function
- At run time, factors are polynomial
Coordination graph for action selection
The 3 aspects of this talk
- Scaling up collaborative multiagent planning: exploiting structure and generalization
- Factored representations and algorithms: relational MDPs, factored LP, coordination graphs
- Freecraft as a benchmark domain
Conclusions

- RMDP: a compact representation for a set of similar planning problems
- Solve a single instance with factored MDP algorithms
- Tackle sets of problems with class-level value functions: efficient sampling of worlds, learned classes of value functions
- Generalization to new domains: avoid replanning, solve larger and more complex MDPs