
Page 1: Distributed Planning in Hierarchical Factored MDPs

Distributed Planning in Hierarchical Factored MDPs

Carlos Guestrin, Stanford University

Geoffrey Gordon, Carnegie Mellon University

Page 2: Distributed Planning in Hierarchical Factored MDPs

Multiagent Coordination Examples

- Search and rescue
- Factory management
- Supply chain
- Firefighting
- Network routing
- Air traffic control

Agents access only local information → distributed control and distributed planning.

Page 3: Distributed Planning in Hierarchical Factored MDPs

Hierarchical Decomposition

[Diagram: part-of hierarchy of a car. Chassis has parts Engine and Steering; Engine has parts Exhaust, Injection, and Cylinders.]

- Subsystems can share variables
- Each subsystem only observes its local variables
- Parallel decomposition → exponential state space

Page 4: Distributed Planning in Hierarchical Factored MDPs

Outline

- Object-based representation: hierarchical factored MDPs
- Distributed planning: message passing algorithm based on LP decomposition
- Hierarchical action selection mechanism: limited observability and communication
- Reusing plans and computation: exploit classes of objects

Page 5: Distributed Planning in Hierarchical Factored MDPs

Basic Subsystem MDP

[Diagram: a speed-control subsystem with internal variables I, I'; external variables; actions; and reward.]

Subsystem j is decomposed into:
- Internal variables Xj
- External variables Yj
- Actions Aj

Subsystem model:
- Rewards: Rj(Xj, Yj, Aj)
- Transitions: Pj(Xj' | Xj, Yj, Aj)

A subsystem can be modeled with any representation.
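This decomposition maps naturally onto a small data structure. Below is a minimal sketch in Python, assuming tabular rewards and transitions; the class and field names are illustrative, not from the paper.

```python
# A minimal sketch of a subsystem model, assuming tabular rewards and
# transitions indexed by (internal state, external state, action).
# All names here are illustrative, not from the paper.
from dataclasses import dataclass, field

@dataclass
class SubsystemMDP:
    name: str
    internal_vars: list          # Xj: variables this subsystem owns
    external_vars: list          # Yj: variables owned by neighbors
    actions: list                # Aj
    # Rj: maps (xj, yj, aj) -> float
    reward: dict = field(default_factory=dict)
    # Pj(xj' | xj, yj, aj): maps (xj, yj, aj) -> {xj': prob}
    transition: dict = field(default_factory=dict)

# Example: the speed-control subsystem from the slides, with internal
# variable I and (assumed) external gear variable G.
speed = SubsystemMDP(
    name="speed_control",
    internal_vars=["I"], external_vars=["G"],
    actions=["accelerate", "coast"],
)
```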

Page 6: Distributed Planning in Hierarchical Factored MDPs

Hierarchical Subsystem Tree

Subsystem tree:
- Nodes are subsystems
- Hierarchical decomposition
- Tree reward = sum of subsystem rewards

Consistent subsystem tree:
- Running intersection property
- Consistent dynamics

Lemma: consistent subsystem tree yields well-defined global MDP

[Diagram: subsystem tree. Root M1 Transmission (variables G, C); children M2 Speed control (variables I, G, S) and M3 Cooling (variables F, T). The separator sets SepSet[M2] and SepSet[M3] label the tree edges.]
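The running intersection property can be checked mechanically: any variable shared by two subsystems must appear in every subsystem on the tree path between them. A minimal sketch, with the tree given as parent pointers and per-node variable scopes (this input format is assumed, not from the paper):

```python
# Check the running intersection property on a subsystem tree.
# tree: {node: parent or None}; scopes: {node: set of variable names}.
def has_running_intersection(tree, scopes):
    def path(u, v):
        # Collect ancestors of each node, then splice the u -> lca -> v path.
        anc_u, anc_v = [], []
        n = u
        while n is not None:
            anc_u.append(n); n = tree[n]
        n = v
        while n is not None:
            anc_v.append(n); n = tree[n]
        common = set(anc_u) & set(anc_v)
        lca = next(n for n in anc_u if n in common)
        return anc_u[:anc_u.index(lca) + 1] + anc_v[:anc_v.index(lca)][::-1]

    nodes = list(scopes)
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            shared = scopes[u] & scopes[v]
            for mid in path(u, v):
                if not shared <= scopes[mid]:
                    return False
    return True

# The car example: M1 Transmission with children M2 and M3.
tree = {"M1": None, "M2": "M1", "M3": "M1"}
scopes = {"M1": {"G", "C"}, "M2": {"I", "G", "S"}, "M3": {"F", "T"}}
print(has_running_intersection(tree, scopes))  # True for this tree
```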

Page 7: Distributed Planning in Hierarchical Factored MDPs

Relationship to Factored MDPs

[Diagram: a multiagent factored MDP drawn as a dynamic Bayesian network over state variables X1, X2, X3 (and successors X1', X2', X3'), actions A1, A2, rewards R1, R2, R3, and basis functions h1, h2; the same network partitioned into subsystems M1 and M2 with separator SepSet[M2].]

Multiagent factored MDPs [Guestrin et al. '01] vs. hierarchical factored MDPs:
- Representational power is equivalent: a hierarchical factored MDP is a multiagent factored MDP with a particular choice of basis functions
- New capabilities: a fully distributed planning algorithm, reuse for knowledge representation, reuse of computation
- The MDP counterpart to Object-Oriented Bayes Nets (OOBNs) [Koller and Pfeffer '97]

Page 8: Distributed Planning in Hierarchical Factored MDPs

Planning for Hierarchical Factored MDPs

Action space: joint action a = {a1, ..., an} for all subsystems
State space: joint state x of the entire system
Reward function: total reward r

Action and state spaces are exponential in the number of subsystems.

Exploit the hierarchical structure:
- Efficient, distributed approximate planning algorithm
- Simple message passing approach
- Each subsystem accesses only its local model
- Each local model is solved by any standard MDP algorithm

Page 9: Distributed Planning in Hierarchical Factored MDPs

Solving MDPs as LPs

Bellman constraint: if action a takes state x to state y with reward r, then

V(x) ≥ V(y) + r = Q(a, x)

and similarly for stochastic transitions. The optimal V* satisfies all Bellman constraints and is the componentwise smallest such function. For the four-state example on the slide:

min V(x) + V(y) + V(z) + V(g)  s.t.
V(x) ≥ V(y) + 1
V(y) ≥ V(g) + 3
V(x) ≥ V(z) + 2
V(z) ≥ V(g) + 1
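This small LP can be handed directly to an off-the-shelf solver. A minimal sketch using scipy; pinning V(g) = 0 is our addition to keep the LP bounded:

```python
# The LP on this slide, solved with scipy (variable names follow the slide).
from scipy.optimize import linprog

# Variable order: [V(x), V(y), V(z), V(g)]
c = [1, 1, 1, 1]  # minimize V(x) + V(y) + V(z) + V(g)

# Each Bellman constraint V(s) >= V(s') + r becomes -V(s) + V(s') <= -r
A_ub = [
    [-1,  1,  0,  0],   # V(x) >= V(y) + 1
    [ 0, -1,  0,  1],   # V(y) >= V(g) + 3
    [-1,  0,  1,  0],   # V(x) >= V(z) + 2
    [ 0,  0, -1,  1],   # V(z) >= V(g) + 1
]
b_ub = [-1, -3, -2, -1]

# Pin the goal's value so the LP is bounded: V(g) = 0 (our addition).
res = linprog(c, A_ub=A_ub, b_ub=b_ub,
              A_eq=[[0, 0, 0, 1]], b_eq=[0],
              bounds=[(None, None)] * 4)
print(dict(zip("xyzg", res.x)))  # expected: V(x)=4, V(y)=3, V(z)=1, V(g)=0
```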

Page 10: Distributed Planning in Hierarchical Factored MDPs

Decomposable Value Functions

Linear combinations of restricted-domain functions [Bellman et al. '63] [Schweitzer & Seidmann '85] [Tsitsiklis & Van Roy '96] [Koller & Parr '99, '00] [Guestrin et al. '01]:

V(x) ≈ Σi wi hi(x)

Each hi is the status of a small part of the complex system, for example:
- The status of a machine and its neighbors
- The load on a machine

We must find weights w giving a good approximate value function. Well-designed hi → exponentially fewer parameters.
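Evaluating such a decomposable value function touches only each basis function's small scope. A minimal sketch with an illustrative toy basis (the names and functions are ours):

```python
# V(x) ~ sum_i w_i * h_i(x), where each h_i looks only at a small
# part of the state. The toy basis below is illustrative.
def decomposable_value(x, basis, w):
    """x: dict var -> value; basis: list of (scope, fn); w: weights."""
    return sum(wi * fn(*(x[v] for v in scope))
               for wi, (scope, fn) in zip(w, basis))

# Two basis functions over a 3-variable state; each sees one variable.
basis = [(("load",), lambda load: float(load > 0.8)),
         (("status",), lambda s: 1.0 if s == "ok" else 0.0)]
w = [-2.0, 5.0]
x = {"load": 0.9, "status": "ok", "queue": 3}
print(decomposable_value(x, basis, w))  # -2.0 + 5.0 = 3.0
```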

Page 11: Distributed Planning in Hierarchical Factored MDPs

Approximate Linear Programming

To solve the subsystem tree MDP as an LP:
- The overall state is the cross-product of the subsystem states
- The Bellman LP has exponentially many constraints and variables, so we need to approximate

Write V(x) = V1(X1) + V2(X2) + ...

Minimize V1(X1) + V2(X2) + ...
s.t. V1(X1) + V2(X2) + ... ≥ V1(Y1) + V2(Y2) + ... + R1 + R2 + ...

- One variable Vi(Xi) for each state of each subsystem
- One constraint for every state and action
- Vi, Qi depend on small sets of variables/actions

This generates polynomially-sized LPs for factored MDPs [Guestrin et al. '01].
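To make the structure concrete, here is a toy approximate LP for a two-variable factored MDP with V(x) = V1(X1) + V2(X2). The model (deterministic transitions, reward x1 + x2, discount 0.9) is our invention, not the paper's:

```python
# A toy approximate LP for a factored MDP with V(x) = V1(x1) + V2(x2).
# Two binary variables; action a_i deterministically sets x_i' = a_i;
# reward R(x) = x1 + x2; discount 0.9.
from itertools import product
from scipy.optimize import linprog

gamma = 0.9
# LP variables: [V1(0), V1(1), V2(0), V2(1)]
idx = {("V1", 0): 0, ("V1", 1): 1, ("V2", 0): 2, ("V2", 1): 3}

A_ub, b_ub = [], []
for x1, x2, a1, a2 in product((0, 1), repeat=4):
    # V1(x1) + V2(x2) >= R(x) + gamma * (V1(x1') + V2(x2'))
    row = [0.0] * 4
    row[idx[("V1", x1)]] -= 1.0
    row[idx[("V2", x2)]] -= 1.0
    row[idx[("V1", a1)]] += gamma   # x1' = a1
    row[idx[("V2", a2)]] += gamma   # x2' = a2
    A_ub.append(row)
    b_ub.append(-(x1 + x2))

# Objective: uniform state-relevance weights over the 4 joint states.
c = [2.0, 2.0, 2.0, 2.0]
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * 4)
print(res.x)  # one optimal split, e.g. V1 = [9, 10] + c, V2 = [9, 10] - c
```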

Page 12: Distributed Planning in Hierarchical Factored MDPs

Overview of Algorithm

- Each subsystem solves a local (stand-alone) MDP
- Each subsystem computes messages by solving a simple local LP:
  - It sends a 'constraint message' to its parent
  - It sends 'reward messages' to its children
- Repeat until convergence

[Diagram: subsystem tree with parent Mj and children Mk, Ml. Reward messages flow from Mj down to Mk and Ml; constraint messages flow from the children up to Mj.]

Page 13: Distributed Planning in Hierarchical Factored MDPs

Stand-alone MDPs and Reward Messages

Subsystem MDP:
- State: (Xj, Yj)
- Actions: Aj
- Rewards: Rj(Xj, Yj, Aj)
- Transitions: Pj(Xj' | Xj, Yj, Aj)

Reward messages: Sj from the parent, Sk to the children.

Stand-alone MDP:
- State: Xj
- Actions: (Aj, Yj)
- Rewards: Rj(Xj, Yj, Aj) - Sj + Σk Sk
- Transitions: Pj(Xj' | Xj, Yj, Aj)

- Reward messages are defined over the SepSets
- Solve the stand-alone MDP using any algorithm
- Obtain the visitation frequencies of the resulting policy: μj = discounted frequency of visits to each state-action pair
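In the tabular case, these frequencies fall out of the LP dual. A minimal sketch on a toy two-state MDP (the model is ours), assuming scipy's HiGHS backend, which reports the constraint marginals:

```python
# Read state-action visitation frequencies off the dual of a Bellman LP.
# Toy 2-state MDP: in state 0, action 'stay' (r=0) or 'go' (r=1, -> 1);
# in state 1 a single 'stay' action (r=0). Discount 0.9.
from scipy.optimize import linprog

gamma, alpha = 0.9, [0.5, 0.5]   # alpha: initial state distribution
# Primal: min alpha.V  s.t.  V(s) >= r + gamma V(s')  for each (s, a).
# Rows: (0, 'stay'), (0, 'go'), (1, 'stay')
A_ub = [[-1 + gamma, 0],      # V0 >= 0 + gamma*V0
        [-1, gamma],          # V0 >= 1 + gamma*V1
        [0, -1 + gamma]]      # V1 >= 0 + gamma*V1
b_ub = [0, -1, 0]
res = linprog(alpha, A_ub=A_ub, b_ub=b_ub,
              bounds=[(None, None)] * 2, method="highs")

# Dual variables of the "<=" rows are nonpositive; their negations are
# the discounted visitation frequencies mu(s, a) of the optimal policy.
mu = [-m for m in res.ineqlin.marginals]
print(res.x, mu)   # V = [1, 0]; mu sums to 1/(1 - gamma) = 10
```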

Page 14: Distributed Planning in Hierarchical Factored MDPs

Visitation Frequencies

The dual of the Bellman LP: its variables are the discounted frequencies of visits to each state-action pair.

- Subsystems must agree on the frequencies of the shared variables → reward messages
- Approximation → relaxed enforcement of the consistency constraints

[Diagram: subsystem M2 Speed control with variables I, G, S.]

Page 15: Distributed Planning in Hierarchical Factored MDPs

Overview of Algorithm: Detailed

[Diagram: subsystem tree with parent Mj and children Mk, Ml, as before.]

1. Each subsystem solves a local (stand-alone) MDP
2. It computes its local visitation frequencies μj and adds a constraint to the reward message LP
3. Each subsystem computes messages by solving a simple local LP:
   - It sends a 'constraint message' to its parent: visitation frequencies for the SepSet variables
   - It sends 'reward messages' to its children
4. Repeat until convergence
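The essence of the reward messages, agreement on shared variables achieved by shifting reward between subsystems, can be seen in a stateless toy. The scalar price update below is our simplification; the actual algorithm exchanges LP-based messages over the SepSets:

```python
# Toy "reward sharing" on one shared binary variable g, in the spirit of
# the message passing above. The model and update rule are our invention.
import numpy as np

r1 = np.array([2.0, 0.0])   # transmission's local reward for g = 0, 1
r2 = np.array([0.0, 3.0])   # speed control's local reward for g = 0, 1
price = np.zeros(2)         # reward message: transfers reward between them
for step in range(200):
    g1 = int(np.argmax(r1 + price))   # M1 maximizes its shifted reward
    g2 = int(np.argmax(r2 - price))   # M2 pays the price back
    if g1 == g2:
        break
    # Subgradient step: move reward toward the choice M2 prefers.
    lr = 1.0 / (step + 1)
    price[g2] += lr
    price[g1] -= lr
print(g1, g2)  # both settle on g = 1, the jointly optimal choice
```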

Page 16: Distributed Planning in Hierarchical Factored MDPs

Reward Message LP

- The LP yields the reward messages Sk for the children
- The dual of this LP yields mixing weights pj, pk that enforce consistent frequencies

Page 17: Distributed Planning in Hierarchical Factored MDPs

Computing Reward Messages

- Rows of the frequency matrix for Mj and of Lj correspond to the visitation frequencies and the value of each policy visited by Mj
- The corresponding rows for each child Mk are the frequencies marginalized to SepSet[Mk]
- Messages: the dual of the reward message LP generates mixed policies; pj and pk are the mixing parameters, which force parents and children to agree on the visitation of the SepSet

Page 18: Distributed Planning in Hierarchical Factored MDPs

Convergence Result

The planning algorithm is a special case of nested Benders decomposition:
- One Benders split for each internal node N of the subsystem tree
- One subproblem is N itself; the remaining subproblems are the subtrees for N's children (decomposed recursively)
- The master problem is to determine the reward messages

The result follows from the correctness of Benders decomposition.

[Diagram: reward message from Mj down to Ml; constraint message from Ml up to Mj.]

In a finite number of iterations, the algorithm produces the best possible value function (i.e., the same as a centralized planner).

Page 19: Distributed Planning in Hierarchical Factored MDPs

Hierarchical Action Selection

[Diagram: subsystem tree with parent Mj and children Mk, Ml. Action choices flow from Mj down to the children; values of conditional policies flow from the children up to Mj.]

- Distributed planning obtains the value function
- Distributed message passing obtains the action choice (the policy):
  - Each subsystem sends the value of its conditional policy to its parent
  - Each subsystem sends its action choice to its children
- Limited observability, limited communication
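A minimal sketch of the two passes on a two-level tree; the payoff tables are invented for illustration. Each child reports the value of its best action for every value of the shared variable (its conditional policy); the parent picks the shared value and broadcasts it:

```python
# Hierarchical action selection on a two-level tree. Each child reports,
# for every value of its SepSet variable g, the value of its best local
# action; the parent then picks g and broadcasts the choice.
# The payoff tables below are toy values, not from the paper.

# Child value tables: children[name][g][action].
children = {
    "speed":   {0: {"coast": 1.0, "accel": 0.5},
                1: {"coast": 0.2, "accel": 2.0}},
    "cooling": {0: {"fan_off": 0.7, "fan_on": 0.4},
                1: {"fan_off": 0.1, "fan_on": 0.9}},
}
parent_reward = {0: 1.5, 1: 0.5}   # transmission's own reward for g

# Upward pass: value of each child's conditional policy, per g.
cond_value = {name: {g: max(q.values()) for g, q in table.items()}
              for name, table in children.items()}
# Parent's choice: its own reward plus the children's conditional values.
g_star = max((0, 1), key=lambda g: parent_reward[g]
             + sum(v[g] for v in cond_value.values()))
# Downward pass: each child picks its best action given g_star.
actions = {name: max(table[g_star], key=table[g_star].get)
           for name, table in children.items()}
print(g_star, actions)  # g=1: 0.5 + 2.0 + 0.9 = 3.4 beats g=0: 3.2
```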

Page 20: Distributed Planning in Hierarchical Factored MDPs

Reusing Models and Computation

- Classes of objects: basic subsystems with the same rewards and transitions
- Reuse in knowledge representation: a library of subsystems
- Reuse of computation:
  - Compute the policy (visitation frequencies) for one subsystem and use it in all subsystems of the same class
  - Compute messages for one subtree and use them in all equivalent subtrees
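Operationally, this reuse is just caching keyed by subsystem class; a minimal sketch (the cache key and the solver stand-in are illustrative):

```python
# Sketch of computation reuse across a class of identical subsystems:
# solve one representative per class, then share the result.
from functools import lru_cache

@lru_cache(maxsize=None)
def solve_class(subsystem_class):
    # Stand-in for the expensive local MDP solve; runs once per class.
    print(f"solving representative of class {subsystem_class!r}")
    return f"policy<{subsystem_class}>"

fleet = ["pump"] * 3 + ["valve"] * 2
policies = {i: solve_class(cls) for i, cls in enumerate(fleet)}
# Only two solves are printed; all five subsystems reuse the two policies.
```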

Page 21: Distributed Planning in Hierarchical Factored MDPs

Related Work

Serial decompositions: one subsystem "active" at a time
- Kushner & Chen '74 (rooms in a maze)
- Dean & Lin, IJCAI-95 (combined with abstraction)
- Hierarchical RL is similar (MAXQ, HAM, etc.)

Parallel decompositions: more expressive (exponentially larger state space)
- Singh & Cohn, NIPS-98 (enumerates states)
- Meuleau et al., AAAI-98 (heuristic for resources)

Page 22: Distributed Planning in Hierarchical Factored MDPs

Related Work

Dantzig-Wolfe and Benders decomposition:
- Dantzig '65
- First used for MDPs by Kushner & Chen '74
- We are the first to apply it to parallel subsystems

Variable elimination:
- Well known from Bayes nets
- Guestrin, Koller & Parr, NIPS-01

Page 23: Distributed Planning in Hierarchical Factored MDPs

Summary – Hierarchical Factored MDPs

- Parallel decomposition → exponential state space
- Efficient distributed planning algorithm:
  - Solve the local stand-alone MDPs with any algorithm
  - Reward sharing coordinates the subsystem plans
  - A simple message passing algorithm computes the rewards
- Hierarchical action selection: limited communication, limited observability
- Reuse for knowledge representation and computation

A general approach for modeling and planning in large stochastic systems.