Planning under Uncertainty with Markov Decision Processes: Lecture II
Craig Boutilier
Department of Computer Science
University of Toronto
PLANET Lecture Slides (c) 2002, C. Boutilier
Recap
We saw logical representations of MDPs
• propositional: DBNs, ADDs, etc.
• first-order: situation calculus
• offer natural, concise representations of MDPs
Briefly discussed abstraction as a general computational technique
• discussed one simple (fixed uniform) abstraction method that gave approximate MDP solution
• construction exploited logical representation
Overview
We’ll look at further abstraction methods based on a decision-theoretic analog of regression
• value iteration as variable elimination
• propositional decision-theoretic regression
• approximate decision-theoretic regression
• first-order decision-theoretic regression
We’ll look at linear approximation techniques
• how to construct linear approximations
• relationship to decomposition techniques
Wrap up
Dimensions of Abstraction (recap)
(Figure: example abstractions of a state space over variables A, B, C, with cluster values such as 5.3, 2.9 and 9.3. Abstractions vary along three dimensions: uniform vs. nonuniform, exact vs. approximate, and fixed vs. adaptive.)
Classical Regression
Goal regression is a classical abstraction method
• The regression of a logical condition/formula G through action a is the weakest logical formula C = Regr(G,a) such that: G is guaranteed to be true after doing a if C is true before doing a
• Weakest precondition for G wrt a
(Figure: the set of states satisfying C is mapped by do(a) into the set of states satisfying G.)
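To make the weakest-precondition idea concrete, here is a minimal sketch for STRIPS-style actions. The set representation and the `drive` action's literals are assumptions for illustration, not the situation-calculus Regr operator:

```python
# Weakest-precondition (goal regression) sketch for STRIPS-style actions.
# An action is (add list, delete list, precondition), each a set of literals.

def regress(goal, action):
    """Return the weakest condition C such that executing `action`
    in any state satisfying C makes every literal in `goal` true."""
    add, delete, precond = action
    if goal & delete:
        return None  # the action destroys part of the goal
    # literals supplied by the action need not hold beforehand
    return (goal - add) | precond

# Hypothetical delivery action drive(T, Paris) and its effects
drive = (
    {"TruckIn(T,Paris)"},   # add list
    {"TruckIn(T,Depot)"},   # delete list
    {"Fueled(T)"},          # precondition
)

goal = {"TruckIn(T,Paris)"}
print(regress(goal, drive))   # {'Fueled(T)'}
```

Regressing a goal the action deletes yields no usable precondition, which `regress` signals with `None`.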
Example: Regression in SitCalc
For the situation calculus
• Regr(G(do(a,s))): logical condition C(s) under which a leads to G (aggregates C states and ¬C states)
Regression in sitcalc is straightforward:
• Regr(F(x, do(a,s))) ≡ ΦF(x,a,s)
• Regr(¬φ) ≡ ¬Regr(φ)
• Regr(φ1 ∧ φ2) ≡ Regr(φ1) ∧ Regr(φ2)
• Regr(∃x.φ) ≡ ∃x.Regr(φ)
Decision-Theoretic Regression
In MDPs, we don’t have goals, but regions of distinct value
Decision-theoretic analog: given “logical description” of Vt+1, produce such a description of Vt or optimal policy (e.g., using ADDs)
Cluster together states at any point in the calculation with the same best action (policy), or with the same value (VF)
Decision-Theoretic Regression
Decision-theoretic complications:
• multiple formulae G describe fixed value partitions
• a can lead to multiple partitions (stochastically)
• so find regions with same “partition” probabilities
(Figure: regions G1, G2, G3 of Vt-1 are reached from region C1 of Qt(a) with probabilities p1, p2, p3.)
Functional View of DTR
Generally, Vt-1 depends on only a subset of variables (usually in a structured way)
What is the value of action a at stage t (at any s)?
(Figure: a two-slice DBN for the action over variables T, L, CR, RHC, RHM, M, with factors fRm(Rmt,Rmt+1), fM(Mt,Mt+1), fT(Tt,Tt+1), fL(Lt,Lt+1), fCr(Lt,Crt,Rct,Crt+1), fRc(Rct,Rct+1); Vt-1 is a tree over CR and M with leaves -10 and 0.)
Functional View of DTR
Assume VF Vt-1 is structured: what is the value of doing action a (DelC) at time t?

Qat(Rmt,Mt,Tt,Lt,Crt,Rct)
 = R + γ Σ Rm,M,T,L,Cr,Rc(t+1) Pra(Rmt+1,Mt+1,Tt+1,Lt+1,Crt+1,Rct+1 | Rmt,Mt,Tt,Lt,Crt,Rct) · Vt-1(Rmt+1,Mt+1,Tt+1,Lt+1,Crt+1,Rct+1)
 = R + γ Σ Rm,M,T,L,Cr,Rc(t+1) fRm(Rmt,Rmt+1) fM(Mt,Mt+1) fT(Tt,Tt+1) fL(Lt,Lt+1) fCr(Lt,Crt,Rct,Crt+1) fRc(Rct,Rct+1) · Vt-1(Mt+1,Crt+1)
 = R + γ Σ M,Cr(t+1) fM(Mt,Mt+1) fCr(Lt,Crt,Rct,Crt+1) · Vt-1(Mt+1,Crt+1)
 = f(Mt,Lt,Crt,Rct)
Functional View of DTR
Qt(a) depends only on a subset of variables
• the relevant variables are determined automatically by considering variables mentioned in Vt-1 and their parents in the DBN for action a
• Q-functions can be produced directly using VE
Notice also that these functions may be quite compact (e.g., if VF and CPTs use ADDs)
• we’ll see this again
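The computation of Qat by summing out the stage-(t+1) variables can be sketched numerically. The dynamics below (binary variables X and Y, CPT entries) are assumed for illustration; the point is that X′ sums out entirely, so Q mentions only the parents of the variable appearing in Vt-1:

```python
# Q-function construction by variable elimination on a two-slice DBN.
# Assumed toy factors: Pr(X'=1|X) and Pr(Y'=1|X,Y) for some action a;
# V depends only on Y', so only Y''s parents {X, Y} survive in Q.

GAMMA = 0.9
pX1 = {0: 0.2, 1: 0.8}            # Pr(X'=1 | X)
pY1 = {(0, 0): 0.1, (0, 1): 0.5,  # Pr(Y'=1 | X, Y)
       (1, 0): 0.6, (1, 1): 0.9}
V = {0: 0.0, 1: 10.0}             # V^{t-1} as a function of Y'

def q(x, y, reward=0.0):
    """Q_a^t(x, y) = R + gamma * sum_{x',y'} Pr(x'|x) Pr(y'|x,y) V(y')."""
    total = 0.0
    for xp in (0, 1):
        px = pX1[x] if xp == 1 else 1 - pX1[x]
        for yp in (0, 1):
            py = pY1[(x, y)] if yp == 1 else 1 - pY1[(x, y)]
            total += px * py * V[yp]
    return reward + GAMMA * total

# X' sums out entirely: Q is a function of Y''s parents only
for x in (0, 1):
    for y in (0, 1):
        print((x, y), round(q(x, y), 3))
```

Since V ignores X′, the inner sum over `xp` contributes a factor of 1, exactly the "sum out only irrelevant variables" effect exploited by structured DTR.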
Planning by DTR
Standard DP algorithms can be implemented using structured DTR
All operations exploit ADD rep’n and algorithms
• multiplication, summation, maximization of functions
• standard ADD packages very fast
Several variants possible
• MPI/VI with decision trees [BouDeaGol95,00; Bou97; BouDearden96]
• MPI/VI with ADDs [HoeyStAubinHuBoutilier99, 00]
Structured Value Iteration
Assume compact representation of Vk • start with R at stage-to-go 0 (say)
For each action a, compute Qk+1 using variable elimination on the two-slice DBN
• eliminate all k-variables, leaving only k+1 variables
• use ADD operations if initial rep’n allows
Compute Vk+1 = maxa Qk+1
• use ADD operations if initial representation allows
Policy iteration can be approached similarly
Structured Policy and Value Function
(Figure: a structured policy with actions DelC, BuyC, GetU, Noop and Go at the leaves of a tree over HCU, HCR, W, R, U and Loc, and a structured value function over the same variables with leaf values 10.00, 9.00, 8.45, 8.36, 7.64, 7.45, 6.83, 6.81, 6.64, 6.19, 6.10, 5.83, 5.62 and 5.19.)
Structured Policy Evaluation: Trees
Assume a tree for V t, produce V t+1
For each distinction Y in Tree(V t):
a) use the 2TBN to discover conditions affecting Y
b) piece together using the structure of Tree(V t)
Result is a tree exactly representing V t+1
• dictates conditions under which leaves (values) of Tree(V t ) are reached with fixed probability
A Simple Action/Reward Example
(Figure: DBN for action A over variables X, Y, Z, with CPT trees using probabilities 1.0, 0.9 and 0.0; roughly, X’ persists, Y’ depends on X, and Z’ depends on Y and Z. Reward function R: 10 if Z, else 0.)
Example: Generation of V1
(Figure: V0 = R is a tree over Z with leaves 10 and 0. Step 1 labels the leaves of a tree over Y and Z with Pr(Z’) ∈ {0.9, 0.0, 1.0}; Step 2 converts these to expected values 9.0, 0.0 and 10.0; Step 3 yields V1, a tree over Y and Z with leaves 8.1, 0.0 and 19.0.)
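A numerical sketch of this backup (the CPT for Z’ below is my reading of the slide's network and should be treated as an assumption) shows the key abstraction property: the resulting value function is constant in X, because X is not a parent of the only variable mentioned in V0:

```python
# Structured backup check: the new VF mentions only the parents of
# variables appearing in V0.  CPT numbers are assumptions.
from itertools import product

GAMMA = 0.9

def pr_z_next(y, z):
    # Pr(Z'=1 | Y, Z): tree over Y then Z (assumed numbers)
    if y:
        return 0.9
    return 1.0 if z else 0.0

def reward(x, y, z):
    return 10.0 if z else 0.0

def v1(x, y, z):
    # V1(s) = R(s) + gamma * E[V0(s')], with V0 = R (a function of Z' only)
    p = pr_z_next(y, z)
    return reward(x, y, z) + GAMMA * (p * 10.0 + (1 - p) * 0.0)

# V1 is constant in X: abstraction clusters states differing only in X
for y, z in product((0, 1), repeat=2):
    assert v1(0, y, z) == v1(1, y, z)
    print((y, z), v1(0, y, z))
```

The assertion is exactly the state-clustering DTR performs symbolically: the 8 ground states collapse to 4 abstract states over Y and Z.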
Example: Generation of V2
(Figure: V1, a tree over Y and Z with leaves 8.1, 0.0 and 19.0, is regressed through action A; Steps 1 and 2 build trees over X, Y and Z labelled with occurrence probabilities for Y’ and Z’ (e.g., Y: 0.9, Z: 0.9, Z: 1.0, Y: 0.0), from which V2 is assembled.)
Some Results: Natural Examples
A Bad Example for SPUDD/SPI
Action ak makes Xk true;
makes X1... Xk-1 false;
requires X1... Xk-1 true
Reward: 10 if all X1 ... Xn true (value function for n = 3 is shown)
Some Results: Worst-case
A Good Example for SPUDD/SPI
Action ak makes Xk true;
requires X1... Xk-1 true
Reward: 10 if all X1 ... Xn true (value function for n = 3 is shown)
Some Results: Best-case
DTR: Relative Merits
Adaptive, nonuniform, exact abstraction method
• provides exact solution to MDP
• much more efficient on certain problems (time/space)
• 400 million state problems (ADDs) in a couple hrs
Some drawbacks
• produces piecewise constant VF
• some problems admit no compact solution representation (though ADD overhead “minimal”)
• approximation may be desirable or necessary
Approximate DTR
Easy to approximate solution using DTR
Simple pruning of value function
• Can prune trees [BouDearden96] or ADDs [StaubinHoeyBou00]
Gives regions of approximately same value
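A minimal sketch of the pruning step; leaf labels and values loosely follow the example value ADD, and the greedy bottom-up merge rule is an assumption:

```python
# Pruning a tree/ADD-structured value function: merge leaves whose values
# all lie within a tolerance delta, recording the resulting value range.
# Leaves are (condition-label, value) pairs -- a sketch only.

def prune(leaves, delta):
    """Greedily merge value-sorted leaves into groups of span <= delta.
    Returns (labels, (lo, hi)) groups; the midpoint of each range can
    serve as the approximate value."""
    groups = []
    for label, v in sorted(leaves, key=lambda lv: lv[1]):
        if groups and v - groups[-1][1][0] <= delta:
            labels, (lo, _) = groups[-1]
            groups[-1] = (labels + [label], (lo, v))
        else:
            groups.append(([label], (v, v)))
    return groups

leaves = [("HCU&W&R", 8.45), ("HCU&W&~R", 8.36), ("HCU&~W", 7.45),
          ("HCR", 6.64), ("else", 5.19)]
for labels, (lo, hi) in prune(leaves, delta=1.0):
    print(labels, (lo, hi))
```

Each merged group corresponds to one ranged leaf of the pruned ADD shown on the next slide; fewer leaves means smaller diagrams and faster backups, at the cost of a bounded value error.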
A Pruned Value ADD
(Figure: the value ADD over HCU, HCR, W, R, U and Loc with leaves 10.00, 9.00, 8.45, 8.36, 7.64, 7.45, 6.81, 6.64, 6.19, 5.62 and 5.19 is pruned to an ADD over HCU, HCR, W and Loc with ranged leaves [9.00, 10.00], [7.45, 8.45], [6.64, 7.64] and [5.19, 6.19].)
Approximate Structured VI
Run normal SVI using ADDs/DTs
• at each leaf, record the range of values
At each stage, prune interior nodes whose leaves all have values within some threshold
• tolerance can be chosen to minimize error or size
• tolerance can be adjusted to magnitude of VF
Convergence requires some care
If the max span over leaves is ≤ δ and the termination tolerance is ≤ ε:
  ||V* − Ṽ|| ≤ 2(ε + δ) / (1 − γ)
Approximate DTR: Relative Merits
Relative merits of ADTR
• fewer regions implies faster computation
• can provide leverage for optimal computation
• 30-40 billion state problems in a couple of hours
• allows fine-grained control of time vs. solution quality with dynamic (a posteriori) error bounds
• technical challenges: variable ordering, convergence, fixed vs. adaptive tolerance, etc.
Some drawbacks
• (still) produces piecewise constant VF
• doesn’t exploit additive structure of VF at all
First-order DT Regression
DTR methods so far are propositional• extension to FO case critical for practical planning
First-order DTR extends existing propositional DTR methods in interesting ways
First let’s quickly recap the stochastic sitcalc specification of MDPs
SitCalc: Domain Model (Recap)
Domain axiomatization: successor state axioms
• one axiom per fluent F: F(x, do(a,s)) ≡ ΦF(x,a,s)
These can be compiled from effect axioms• use Reiter’s domain closure assumption
TruckIn(t,c,do(a,s)) ≡
  (a = drive(t,c) ∧ Fueled(t,s)) ∨
  (TruckIn(t,c,s) ∧ ¬∃c′.(a = drive(t,c′) ∧ c′ ≠ c))

Poss(drive(t,c),s) ⊃ TruckIn(t,c,do(drive(t,c),s))
Axiomatizing Causal Laws (Recap)
choice(unload(b,t), a) ≡ a = unloadS(b,t) ∨ a = unloadF(b,t)

prob(unloadS(b,t), unload(b,t), s) = p ≡
  (Rain(s) ∧ p = 0.7) ∨ (¬Rain(s) ∧ p = 0.9)

prob(unloadF(b,t), unload(b,t), s) = 1 − prob(unloadS(b,t), unload(b,t), s)

Poss(unload(b,t), s) ≡ On(b,t,s)
Stochastic Action Axioms (Recap)
For each possible outcome o of stochastic action a(x), let no(x) denote a deterministic action
Specify usual effect axioms for each no(x)
• these are deterministic, dictating precise outcome
For a(x), assert a choice axiom
• states that the no(x) are the only choices allowed nature
Assert prob axioms
• specify the prob. with which no(x) occurs in situation s
• can depend on properties of situation s
• must be well-formed (probs over the different outcomes sum to one in each feasible situation)
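The well-formedness requirement can be checked mechanically; a sketch using the unload(b,t) numbers, with conditions represented as plain strings:

```python
# Well-formedness check for nature's-choice probability axioms: under every
# feasible condition, the outcome probabilities of a stochastic action must
# sum to one.  Conditions and numbers follow the unload(b,t) example.

PROBS = {
    # condition -> {nature's deterministic outcome: probability}
    "Rain":  {"unloadS": 0.7, "unloadF": 0.3},
    "~Rain": {"unloadS": 0.9, "unloadF": 0.1},
}

def well_formed(prob_axioms, tol=1e-9):
    return all(abs(sum(outs.values()) - 1.0) <= tol
               for outs in prob_axioms.values())

print(well_formed(PROBS))   # True
```

An axiomatization that, say, dropped the unloadF case under Rain would fail this check, signalling an ill-formed stochastic action.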
Specifying Objectives (Recap)
Specify action and state rewards/costs
reward(s) = 10 ≡ ∃b. In(b,Paris,s)
reward(s) = 0 ≡ ¬∃b. In(b,Paris,s)

cost of drive(t,c): 0.5
First-Order DT Regression: Input
Input: Value function Vt(s) described logically:
• If φ1 : v1 ; If φ2 : v2 ; ... If φk : vk
Input: action a(x) with outcomes n1(x),...,nm(x)
• successor state axioms for each ni(x)
• probabilities vary with conditions φ1 , ..., φn

Example: Vt is [∃t.On(B,t,s) : 10 ; ¬∃t.On(B,t,s) : 0]
load(b,t) has outcomes loadS(b,t) (makes On(b,t) true) and loadF(b,t) (no change), with
Pr(loadS) = 0.7 if Rain, 0.9 if ¬Rain (so Pr(loadF) = 0.3 / 0.1)
First-Order DT Regression: Output
Output: Q-function Qt+1(a(x),s) • also described logically: If 1 : q1 ; ... If k : qk
This describes Q-value for all states and for all instantiations of action a(x)
• state and action abstraction
We can construct this by taking advantage of the fact that nature’s actions are deterministic
Step 1
Regress each φi–nj pair: Regr(φi, do(nj(x),s))

A. Regr(∃t.On(B,t), do(loadS(b,t),s)) ≡ (b = B ∧ loc(B,s) = loc(t,s)) ∨ ∃t′.On(B,t′,s)
B. Regr(¬∃t.On(B,t), do(loadS(b,t),s)) ≡ ¬(b = B ∧ loc(B,s) = loc(t,s)) ∧ ¬∃t′.On(B,t′,s)
C. Regr(∃t.On(B,t), do(loadF(b,t),s)) ≡ ∃t′.On(B,t′,s)
D. Regr(¬∃t.On(B,t), do(loadF(b,t),s)) ≡ ¬∃t′.On(B,t′,s)
Step 2
Compute new partitions:
• ψk = φi ∧ Regr(φj(1), n1) ∧ ... ∧ Regr(φj(m), nm)
• Q-value is: Σi Pr(ni | ψ) · Val(φj(i))

Example:
Rain(s) ∧ (A ∧ D):
  Rain(s) ∧ b = B ∧ loc(B,s) = loc(t,s) ∧ ¬∃t′.On(B,t′,s)
  → Q(load(b,t), s) = 0.7 · 10 + 0.3 · 0 = 7.0
  (A: LoadS, pr = 0.7, val = 10; D: LoadF, pr = 0.3, val = 0)
Step 2: Graphical View
(Figure: the Vt-1 partitions ∃t.On(B,t,s) : 10 and ¬∃t.On(B,t,s) : 0 are reached from four regressed regions: ∃t.On(B,t,s) with probability 1.0 (value 10); ¬∃t.On(B,t,s) ∧ Rain(s) ∧ b=B ∧ loc(b,s)=loc(t,s) with probabilities 0.7/0.3 (value 7); the ¬Rain counterpart with probabilities 0.9/0.1 (value 9); and (b≠B ∨ loc(b,s)≠loc(t,s)) ∧ ¬∃t.On(B,t,s) with probability 1.0 (value 0).)
Step 2: With Logical Simplification
∀b,t,s. Q(load(b,t), s) = q ≡
  [∃t′.On(B,t′,s) ∧ q = 10] ∨
  [Rain(s) ∧ b = B ∧ loc(B,s) = loc(t,s) ∧ ¬∃t′.On(B,t′,s) ∧ q = 7] ∨
  [¬Rain(s) ∧ b = B ∧ loc(B,s) = loc(t,s) ∧ ¬∃t′.On(B,t′,s) ∧ q = 9] ∨
  [¬∃t′.On(B,t′,s) ∧ ¬(b = B ∧ loc(B,s) = loc(t,s)) ∧ q = 0]
DP with DT Regression
Can compute Vt+1(s) = maxa {Qt+1(a,s)}
Note: Qt+1(a(x),s) may mention action properties
• may distinguish different instantiations of a
Trick: intra-action and inter-action maximization
• Intra-action: max over instantiations of a(x) to remove dependence on action variables x
• Inter-action: max over different action schemata to obtain value function
Intra-action Maximization
Sort partitions of Qt+1(a(x),s) in order of value
• existentially quantify over x in each to get Qat+1(s)
• conjoin with negation of higher valued partitions
E.g., suppose Q(a(x),s) has partitions:
• p(x,s) ∧ φ1(s) : 10   p(x,s) ∧ φ2(s) : 8
• p(x,s) ∧ φ3(s) : 6   p(x,s) ∧ φ4(s) : 4
Then we have the “pure state” Q-function:
∃x. p(x,s) ∧ φ1(s) : 10
∃x. p(x,s) ∧ φ2(s) ∧ ¬∃x.[p(x,s) ∧ φ1(s)] : 8
∃x. p(x,s) ∧ φ3(s) ∧ ¬∃x.[p(x,s) ∧ (φ1(s) ∨ φ2(s))] : 6
• …
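The sort, quantify and negate construction can be sketched extensionally, modelling each partition formula by the set of (state, instantiation) pairs satisfying it; the states and values below are invented for illustration:

```python
# Intra-action maximization sketch: partitions of Q(a(x), s) are sorted by
# value; each is existentially quantified over x and conjoined with the
# negation of all higher-valued partitions.  Formulas are modelled
# extensionally as sets of (state, action-instantiation) pairs.

def intra_max(partitions):
    """partitions: list of (pairs, value), pairs a set of (s, x).
    Returns a list of (state-set, value): the pure-state Q-function."""
    result, covered = [], set()
    for pairs, value in sorted(partitions, key=lambda p: -p[1]):
        exists_x = {s for (s, _) in pairs}    # existential quantification
        region = exists_x - covered           # negate higher partitions
        if region:
            result.append((region, value))
        covered |= exists_x
    return result

partitions = [({("s1", "x1")}, 10),
              ({("s1", "x2"), ("s2", "x1")}, 8),
              ({("s2", "x2"), ("s3", "x1")}, 6)]
print(intra_max(partitions))
```

Each state lands in the highest-valued region for which some instantiation of x qualifies, which is exactly what maximizing over action arguments means.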
Intra-action Maximization Example
∀s. Qload(s) = q ≡
  [∃t′.On(B,t′,s) ∧ q = 10] ∨
  [∃b,t. ¬Rain(s) ∧ b = B ∧ loc(B,s) = loc(t,s) ∧ ¬∃t′.On(B,t′,s) ∧ q = 9] ∨
  [∃b,t. Rain(s) ∧ b = B ∧ loc(B,s) = loc(t,s) ∧ ¬∃t′.On(B,t′,s) ∧ q = 7] ∨ ...
Inter-action Maximization
Each action type has a “pure state” Q-function
The value function is computed by sorting partitions and conjoining formulae:

Qa: α1 : va1 ; α2 : va2 ; ...
Qb: β1 : vb1 ; β2 : vb2 ; ...
Suppose va1 ≥ vb1 ≥ va2 ≥ vb2 ≥ ... ; then

V:  α1 : va1
    β1 ∧ ¬α1 : vb1
    α2 ∧ ¬α1 ∧ ¬β1 : va2
    β2 ∧ ¬α1 ∧ ¬β1 ∧ ¬α2 : vb2
    ...
FODTR: Summary
Assume logical rep’n of value function Vt(s)
• e.g., V0(s) = R(s) grounds the process
Build logical rep’n of Qt+1(a(x),s) for each a(x)
• standard regression on nature’s actions
• combine using probabilities of nature’s choices
• add reward function, discounting if necessary
Compute Qat+1(s) by intra-action maximization
Compute Vt+1(s) = maxa {Qat+1(s)}
Iterate until convergence
FODTR: Implementation
Implementation does not make procedural distinctions described
• written in terms of logical rewrite rules that exploit logical equivalences: regression to move back states, definition of Q-function, definition of value function
• (incomplete) logical simplification achieved using theorem prover (LeanTAP)
Empirical results are fairly preliminary, but the trend is encouraging
Example Optimal Value Function
(Seven logically described partitions of the optimal value function, with values 10, 5.56, 4.29, 2.53, 1.52, 1.26 and 0, distinguished by conditions over ∃b.In(b,Paris,s), ∃b,t.On(b,t,s), ∃t.At(t,Paris,s), In(b,c,s), At(c,t,s) and Rain(s).)
Benefits of F.O. Regression
Allows standard DP to be applied in large MDPs• abstracts state space (no state enumeration)
• abstracts action space (no action enumeration)
DT Regression fruitful in propositional MDPs• we’ve seen this in SPUDD/SPI
• leverage for: approximate abstraction; decomposition
We’re hopeful that FODTR will exhibit the same gains and more
Possible use in the DTGolog programming paradigm
Function Approximation
Common approach to solving MDPs
• find a functional form f(θ) for the VF that is tractable, e.g., not exponential in the number of variables
• attempt to find parameters θ s.t. f(θ) offers the “best fit” to the “true” VF
Example:• use neural net to approximate VF
inputs: state features; output: value or Q-value• generate samples of “true VF” to train NN
e.g., use dynamics to sample transitions and train on Bellman backups (bootstrap on current approximation given by NN)
Linear Function Approximation
Assume a set of basis functions B = { b1 ... bk }
• each bi : S → ℝ, generally compactly representable
A linear approximator is a linear combination of these basis functions; for some weight vector w :
Several questions:• what is best weight vector w ?
• what is a “good” basis set B ?
• what does this buy us computationally?
V(s) = Σi wi bi(s)
Flexibility of Linear Decomposition
Assume each basis function is compact
• e.g., refers to only a few vars: b1(X,Y), b2(W,Z), b3(A)
Then the VF is compact:
• V(X,Y,W,Z,A) = w1 b1(X,Y) + w2 b2(W,Z) + w3 b3(A)
For a given representation size (10 parameters), we get more value flexibility (32 distinct values) compared to a piecewise constant rep’n
So if we can find decent basis sets (that allow a good fit), this can be more compact
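A quick check of the parameter-count arithmetic; the basis tables below are chosen, as an assumption, so that all weighted sums come out distinct:

```python
# Parameter count vs. value flexibility for a linear, factored VF:
# V(X,Y,W,Z,A) = w1*b1(X,Y) + w2*b2(W,Z) + w3*b3(A), all variables binary.
# 10 parameters (4 + 4 + 2 table entries) can yield 32 distinct values.
from itertools import product

b1 = {xy: v for xy, v in zip(product((0, 1), repeat=2), (0, 1, 2, 3))}
b2 = {wz: v for wz, v in zip(product((0, 1), repeat=2), (0, 4, 8, 12))}
b3 = {0: 0, 1: 16}
w = (1.0, 1.0, 1.0)

params = len(b1) + len(b2) + len(b3)
values = {w[0] * b1[x, y] + w[1] * b2[wv, z] + w[2] * b3[a]
          for x, y, wv, z, a in product((0, 1), repeat=5)}
print(params, len(values))   # 10 32
```

A piecewise constant representation with 10 leaves could express at most 10 distinct values over the same 32 states.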
Linear Approx: Components
Assume basis set B = { b1 ... bk }
• each bi : S → ℝ
• we view each bi as an n-vector
• let A be the n x k matrix [ b1 ... bk ]
Linear VF: V(s) = wi bi(s)
Equivalently: V = Aw• so our approximation of V must lie in subspace
spanned by B
• let B be that subspace
Approximate Value Iteration
We might compute approximate V using value iteration:
• Let V0 = Aw0 for some weight vector w0
• Perform Bellman backups to produce V1 = Aw1; V2 = Aw2; V3 = Aw3; etc...
Unfortunately, even if V0 lies in the subspace spanned by B, L*(V0) = L*(Aw0) generally will not
So we need to find the best approximation to L*(Aw0) in B before we can proceed
Projection
We wish to find a projection of our VF estimates into B minimizing some error criterion
• We’ll use max norm (standard in MDPs)
Given V lying outside B, we want a w s.t:
|| Aw – V ||∞ is minimal
Projection as Linear Program
Finding a w that minimizes || Aw – V ||∞ can be accomplished with a simple LP
Number of variables is small (k+1); but number of constraints is large (2 per state)
• this defeats the purpose of function approximation
• but let’s ignore for the moment
Vars: w1, ..., wk, ε
Minimize: ε
S.T.  V(s) – Aw(s) ≤ ε , ∀s
      Aw(s) – V(s) ≤ ε , ∀s

ε measures the max norm difference between V and the “best fit”
Approximate Value Iteration
Run value iteration; but after each Bellman backup, project result back into subspace B
Choose arbitrary w0 and let V0 = Aw0 ; then iterate:
• Compute Vt = L*(Awt-1)
• Let V̂t = Awt be the projection of Vt into B
Error at each step given by || Awt − L*(Awt-1) ||∞
• final error, convergence not assured
Analog for policy iteration as well
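The backup-then-project loop can be sketched on a tiny explicit MDP (all numbers below are assumed); a brute-force grid search over weights stands in for the LP-based max-norm projection:

```python
# Approximate value iteration with max-norm projection onto span{b1, b2}.
# Single deterministic action, 4 states; all numbers are assumptions.
from itertools import product

S = [0, 1, 2, 3]
GAMMA = 0.9
R = [0.0, 0.0, 1.0, 2.0]
NXT = [1, 2, 3, 3]          # deterministic transitions (assumed)
B = [[1.0, 1.0, 1.0, 1.0],  # b1: constant basis function
     [0.0, 1.0, 2.0, 3.0]]  # b2: "position" basis function

def backup(v):
    """One Bellman backup (single action, so no max over actions)."""
    return [R[s] + GAMMA * v[NXT[s]] for s in S]

def project(v):
    """Max-norm projection by grid search over weights (LP stand-in)."""
    grid = [i * 0.5 for i in range(-10, 50)]
    best_err, best_w = None, None
    for w1, w2 in product(grid, repeat=2):
        err = max(abs(w1 * B[0][s] + w2 * B[1][s] - v[s]) for s in S)
        if best_err is None or err < best_err:
            best_err, best_w = err, (w1, w2)
    w1, w2 = best_w
    return [w1 * B[0][s] + w2 * B[1][s] for s in S], best_err

v = [0.0] * 4
for _ in range(40):
    v, err = project(backup(v))

vx = [0.0] * 4              # exact value function for comparison
for _ in range(400):
    vx = backup(vx)
print([round(x, 2) for x in v], [round(x, 2) for x in vx])
```

The per-step projection error stays small here because the true VF is nearly linear in the "position" feature; as the slide notes, convergence is not assured in general.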
Factored MDPs
Suppose our MDP is represented using DBNs and our reward function is compact
• can we exploit this structure to implement approximate value iteration more effectively?
We’ll see that if our basis functions are “compact”, we can implement AVI without state enumeration (GKP-01)
• we’ll exploit principles we’ve seen in abstraction methods
Assumptions
DBN action representation for each action a
• assume small set Par(X’i)
Reward is the sum of components
• R(X) = R1(W1) + R2(W2) + ...
• each Wi ⊆ X is a small subset
Each basis function bi refers to a small subset of vars Ci ⊆ X
• bi(X) = bi(Ci)
State space defined by variables X1 , ... , Xn
(Figure: a DBN over X1, X2, X3 with arcs to X’1, X’2, X’3, and reward R(X1X2X3) = R1(X1X2) + R2(X3).)
Factored AVI
AVI: repeatedly do Bellman backups, projections
With factored MDP and basis representations:
• Aw and V are functions of variables X1 , ... , Xn
• Aw is compactly representable:
  Aw = w1 b1(C1) + ... + wk bk(Ck), each Ci ⊆ X a small subset
• So V̂t = Awt (the projection of Vt into B) is compact
So we need to ensure that:
• each Vt (nonprojected Bellman backup) is compact
• we can perform the projection effectively
Compactness of Bellman Backup
Bellman backup: Vt+1(s) = maxa Qt+1(s,a)
Q-function:

Qt+1(s,a) = R(s) + γ Σx′ Pr(x′ | x, a) Vt(x′)
 = [R1(W1) + R2(W2) + ...] + γ Σx′ Pr(x′ | x, a) [ w1 b1(c′1) + ... + wk bk(c′k) ]
 = [R1(W1) + R2(W2) + ...]
   + γ [ w1 Σc′1 Pr(c′1 | Par(C′1)) b1(c′1) + ... + wk Σc′k Pr(c′k | Par(C′k)) bk(c′k) ]
 = [R1(W1) + R2(W2) + ...] + γ [ w1 f1(Par(C′1)) + ... + wk fk(Par(C′k)) ]
Compactness of Bellman Backup
So Q-functions are (weighted) sums of a small set of compact functions:
• the rewards Ri(Wi)
• the functions fi(Par(Ci)) – each of which can be computed effectively (sum out only vars in Ci )
• note: backup of each bi is decision-theoretic regression
Maximizing over these to get the VF is straightforward
• Thus we obtain a compact rep’n of Vt = L*(Awt-1)
Problem: these new functions don’t belong to the set of basis functions
• need to project Vt into B to obtain Vt
Factored Projection
We have Vt and want to find weights wt that minimize ||Awt – Vt ||
• We know Vt is the sum of compact functions
• We know Awt is the sum of compact functions
• Thus, their difference is the sum of compact functions
So we wish to minimize || Σj fj(Zj ; wt) ||∞
• each fj depends on small set of vars Zj and possibly some of the weights wt
Assume weights wt are fixed for now
• then || Σj fj(Zj ; wt) ||∞ = max { Σj fj(zj ; wt) : x ∈ X }
Variable Elimination
Max of a sum of compact functions: variable elimination
Complexity determined by size of intermediate factors (and elim ordering)
max X1X2X3X4X5X6 { f1(X1X2X3) + f2(X3X4) +
f3(X4X5X6) }
Elim X1: Replace f1(X1X2X3) with
f4(X2X3) = max X1 { f1(X1X2X3) }
Elim X3: Replace f2(X3X4) and f4(X2X3) with
f5(X2X4) = max X3 { f2(X3X4) + f4(X2X3) }
etc. (eliminating each variable in turn until maximum value is computed over entire state space)
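The elimination steps above can be executed directly. The factor tables below instantiate f1, f2, f3 with assumed numeric entries, and the VE result is compared against brute-force enumeration:

```python
# Variable elimination for the max of a sum of compact functions, on the
# example max over X1..X6 of f1(X1X2X3) + f2(X3X4) + f3(X4X5X6).
# Factors are dicts keyed by 0/1 assignment tuples; entries are assumed.
from itertools import product

def table(nvars, fn):
    """Tabulate fn over all 0/1 assignments to nvars variables."""
    return {a: fn(*a) for a in product((0, 1), repeat=nvars)}

factors = [
    (("X1", "X2", "X3"), table(3, lambda a, b, c: 3*a + 2*b - c)),
    (("X3", "X4"),       table(2, lambda c, d: 4*c*d)),
    (("X4", "X5", "X6"), table(3, lambda d, e, f: d - 2*e + f)),
]

def eliminate(var, factors):
    """Max out `var`: combine every factor mentioning it into one."""
    touching = [f for f in factors if var in f[0]]
    rest = [f for f in factors if var not in f[0]]
    new_vars = tuple(sorted({v for vs, _ in touching for v in vs} - {var}))
    new_table = {}
    for a in product((0, 1), repeat=len(new_vars)):
        asn = dict(zip(new_vars, a))
        vals = []
        for x in (0, 1):
            asn[var] = x
            vals.append(sum(t[tuple(asn[v] for v in vs)]
                            for vs, t in touching))
        new_table[a] = max(vals)
    return rest + [(new_vars, new_table)]

for v in ("X1", "X2", "X3", "X4", "X5", "X6"):
    factors = eliminate(v, factors)
best = sum(t[()] for _, t in factors)

# brute-force check over all 64 assignments
brute = max(3*a + 2*b - c + 4*c*d + d - 2*e + f
            for a, b, c, d, e, f in product((0, 1), repeat=6))
print(best, brute)   # 10 10
```

The largest intermediate factor here has two variables, so the cost is far below the 64-entry enumeration; that factor size is exactly the "complexity of VE" the factored LP construction inherits.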
Factored Projection: Factored LP
VE works for fixed weights
• but wt is what we want to optimize
• Recall LP for optimizing weights:
V(s) – Aw(s) ≤ ε , ∀s
• equiv. to max { V(s) – Aw(s) : s ∈ S } ≤ ε
• equiv. to max { Σj fj(zj ; w) : x ∈ X } ≤ ε
Vars: w1, ..., wk, ε
Minimize: ε
S.T.  V(s) – Aw(s) ≤ ε , ∀s ;  Aw(s) – V(s) ≤ ε , ∀s
Factored Projection: Factored LP
The constraints: max { Σj fj(zj ; w) : x ∈ X } ≤ ε
• exponentially many
• but we can “simulate” VE to reduce the expression of these constraints in the LP
• the number of constraints (and new variables) will be bounded by the “complexity of VE”
Factored Projection: Factored LP
Choose an elimination ordering for computing max { Σj fj(zj ; w) : x ∈ X }
• note: weight vector w is unknown
• but structure of VE remains the same (actual numbers can’t be computed)
For each factor (initial and intermediate) e(Z) • create a new variable u(e,z1,...,zn) for each
instantiation z1,...,zn of the domain Z
• number of new variables exponential in size (#vars) of factor
Factored Projection: Factored LP
For each initial factor fj(Zj ; w) , pose constraint:
• though the w are vars, fj(Zj ; w) linear in w
u(fj, z1,...,zn) = fj(z1,...,zn ; w) , ∀z1,...,zn
Factored Projection: Factored LP
For the elim step where Xk is removed, let
• gk(Zk) = maxXk gk1(Zk1) + gk2(Zk2) + ...
• here each gkj a factor including Xk (and is removed)
For each intrm factor gk(Zk) , pose constraint:
• force u-values for each factor to be at least max over Xk values
• number of constraints: size of factor * |Xk|
u(gk, z1,...,zn) ≥ gk1(z1,...,zn1) + gk2(z1,...,zn2) + ... , ∀xk, z1,...,zn
Factored Projection: Factored LP
Finally, pose the constraint ufinal() ≤ ε
This ensures:
max { Σj fj(zj ; w) : x ∈ X } = max { V(s) – Aw(s) : s ∈ S } ≤ ε
Note: the objective function in the LP minimizes ε
• so constraints are satisfied at the max values
In this way
• we optimize weights at each iteration of value iteration
• but we never enumerate the state space
• size of LPs bounded by total factor size in VE
Some Results [GKP-01]
Basis sets considered:
• characteristic functions over single variables
• characteristic functions over pairs of variables
Some Results [GKP-01]
Computation Time
Some Results [GKP-01]
Computation Time
Some Results [GKP-01]
Relative error wrt optimal VF (small problems)
Linear Approximation: Summary
Results seem encouraging
• 40 variable problems solved in a few hours
• simple basis sets seem to work well for “network” problems
Open issues:
• are tighter (a priori) error bounds possible?
• better computational performance?
• where do basis functions come from?
  what impact can a good/poor basis set have on solution quality?
• are there “nonlinear” generalizations?
An LP Formulation
AVI requires generating a large number of constraints (and solving multiple LPs/cost nets)
But a normal MDP can be solved by an LP directly:

Vars: V(s), ∀s
Minimize: Σs V(s)
S.T.  V(s) ≥ (LaV)(s) , ∀a,s

• (LaV)(s) is linear in the values/vars V(s)
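The LP characterization can be verified numerically on a small assumed MDP: V* (computed here by value iteration) is feasible for the constraints, and lowering any component violates one, which is the property the minimization objective exploits:

```python
# The optimal value function is the componentwise-smallest V satisfying
# V(s) >= (LaV)(s) for all a, s.  MDP numbers below are assumptions.

S = [0, 1]
ACTS = ["stay", "go"]
GAMMA = 0.9
P = {("stay", 0): [1.0, 0.0], ("stay", 1): [0.0, 1.0],
     ("go", 0):   [0.2, 0.8], ("go", 1):   [0.8, 0.2]}
R = {("stay", 0): 0.0, ("stay", 1): 1.0, ("go", 0): 0.5, ("go", 1): 0.0}

def La(v, a, s):
    return R[a, s] + GAMMA * sum(P[a, s][sp] * v[sp] for sp in S)

v = [0.0, 0.0]
for _ in range(2000):
    v = [max(La(v, a, s) for a in ACTS) for s in S]

# feasibility: V*(s) >= (LaV*)(s) for every a, s (up to tolerance)
assert all(v[s] >= La(v, a, s) - 1e-6 for a in ACTS for s in S)
# minimality: lowering any component violates some constraint
for s in S:
    w = list(v); w[s] -= 0.01
    assert any(w[t] < La(w, a, t) for a in ACTS for t in S)
print([round(x, 3) for x in v])
```

Since every feasible V dominates V* componentwise, minimizing Σs V(s) over the feasible region recovers V* exactly.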
Using Structure in LP Formulation
These constraints can be formulated without enumerating state space using cost network as before [SchPat-00]
• by not iterating, great computational savings are possible: a couple of orders of magnitude on “networks”
• techniques like constraint generation offer even more substantial savings
Good Basis Sets
A good basis set should
• be reasonably small and well-factored
• be such that a good approximation to V* lies in the subspace B
The latter condition is hard to guarantee
Possible ways to construct basis sets:
• use prior knowledge of domain structure, e.g., problem decomposition
• search over candidate basis sets: e.g., a sol’n using a poor approximation might guide the search for an improved basis
Parallel Problem Decomposition
Decompose MDP into parallel processes
• product/join decomp.
• each refers to a subset of relevant variables
• actions affect each
Key issues:
• how to decompose?
• how to merge sol’ns?
Contrast serial decomposition
• macros [Sutton95,Parr98]
(Figure: an MDP decomposed into parallel subprocesses MDP1, MDP2, MDP3.)
Generating SubMDPs
Components of additive reward: subobjectives
• combinatorial blowup often arises due to many competing objectives
• e.g., logistics, process planning, order scheduling • [BouBrafmanGeib97, SinghCohn97, MHKPKDB98]
Create subMDPs for subobjectives
• use abstraction methods discussed earlier to find
subMDP relevant to each subobjective
• solve using standard methods, DTR, etc.
Generating SubMDPs
Dynamic Bayes Net over Variable Set
Generating SubMDPs
Green SubMDP (subset of variables)
Generating SubMDPs
Red SubMDP (subset of variables)
Composing Solutions
Existing methods piece together solutions in an
online fashion; for example:1. Search-based composition [BouBrafmanGeib97]:
VFs used in heuristic search
partial ordering of actions used to merge
2. Markov Task Decomposition [MHKPKDB98]:
has the ability to deal with large action spaces
MDPs with thousands of variables solvable
Search-based Composition
Online action selection: standard expectimax search [DB94,97,BBS95,KS95,BG98,KMN99,...]
(Figure: an expectimax tree rooted at s1: a Max node over actions a1 and a2, Exp nodes over outcome probabilities p1, ..., p4, and successor states s2, ..., s5.)
Search-based Composition
Online action selection: standard expectimax search [DB94,97,BBS95,KS95,BG98,KMN99,...]
Decomposed VFs viewed as heuristics (reduce requisite search depth for given error)
E.g., given subVFs f1,...fk
(Figure: the same expectimax tree, with the decomposed VF bounds applied at the leaves.)
V(s) <= f1(s) + f2(s) +... + fk(s)
V(s) >= max { f1(s), f2(s), ... fk(s) }
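A depth-bounded expectimax sketch (the MDP numbers are assumptions): with a zero heuristic at the leaves, depth-d search reproduces d steps of value iteration exactly; a decomposed VF would replace the heuristic to cut the requisite depth:

```python
# Depth-bounded expectimax for online action selection.  At depth 0 a
# heuristic (e.g. a decomposed value function) evaluates the state.

GAMMA = 0.9
S = [0, 1]
A = ["a1", "a2"]
P = {("a1", 0): [0.9, 0.1], ("a1", 1): [0.1, 0.9],
     ("a2", 0): [0.5, 0.5], ("a2", 1): [0.5, 0.5]}
R = {0: 0.0, 1: 1.0}

def expectimax(s, depth, heuristic=lambda s: 0.0):
    if depth == 0:
        return heuristic(s)
    best = None
    for a in A:                       # Max node over actions
        exp = sum(P[a, s][sp] * expectimax(sp, depth - 1, heuristic)
                  for sp in S)        # Exp node over outcomes
        val = R[s] + GAMMA * exp
        best = val if best is None else max(best, val)
    return best

# with a zero heuristic, depth-d expectimax equals d-step value iteration
v = [0.0, 0.0]
for _ in range(4):
    v = [max(R[s] + GAMMA * sum(P[a, s][sp] * v[sp] for sp in S)
             for a in A) for s in S]
assert all(abs(expectimax(s, 4) - v[s]) < 1e-9 for s in S)
print([round(expectimax(s, 4), 4) for s in S])
```

With the decomposed-VF bounds above as the heuristic, the search needs a much shallower horizon to achieve a given error, which is exactly how the subMDP solutions are exploited online.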
Offline Composition
These subMDP solutions can be “composed” by treating the subMDP VFs as a basis set
The approx. VF is a linear combination of the subVFs
Some preliminary results [Patrascu et al. 02] suggest this technique can work well
• for decomposable MDPs, subVFs offer better solution quality than simple characteristic functions
• often piecewise linear combinations work better than linear combinations [Poupart et al. 02]
Wrap Up
We’ve seen a number of ways in which logical representations and computational methods can help make the solution of stochastic decision processes more tractable
These ideas sit at the interface of the knowledge representation, operations research, reasoning under uncertainty, and machine learning communities
• this interface offers a wealth of interesting and practically important research ideas
Other Techniques
Many more techniques being used to tackle the tractability of solving MDPs
• other function approximation methods
• sampling and simulation methods
• direct search in policy space
• online search techniques/heuristic generation
• reachability analysis
• hierarchical and program structure
Extending the Model
Many interesting extensions of the basic (finite, fully observable) model are being studied
Partially observable MDPs
• many of the techniques discussed have been applied to POMDPs
Continuous/hybrid state and action spaces
Programming as partial policy specification
Multiagent and game-theoretic models
References
C. Boutilier, T. Dean, S. Hanks, Decision Theoretic Planning: Structural Assumptions and Computational Leverage, Journal of Artif. Intelligence Research 11:1-94, 1999.
C. Boutilier, R. Dearden, M. Goldszmidt, Stochastic Dynamic Programming with Factored Representations, Artif. Intelligence 121:49-107, 2000.
R. Bahar, et al., Algebraic Decision Diagrams and their Applications, Int’l Conf. on CAD, pp.188-191, 1993.
J. Hoey, et al., SPUDD: Stochastic Planning using Decision Diagrams, Conf. on Uncertainty in AI, Stockholm, pp.279-288, 1999.
R. St-Aubin, J. Hoey, C. Boutilier, APRICODD: Approximate Policy Construction using Decision Diagrams, Advances in Neural Info. Processing Systems 13, Denver, pp.1089-1095, 2000.
C. Boutilier, R. Dearden, Approximating Value Trees in Structured Dynamic Programming, Int’l Conf. on Machine Learning, Bari, pp.54-62, 1996.
References (con’t)
C. Boutilier, R. Reiter, B. Price, Symbolic Dynamic Programming for First-order MDPs, Int’l Joint Conf. on AI, Seattle, pp.690-697, 2001.
C. Boutilier, R. Reiter, M. Soutchanski, S. Thrun, Decision-Theoretic, High-level Agent Programming in the Situation Calculus, AAAI-00, Austin, pp.355-362, 2000.
R. Reiter, Knowledge in Action: Logical Foundations for Describing and Implementing Dynamical Systems, MIT Press, 2001.
References (con’t)
C. Guestrin, D. Koller, R. Parr, Max-norm projections for factored MDPs, Int’l Joint Conf. on AI, Seattle, pp.673-680, 2001.
C. Guestrin, D. Koller, R. Parr, Multiagent planning with factored MDPs, Advances in Neural Info. Proc. Sys. 14, Vancouver, 2001.
D. Schuurmans, R. Patrascu, Direct value approximation for factored MDPs, Advances in Neural Info. Proc. Sys. 14, Vancouver, 2001.
R. Patrascu, et al., Greedy linear value approximation for factored MDPs, AAAI-02, Edmonton, 2002.
P. Poupart, et al., Piecewise linear value approximation for factored MDPs, AAAI-02, Edmonton, 2002.
J. Tsitsiklis, B. Van Roy, Feature-based methods for large scale dynamic programming, Machine Learning 22:59-94, 1996.
References (con’t)
C. Boutilier, R. Brafman, C. Geib, Prioritized goal decomposition of Markov decision processes: Toward a synthesis of classical and decision theoretic planning, Int’l Joint Conf. on AI, Nagoya, pp.1156-1162, 1997.
N. Meuleau, et al., Solving very large weakly coupled Markov decision processes, AAAI-98, Madison, pp.165-172, 1998.
S. Singh, D. Cohn, How to dynamically merge Markov decision processes, Advances in Neural Info. Processing Systems 10, Denver, pp.1057-1063, 1998.