Efficient Solution Algorithms for Factored MDPs
by Carlos Guestrin, Daphne Koller, Ronald Parr, Shobha Venkataraman
Presented by Arkady Epshteyn
Problem with MDPs
• Exponential number of states
• Example: Sysadmin Problem
• 4 computers: M1, M2, M3, M4
• Each machine is working or has failed.
• State space: 2^4 = 16 states
• 8 actions: whether to reboot each machine or not
• Reward: depends on the number of working machines
Factored Representation
• Transition model: DBN
• Reward model: $R(x) = \sum_{j=1}^{k} r_j(x)$
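A minimal sketch of a factored reward, R(x) written as a sum of local terms r_j(x). The local rewards here are an assumption made for illustration (each r_j pays 1 when machine j is working), chosen so that the total reward counts the working machines.

```python
from itertools import product

# Hypothetical local rewards: r_j reads only machine j (1.0 if it is working).
def make_local_reward(j):
    return lambda x: 1.0 if x[j] else 0.0

local_rewards = [make_local_reward(j) for j in range(4)]

def reward(x):
    """R(x) = sum_j r_j(x); every term looks at a single state variable."""
    return sum(r(x) for r in local_rewards)

# The joint state space is still exponential: 2^4 assignments for 4 machines.
all_states = list(product([False, True], repeat=4))
example_reward = reward((True, True, False, True))
```

The reward is stored as 4 small functions rather than as a table with one entry per joint state.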
Approximate Value Function
• Linear value function: $V(x) = \sum_{j=1}^{k} w_j h_j(x)$
• Basis functions:
h_i(X_i = true) = 1
h_i(X_i = false) = 0
h_0 = 1
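A small sketch of the linear value function with the indicator basis listed above: a constant h_0 plus one indicator per machine. The weight values are hypothetical and only serve to make the arithmetic visible.

```python
def basis(i, x):
    """h_0 = 1; for i >= 1, h_i(x) = 1 if machine i (1-based) is working, else 0."""
    if i == 0:
        return 1.0
    return 1.0 if x[i - 1] else 0.0

def value(w, x):
    """V(x) = sum_i w_i * h_i(x)."""
    return sum(w_i * basis(i, x) for i, w_i in enumerate(w))

w = [0.5, 1.0, 2.0, 3.0, 4.0]      # hypothetical weights w_0 .. w_4
x = (True, False, True, False)     # machines 1 and 3 working
v = value(w, x)                    # w_0 + w_1 + w_3
```

With this basis the value function has 5 parameters instead of one value per joint state.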
Markov Decision Processes
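For fixed policy π (γ is the discount factor):
$$V_\pi(x) = R_{\pi(x)}(x) + \gamma \sum_{x'} P_{\pi(x)}(x' \mid x)\, V_\pi(x')$$
The optimal value function V*:
$$V^*(x) = \max_a \Big[ R_a(x) + \gamma \sum_{x'} P_a(x' \mid x)\, V^*(x') \Big]$$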
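Solving MDP, Method 1: Policy Iteration
• Value determination:
$$V_{\pi^{(t)}}(x) = R_{\pi^{(t)}(x)}(x) + \gamma \sum_{x'} P_{\pi^{(t)}(x)}(x' \mid x)\, V_{\pi^{(t)}}(x')$$
• Policy improvement:
$$\pi^{(t+1)}(x) = \arg\max_a \Big[ R_a(x) + \gamma \sum_{x'} P_a(x' \mid x)\, V_{\pi^{(t)}}(x') \Big]$$
• Polynomial in the number of states N
• Exponential in the number of variables K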
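The two steps of policy iteration can be sketched on a tiny tabular MDP. Everything numeric below is an assumption made for illustration: a single machine (state 0 = failed, 1 = working), two actions (0 = wait, 1 = reboot), a discount of 0.9, and made-up transition and reward tables. Value determination is done by repeated sweeps rather than by solving the linear system exactly.

```python
GAMMA = 0.9                      # assumed discount factor
STATES = [0, 1]                  # 0 = failed, 1 = working
ACTIONS = [0, 1]                 # 0 = wait, 1 = reboot
# Hypothetical model: P[a][x][x2] = P_a(x2 | x), R[a][x] = R_a(x)
P = {0: [[1.0, 0.0], [0.1, 0.9]],
     1: [[0.0, 1.0], [0.0, 1.0]]}
R = {0: [0.0, 1.0], 1: [-0.5, 0.5]}

def backup(x, a, V):
    """R_a(x) + gamma * sum_x' P_a(x' | x) V(x')."""
    return R[a][x] + GAMMA * sum(P[a][x][x2] * V[x2] for x2 in STATES)

def evaluate(policy, tol=1e-10):
    """Value determination for a fixed policy, by iterating the backup."""
    V = [0.0 for _ in STATES]
    while True:
        new_V = [backup(x, policy[x], V) for x in STATES]
        if max(abs(n - o) for n, o in zip(new_V, V)) < tol:
            return new_V
        V = new_V

def policy_iteration():
    policy = [0 for _ in STATES]
    while True:
        V = evaluate(policy)
        # Policy improvement: act greedily with respect to V.
        improved = [max(ACTIONS, key=lambda a: backup(x, a, V)) for x in STATES]
        if improved == policy:
            return policy, V
        policy = improved

policy, V = policy_iteration()
```

Both steps loop over every state, which is why the cost is polynomial in N but exponential in the number of state variables.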
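Solving MDP, Method 2: Linear Programming
Variables: $V_1, \ldots, V_N$
Minimize: $\sum_{x_i} \alpha(x_i)\, V_i$, where $\alpha(x_i) > 0$ for all $x_i$
Subject to: $V_i \ge R_a(x_i) + \gamma \sum_j P_a(x_j \mid x_i)\, V_j \quad \forall x_i, a$
Intuition: compare with the fixed point of V(x):
$$V^*(x) = \max_a \Big[ R_a(x) + \gamma \sum_{x'} P_a(x' \mid x)\, V^*(x') \Big]$$
• Polynomial in the number of states N
• Exponential in the number of variables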
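The link between the LP constraints and the fixed point can be checked numerically without an LP solver. The sketch below computes V* by value iteration on a hypothetical two-state MDP (all numbers are assumptions) and then verifies that V* satisfies every constraint V(x) ≥ R_a(x) + γ Σ P_a(x'|x) V(x'), with at least one constraint tight in each state.

```python
GAMMA = 0.9                      # assumed discount factor
STATES = [0, 1]                  # 0 = failed, 1 = working
ACTIONS = [0, 1]                 # 0 = wait, 1 = reboot
# Hypothetical model: P[a][x][x2] = P_a(x2 | x), R[a][x] = R_a(x)
P = {0: [[1.0, 0.0], [0.1, 0.9]],
     1: [[0.0, 1.0], [0.0, 1.0]]}
R = {0: [0.0, 1.0], 1: [-0.5, 0.5]}

def backup(x, a, V):
    return R[a][x] + GAMMA * sum(P[a][x][x2] * V[x2] for x2 in STATES)

# Value iteration: repeatedly apply the max-backup until it stops changing.
V = [0.0, 0.0]
for _ in range(2000):
    V = [max(backup(x, a, V) for a in ACTIONS) for x in STATES]

# One LP constraint per (state, action) pair; slack = left side minus right side.
slack = {(x, a): V[x] - backup(x, a, V) for x in STATES for a in ACTIONS}
feasible = all(s >= -1e-9 for s in slack.values())
tight = all(min(abs(slack[(x, a)]) for a in ACTIONS) < 1e-9 for x in STATES)
```

The dictionary `slack` has one entry per state-action pair, which is the constraint count that grows exponentially with the number of variables.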
Value Function Approximation
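Variables: $w_1, \ldots, w_k$
Minimize: $\sum_x \alpha(x) \sum_i w_i h_i(x)$, where $\alpha(x) > 0$ for all $x$
Subject to: $\sum_i w_i h_i(x) \ge R_a(x) + \gamma \sum_{x'} P_a(x' \mid x) \sum_i w_i h_i(x') \quad \forall x, a$
Compare with the exact LP:
Variables: $V_1, \ldots, V_N$
Minimize: $\sum_{x_i} \alpha(x_i)\, V_i$, where $\alpha(x_i) > 0$ for all $x_i$
Subject to: $V_i \ge R_a(x_i) + \gamma \sum_j P_a(x_j \mid x_i)\, V_j \quad \forall x_i, a$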
Objective function
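• Objective function is polynomial in the number of basis functions:
$$\sum_x \alpha(x) \sum_i w_i h_i(x) = \sum_i w_i \sum_x \alpha(x)\, h_i(x) = \sum_i w_i \sum_{c_i \in C_i} \alpha(c_i)\, h_i(c_i)$$
where $c_i$ ranges over the assignments to the variables that $h_i$ depends on, and $\alpha(c_i) = \sum_{x \text{ consistent with } c_i} \alpha(x)$.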
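A sketch of why the objective coefficient of one weight is cheap to compute: the sum over all states collapses to a sum over assignments of the few variables the basis function reads. The product-form weights α and the basis table below are assumptions made for illustration; with a product-form α the marginal α(c) is simply a product over the variables in the scope.

```python
from itertools import product

N = 4
BOOL = [False, True]
# Hypothetical product-form state relevance weights: alpha(x) = prod_i alpha_i[i][x_i]
alpha_i = [{False: 0.3, True: 0.7}, {False: 0.5, True: 0.5},
           {False: 0.2, True: 0.8}, {False: 0.6, True: 0.4}]

scope = (0, 2)   # the basis function reads only x1 and x3 (0-based indices)
h_table = {(False, False): 0.0, (False, True): 1.0,
           (True, False): 2.0, (True, True): 5.0}   # hypothetical values

def alpha(x):
    p = 1.0
    for i, xi in enumerate(x):
        p *= alpha_i[i][xi]
    return p

def h(x):
    return h_table[tuple(x[i] for i in scope)]

# Exponential version: sum over all 2^N joint states.
brute = sum(alpha(x) * h(x) for x in product(BOOL, repeat=N))

def alpha_marginal(c):
    """alpha(c): marginal weight of an assignment c to the scope variables."""
    p = 1.0
    for i, ci in zip(scope, c):
        p *= alpha_i[i][ci]
    return p

# Factored version: sum over the 2^|scope| assignments only.
factored = sum(alpha_marginal(c) * h_table[c]
               for c in product(BOOL, repeat=len(scope)))
```

The two sums agree, but the factored one touches 4 assignments instead of 16.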
Each Constraint: Backprojection
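$$\sum_{x'} P_a(x' \mid x) \sum_i w_i h_i(x') = \sum_i w_i \sum_{x'} P_a(x' \mid x)\, h_i(x')$$
Each inner sum is an expectation that depends only on the DBN parents of the variables $c_i'$ in the scope of $h_i$:
$$E[h_i(x') \mid x] = E[h_i(c_i') \mid x] = E[h_i(c_i') \mid pa(c_i')]$$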
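A sketch of backprojection on a small DBN, comparing the full sum over next states with the local expectation that reads only the parents of the basis function's variable. The ring structure and the conditional probabilities are assumptions made for illustration, not the model from the slides.

```python
from itertools import product

N = 4
BOOL = [False, True]

def parents(i):
    # Hypothetical ring DBN: X_i' depends on X_i and its left neighbour.
    return (i, (i - 1) % N)

# Hypothetical CPD: P(X_i' = working | own state, neighbour state)
P_WORK = {(True, True): 0.9, (True, False): 0.6,
          (False, True): 0.1, (False, False): 0.05}

def p_next(i, value, x):
    p = P_WORK[tuple(x[j] for j in parents(i))]
    return p if value else 1.0 - p

def h(i, x_next):
    """Indicator basis: 1 if machine i is working in the next state."""
    return 1.0 if x_next[i] else 0.0

def backproject_brute(i, x):
    """sum_x' P(x' | x) h_i(x'), enumerating all 2^N next states."""
    total = 0.0
    for x_next in product(BOOL, repeat=N):
        prob = 1.0
        for j in range(N):
            prob *= p_next(j, x_next[j], x)
        total += prob * h(i, x_next)
    return total

def backproject_factored(i, x):
    """E[h_i(X_i') | parents of X_i']: sums over the two values of X_i' only."""
    return sum(p_next(i, v, x) * (1.0 if v else 0.0) for v in BOOL)

max_gap = max(abs(backproject_brute(i, x) - backproject_factored(i, x))
              for i in range(N) for x in product(BOOL, repeat=N))
```

The factored version never enumerates joint next states, and its result depends only on the two parent variables.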
Representing Exponentially Many Constraints
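$$\sum_i w_i h_i(x) \ge R_a(x) + \gamma \sum_{x'} P_a(x' \mid x) \sum_i w_i h_i(x') \quad \forall x, a$$
$$\Leftrightarrow\quad 0 \ge \sum_i w_i \Big[ \gamma \sum_{x'} P_a(x' \mid x)\, h_i(x') - h_i(x) \Big] + R_a(x) \quad \forall x, a$$
$$\Leftrightarrow\quad 0 \ge \max_x \Big\{ \sum_i w_i \Big[ \gamma \sum_{x'} P_a(x' \mid x)\, h_i(x') - h_i(x) \Big] + R_a(x) \Big\} \quad \forall a$$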
Restricted Domain
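$$0 \ge \max_x \Big\{ \sum_i w_i \Big[ \gamma \sum_{x'} P_a(x' \mid x)\, h_i(x') - h_i(x) \Big] + R_a(x) \Big\} \quad \forall a$$
$$= \max_x \Big\{ \sum_i w_i \big[ \gamma\, g_i^a(x) - h_i(x) \big] + R_a(x) \Big\}$$
$$= \max_x \Big\{ \sum_i w_i f_i(x) + \sum_j r_j(x) \Big\}$$
Each term has a restricted domain:
1. Backprojection $g_i^a(x) = \sum_{x'} P_a(x' \mid x)\, h_i(x')$: depends on few variables
2. Basis function $h_i(x)$
3. Reward function $R_a(x) = \sum_j r_j(x)$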
Variable Elimination
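$$\max_x \Big\{ \sum_i w_i f_i(x) + \sum_j r_j(x) \Big\}$$
$$= \max_{x_1,x_2,x_3,x_4} \big[ w_1 f_1(x_1,x_2) + w_2 f_2(x_1,x_3) + r_1(x_2,x_4) + r_2(x_3,x_4) \big]$$
$$= \max_{x_1,x_2,x_3} \big[ w_1 f_1(x_1,x_2) + w_2 f_2(x_1,x_3) + \max_{x_4} [ r_1(x_2,x_4) + r_2(x_3,x_4) ] \big]$$
$$= \max_{x_1,x_2,x_3} \big[ w_1 f_1(x_1,x_2) + w_2 f_2(x_1,x_3) + e_1(x_2,x_3) \big]$$
where $e_1(x_2,x_3) = \max_{x_4} [ r_1(x_2,x_4) + r_2(x_3,x_4) ]$
- similar to Bayesian Networks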
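The elimination step can be sketched directly: maximize out x4 from the two functions that mention it, store the result as a new function e1(x2, x3), and then maximize over the remaining variables. The function tables and weights below are hypothetical; only the scopes (f1 over x1,x2; f2 over x1,x3; r1 over x2,x4; r2 over x3,x4) follow the slide.

```python
from itertools import product

B = [0, 1]
w1, w2 = 2.0, -1.0                                        # hypothetical weights
f1 = {(a, b): float(a + 2 * b) for a in B for b in B}     # f1(x1, x2)
f2 = {(a, c): float(3 * a * c) for a in B for c in B}     # f2(x1, x3)
r1 = {(b, d): float(b != d) for b in B for d in B}        # r1(x2, x4)
r2 = {(c, d): float(2 * c * d) for c in B for d in B}     # r2(x3, x4)

# Brute force: maximize over all 2^4 joint assignments.
brute = max(w1 * f1[x1, x2] + w2 * f2[x1, x3] + r1[x2, x4] + r2[x3, x4]
            for x1, x2, x3, x4 in product(B, repeat=4))

# Eliminate x4: only r1 and r2 mention it, so push the max inside.
e1 = {(x2, x3): max(r1[x2, x4] + r2[x3, x4] for x4 in B)
      for x2 in B for x3 in B}

# Maximize the remaining expression, which no longer mentions x4.
eliminated = max(w1 * f1[x1, x2] + w2 * f2[x1, x3] + e1[x2, x3]
                 for x1, x2, x3 in product(B, repeat=3))
```

The intermediate table `e1` has one entry per assignment of (x2, x3), so its size depends on the scope of the eliminated functions, not on the full state space.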
Maximization as Linear Constraints
$$e_1(x_2,x_3) = \max_{x_4} \big[ r_1(x_2,x_4) + r_2(x_3,x_4) \big]$$
Equivalent to constraints, one for every assignment to $(x_2, x_3, x_4)$:
$$e_1(x_2,x_3) \ge r_1(x_2,x_4) + r_2(x_3,x_4)$$
...
• Exponential in the size of each function's domain, not the number of states
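A sketch of how one maximization turns into linear constraints: each assignment to (x2, x3, x4) contributes one lower bound on the LP variable for e1(x2, x3). The tables for r1 and r2 are hypothetical. The smallest values of e1 that satisfy every bound coincide with the explicit maximum over x4.

```python
from itertools import product

B = [0, 1]
r1 = {(b, d): float(b != d) for b in B for d in B}        # hypothetical r1(x2, x4)
r2 = {(c, d): float(2 * c * d) for c in B for d in B}     # hypothetical r2(x3, x4)

# One constraint "e1[x2, x3] >= bound" per assignment to (x2, x3, x4).
constraints = [((x2, x3), r1[x2, x4] + r2[x3, x4])
               for x2, x3, x4 in product(B, repeat=3)]

# Smallest e1 satisfying all lower bounds: the tightest bound per (x2, x3).
e1 = {}
for key, bound in constraints:
    e1[key] = max(e1.get(key, float("-inf")), bound)

# Direct definition, for comparison.
e1_direct = {(x2, x3): max(r1[x2, x4] + r2[x3, x4] for x4 in B)
             for x2 in B for x3 in B}
```

For binary variables this yields 8 constraints over 4 new LP variables, independent of how many other state variables the problem has.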
Factored LP: Scaling
Rule-based Representation
Approximate Value Function
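$$V(x) = \sum_{j=1}^{k} w_j h_j(x) = \sum_{j=1}^{k} w_j h_j(x_1,x_2,x_3,x_4) = \sum_{j=1}^{k} w_j \sum_{\mathrm{Rule}_i \in h_j} \mathrm{Rule}_i(x_1,x_2,x_3,x_4)$$
h1 as a tree: the root splits on x1; one branch is a leaf with value 0, the other branch splits on x3 into leaves 5 and 0.6.
h1 as rules (each rule fixes one truth assignment of the listed variables):
Rule_1: x1 : 0
Rule_2: x1, x3 : 5
Rule_3: x1, x3 : 0.6
Notice: compact representation (2/4 variables, 3/16 rules)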
Summing Over Rules
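$$V(x) = \sum_{j=1}^{k} w_j \sum_{\mathrm{Rule}_i \in h_j} \mathrm{Rule}_i(x_1,x_2,x_3,x_4)$$
h1(x): the root splits on x1; one branch is the leaf u1, the other splits on x3 into leaves u2 and u3.
h2(x): the root splits on x2; one branch is the leaf u4, the other splits on x1 into leaves u5 and u6.
h1(x) + h2(x): the root splits on x2.
  Branch where h2 = u4: split on x1 into the leaf u1+u4 and a split on x3 with leaves u2+u4 and u3+u4.
  Other branch: split on x1 into the leaf u5+u1 and a split on x3 with leaves u2+u6 and u3+u6.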
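A sketch of summing two rule-based functions: every pair of rules with compatible contexts produces one rule whose context is the union and whose value is the sum. The truth values assigned to each branch and the numeric leaf values are assumptions, since only the tree shapes are given.

```python
from itertools import product

# A rule is (context, value); a context maps variable names to required truth values.
# Branch polarities and leaf values u1..u6 = 1.0..6.0 are hypothetical.
h1 = [({"x1": False}, 1.0),
      ({"x1": True, "x3": True}, 2.0),
      ({"x1": True, "x3": False}, 3.0)]
h2 = [({"x2": False}, 4.0),
      ({"x2": True, "x1": False}, 5.0),
      ({"x2": True, "x1": True}, 6.0)]

def consistent(c1, c2):
    """True if the two contexts agree on every variable they share."""
    return all(c2.get(k, v) == v for k, v in c1.items())

def add_rule_sets(rules_a, rules_b):
    out = []
    for ctx_a, val_a in rules_a:
        for ctx_b, val_b in rules_b:
            if consistent(ctx_a, ctx_b):
                out.append(({**ctx_a, **ctx_b}, val_a + val_b))
    return out

def evaluate(rules, x):
    """Sum the values of all rules whose context matches the assignment x."""
    return sum(v for ctx, v in rules if all(x[k] == val for k, val in ctx.items()))

total = add_rule_sets(h1, h2)
leaf_values = sorted(v for _, v in total)
```

The result has 6 rules, matching the 6 leaves of the summed tree (u1+u4, u1+u5, u2+u4, u3+u4, u2+u6, u3+u6), rather than one entry per joint state.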
Multiplying over Rules
• Analogous construction
$$0 \ge \max_x \Big\{ \sum_i w_i \Big[ \gamma \sum_{x'} P_a(x' \mid x)\, h_i(x') - h_i(x) \Big] + R_a(x) \Big\} \quad \forall a$$
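Rule-based Maximization
$$0 \ge \max_x \Big\{ \sum_i w_i \Big[ \gamma \sum_{x'} P_a(x' \mid x)\, h_i(x') - h_i(x) \Big] + R_a(x) \Big\} \quad \forall a$$
Before: the root splits on x1; one branch is the leaf u1, the other splits on x2; x2 splits into the leaf u2 and a split on x3 with leaves u3 and u4.
Eliminate x2
After: the root splits on x1; one branch is the leaf u1, the other splits on x3 into leaves max(u2,u3) and max(u2,u4).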
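A sketch that checks the result of eliminating x2 from the tree above. For brevity it maximizes over x2 by evaluating the rules on each assignment of the remaining variables, rather than manipulating rules directly; the branch polarities and the leaf values are assumptions.

```python
from itertools import product

BOOL = [False, True]
u1, u2, u3, u4 = 1.0, 4.0, 2.0, 7.0      # hypothetical leaf values

# Rules read off the tree; which branch is True or False is assumed.
rules = [({"x1": False}, u1),
         ({"x1": True, "x2": False}, u2),
         ({"x1": True, "x2": True, "x3": False}, u3),
         ({"x1": True, "x2": True, "x3": True}, u4)]

def evaluate(rules, x):
    return sum(v for ctx, v in rules if all(x[k] == val for k, val in ctx.items()))

def max_out(rules, var, remaining):
    """Table over the remaining variables of max over `var` of the rule function."""
    table = {}
    for values in product(BOOL, repeat=len(remaining)):
        x = dict(zip(remaining, values))
        table[values] = max(evaluate(rules, {**x, var: b}) for b in BOOL)
    return table

eliminated = max_out(rules, "x2", ("x1", "x3"))   # keys are (x1, x3)
```

The resulting table depends on x1 and x3 only, with entries u1, max(u2,u3), and max(u2,u4) as in the reduced tree.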
Rule-based Linear Program
• Backprojection, objective function – handled in a similar way
• All the operations (summation, multiplication, maximization) – keep rule representation intact
• $w_j \sum_{\mathrm{Rule}_i \in h_j} \mathrm{Rule}_i(x_1,x_2,x_3,x_4)$ is a linear function of the weights
Conclusions
• Compact representation can be exploited to solve MDPs with exponentially many states efficiently.
• Still NP-complete in the worst case.
• Factored solution may increase the size of the LP when the number of states is small (but it scales better).
• Success depends on the choice of the basis functions for value approximation and the factored decomposition of rewards and transition probabilities.