Deciding Under Probabilistic Uncertainty
Russell and Norvig: Sect. 17.1-3, Chap. 17
CS121 – Winter 2003
Non-deterministic vs. Probabilistic Uncertainty
Non-deterministic model: an action applied in the current state may lead to any of the successor states {a, b, c}; the decision is the one that is best for the worst case (~ adversarial search).
Probabilistic model: an action leads to a with probability pa, to b with probability pb, and to c with probability pc; the decision is the one that maximizes the expected utility value.
One State/One Action Example
Action A1 applied in state S0 leads to S1, S2, or S3 with probabilities 0.2, 0.7, and 0.1; the utilities of these states are 100, 50, and 70.
U(S0) = 100 × 0.2 + 50 × 0.7 + 70 × 0.1 = 20 + 35 + 7 = 62
One State/Two Actions Example
As above, A1 leads to S1, S2, S3 (utilities 100, 50, 70) with probabilities 0.2, 0.7, 0.1. A second action A2 leads to S2 with probability 0.2 and to a new state S4 (utility 80) with probability 0.8.
• U1(S0) = 62
• U2(S0) = 50 × 0.2 + 80 × 0.8 = 74
• U(S0) = max{U1(S0), U2(S0)} = 74
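To make the computation concrete, here is a minimal Python sketch of the expected-utility calculation above; the outcome list for A2 (a 0.2-probability outcome of utility 50 and a 0.8-probability outcome of utility 80) is read off the slide's figure.

```python
# Expected utility of each action from S0, using the numbers on this slide.
actions = {
    "A1": [(0.2, 100), (0.7, 50), (0.1, 70)],   # (probability, utility of outcome)
    "A2": [(0.2, 50), (0.8, 80)],
}

def expected_utility(outcomes):
    """Sum of probability x utility over the possible outcomes of one action."""
    return sum(p * u for p, u in outcomes)

utilities = {a: expected_utility(o) for a, o in actions.items()}
best = max(utilities, key=utilities.get)
print(utilities)                      # expected utilities: A1 -> 62, A2 -> 74
print("U(S0) =", utilities[best], "obtained with", best)
```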
Introducing Action Costs
Same example, but each action now has a cost: -5 for A1 and -25 for A2.
• U1(S0) = 62 – 5 = 57
• U2(S0) = 74 – 25 = 49
• U(S0) = max{U1(S0), U2(S0)} = 57
Example: Finding Juliet
Romeo, in Charles' office, wants to find Juliet, who is either in her office or in the conference room.
[Map: Charles' office, Juliet's office, and the conference room, with travel times of 10 min, 5 min, and 10 min between them.]
Example: Finding Juliet
States:
• S0: Romeo in Charles' office
• S1: Romeo in Juliet's office and Juliet here
• S2: Romeo in Juliet's office and Juliet not here
• S3: Romeo in conference room and Juliet here
• S4: Romeo in conference room and Juliet not here
Actions:
• GJO (go to Juliet's office)
• GCR (go to conference room)
Utility Computation
[Decision tree: from state 0, GJO leads to states 1 and 2 and GCR leads to states 3 and 4; from state 2, GCR leads to state 3, and from state 4, GJO leads to state 1. State rewards are 0; the costs attached to the actions are the travel times (-10, -10, -10, -15, -10).]
[The map of the three rooms and their travel times (10 min, 5 min, 10 min) is repeated.]
n-Step Decision Process
• There is a single initial state
• States reached after i steps are all different from those reached after j ≠ i steps
• Each state i has its own reward R(i)
• Each state reached after n steps is terminal
• The goal is to maximize the sum of rewards
[Figure: layered state graph, steps 0, 1, 2, 3.]
n-Step Decision Process
Utility of state i:
  U(i) = R(i) + max_a Σ_k P(k | a.i) U(k)
• Two possible actions from i: a1 and a2
• a1 leads to k11 with probability P11 = P(k11 | a1.i) or to k12 with probability P12 = P(k12 | a1.i)
• a2 leads to k21 with probability P21 or to k22 with probability P22
• U(i) = R(i) + max{P11 U(k11) + P12 U(k12), P21 U(k21) + P22 U(k22)}
n-Step Decision Process
For j = n-1, n-2, …, 0 do:
  For every state si attained after step j:
    Compute the utility of si
    Label that state with the corresponding best action
Utility of state i:
  U(i) = R(i) + max_a Σ_k P(k | a.i) U(k)
Best choice of action at state i (the optimal policy):
  π*(i) = arg max_a Σ_k P(k | a.i) U(k)
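As an illustration, here is a small Python sketch of this backward pass on a made-up 2-step process; the states, rewards, and probabilities below are invented for the example, not taken from the slides.

```python
# Sketch: backward induction over an n-step decision process (n = 2 here).
# layers[j] maps each state reachable after step j to its reward R(i) and,
# for non-terminal states, to actions whose values are {successor: probability}.
# The recurrence is the one from the slide: U(i) = R(i) + max_a sum_k P(k|a.i) U(k).

layers = [
    {"i0": {"R": 0, "actions": {"a": {"i1": 0.6, "i2": 0.4},
                                "b": {"i2": 1.0}}}},
    {"i1": {"R": 1, "actions": {"a": {"i3": 0.5, "i4": 0.5}}},
     "i2": {"R": 0, "actions": {"a": {"i4": 1.0}}}},
    {"i3": {"R": 10, "actions": {}},   # states after step n are terminal
     "i4": {"R": 2,  "actions": {}}},
]

U, best = {}, {}
for j in reversed(range(len(layers))):            # j = n, n-1, ..., 0
    for state, info in layers[j].items():
        if not info["actions"]:                   # terminal: utility is its reward
            U[state] = info["R"]
            continue
        # Pick the action maximizing the expected utility of the successors.
        values = {a: sum(p * U[k] for k, p in succ.items())
                  for a, succ in info["actions"].items()}
        best[state] = max(values, key=values.get)
        U[state] = info["R"] + values[best[state]]

print(U["i0"], best)   # utility of the initial state and the best-action labels
```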
n-Step Decision Process … with costs on actions instead of rewards on states
Utility of state i:
  U(i) = max_a (Σ_k P(k | a.i) U(k) – Ca)
Best choice of action at state i:
  π*(i) = arg max_a (Σ_k P(k | a.i) U(k) – Ca)
[The same procedure applied to the Finding Juliet decision tree: each state (0, 1, 2, 3, 4) is labeled with its utility and with its best action (GJO or GCR).]
Target Tracking
[Figure: robot and target.]
• The robot must keep the target in view
• The target's trajectory is not known in advance
• The environment may or may not be known
States Are Indexed by Time
• State = (robot-position, target-position, time)
• Action = (stop, up, down, right, left)
• Outcome of an action = 5 possible states, each with probability 0.2
Example: applying right in state ([i,j], [u,v], t) leads to one of:
• ([i+1,j], [u,v], t+1)
• ([i+1,j], [u-1,v], t+1)
• ([i+1,j], [u+1,v], t+1)
• ([i+1,j], [u,v-1], t+1)
• ([i+1,j], [u,v+1], t+1)
Each state has 25 successors.
h-Step Planning Process
Planning horizon h
"Terminal" states:
• States where the target is not visible
• States at depth h
Reward function:
• Target visible: +1
• Target not visible: 0
Maximizing the sum of rewards ~ maximizing escape time
With discounting: R(state) = γ^t, where 0 < γ < 1
h-Step Planning Process
Planning horizon h
The planner computes the optimal policy over this tree of states, but only the first step of the policy is executed.
Then everything is repeated again … (sliding horizon)
h-Step Planning Process
Planning horizon h
The horizon h is chosen such that the optimal policy over the tree can be computed within one time increment.
Example With No Planner
Example With Planner
Other Example
h-Step Planning Process
Planning horizon h
The optimal policy over this tree is not the optimal policy that would have been computed if a prior model of the environment had been available, along with an arbitrarily fast computer.
Simple Robot Navigation Problem
• In each state, the possible actions are U, D, R, and L
Probabilistic Transition Model
• In each state, the possible actions are U, D, R, and L
• The effect of U is as follows (transition model):
  • With probability 0.8 the robot moves up one square (if the robot is already in the top row, then it does not move)
  • With probability 0.1 the robot moves right one square (if the robot is already in the rightmost column, then it does not move)
  • With probability 0.1 the robot moves left one square (if the robot is already in the leftmost column, then it does not move)
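A sketch of this transition model as a Python function, using [column, row] coordinates on the 4×3 grid of the following slides; the obstacle square [2,2] and the "stay in place when blocked" rule are assumptions consistent with those slides rather than statements made here.

```python
# Sketch: the transition model for action U on the 4x3 grid world.
# Coordinates are [column, row] with columns 1..4 and rows 1..3.
# Assumption: square [2,2] is an obstacle, and a blocked move leaves the robot in place.

COLS, ROWS = 4, 3
OBSTACLE = (2, 2)

def move(state, dx, dy):
    """Target square of a deterministic move; stay in place if blocked."""
    x, y = state
    nx, ny = x + dx, y + dy
    if not (1 <= nx <= COLS and 1 <= ny <= ROWS) or (nx, ny) == OBSTACLE:
        return state
    return (nx, ny)

def transition_U(state):
    """P(next_state | U, state) as a dictionary."""
    probs = {}
    for (dx, dy), p in {(0, 1): 0.8, (1, 0): 0.1, (-1, 0): 0.1}.items():
        nxt = move(state, dx, dy)
        probs[nxt] = probs.get(nxt, 0.0) + p   # merge outcomes that coincide
    return probs

print(transition_U((3, 2)))   # {(3, 3): 0.8, (4, 2): 0.1, (3, 2): 0.1}
```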
Markov Property
The transition properties depend only on the current state, not on previous history (how that state was reached)
Sequence of Actions
• Planned sequence of actions: (U, R), starting from [3,2]
• U is executed, then R is executed
[Figure: the 4×3 grid world; after U, the robot is in [3,2], [3,3], or [4,2].]
Histories
• Planned sequence of actions: (U, R)
• U has been executed; R is executed
• There are 9 possible sequences of states – called histories – and 6 possible final states for the robot!
[Tree of states: from [3,2], U leads to [3,2], [3,3], or [4,2]; from these, R leads to [3,1], [3,2], [3,3], [4,1], [4,2], or [4,3].]
Probability of Reaching the Goal
• P([4,3] | (U,R).[3,2]) = P([4,3] | R.[3,3]) × P([3,3] | U.[3,2]) + P([4,3] | R.[4,2]) × P([4,2] | U.[3,2])
• P([3,3] | U.[3,2]) = 0.8    P([4,2] | U.[3,2]) = 0.1
• P([4,3] | R.[3,3]) = 0.8    P([4,3] | R.[4,2]) = 0.1
• P([4,3] | (U,R).[3,2]) = 0.8 × 0.8 + 0.1 × 0.1 = 0.65
Note the importance of the Markov property in this derivation.
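The 0.65 can also be checked mechanically by pushing a distribution over states through the two actions; a minimal sketch, with R modeled analogously to U (intended move with probability 0.8, each perpendicular move with probability 0.1), which matches the probabilities quoted above:

```python
# Sketch: verify P([4,3] | (U,R).[3,2]) by chaining one-step distributions.
# Same grid assumptions as before: 4 columns x 3 rows, obstacle at [2,2],
# blocked moves leave the robot in place.

COLS, ROWS, OBSTACLE = 4, 3, (2, 2)

def move(s, dx, dy):
    nx, ny = s[0] + dx, s[1] + dy
    ok = 1 <= nx <= COLS and 1 <= ny <= ROWS and (nx, ny) != OBSTACLE
    return (nx, ny) if ok else s

# Each action: intended move with prob 0.8, the two perpendicular moves with 0.1.
EFFECTS = {"U": [((0, 1), 0.8), ((1, 0), 0.1), ((-1, 0), 0.1)],
           "R": [((1, 0), 0.8), ((0, 1), 0.1), ((0, -1), 0.1)]}

def step(dist, action):
    """Push a distribution over states through one action."""
    out = {}
    for s, p in dist.items():
        for d, q in EFFECTS[action]:
            ns = move(s, *d)
            out[ns] = out.get(ns, 0.0) + p * q
    return out

dist = {(3, 2): 1.0}
for a in ("U", "R"):
    dist = step(dist, a)
print(dist.get((4, 3), 0.0))   # 0.65 (up to floating-point rounding)
```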
Utility Function
• The robot needs to recharge its batteries
• [4,3] provides a power supply
• [4,2] is a sand area from which the robot cannot escape
• [4,3] and [4,2] are terminal states
• Reward of a terminal state: +1 or -1
• Reward of a non-terminal state: -1/25
• Utility of a history: sum of the rewards of the traversed states
• Goal: maximize the utility of the history
[Figure: the 4×3 grid with reward +1 at [4,3] and -1 at [4,2].]
Histories are potentially unbounded and the same state can be reached many times.
Utility of an Action Sequence
• Consider the action sequence (U, R) from [3,2]
• A run produces one among 7 possible histories, each with some probability
• The utility of the sequence is the expected utility of the histories:
  U = Σ_h U_h P(h)
[Tree of states reached from [3,2] by executing (U, R), with the terminal rewards +1 and -1 shown on the grid.]
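A short Python sketch that enumerates these histories and evaluates U = Σ_h U_h P(h) for (U, R) from [3,2]; the grid, obstacle, and reward model are the ones described on the previous slides, and histories are cut off at terminal states.

```python
# Sketch: expected utility of the open-loop sequence (U, R) from [3,2],
# with rewards +1 at [4,3], -1 at [4,2], -1/25 elsewhere.

COLS, ROWS, OBSTACLE = 4, 3, (2, 2)
TERMINAL = {(4, 3): 1.0, (4, 2): -1.0}

def move(s, dx, dy):
    nx, ny = s[0] + dx, s[1] + dy
    ok = 1 <= nx <= COLS and 1 <= ny <= ROWS and (nx, ny) != OBSTACLE
    return (nx, ny) if ok else s

EFFECTS = {"U": [((0, 1), 0.8), ((1, 0), 0.1), ((-1, 0), 0.1)],
           "R": [((1, 0), 0.8), ((0, 1), 0.1), ((0, -1), 0.1)]}

def reward(s):
    return TERMINAL.get(s, -1.0 / 25)

def histories(s, plan, p=1.0, u=None):
    """Yield (probability, utility) for every history of the plan from s."""
    u = reward(s) if u is None else u
    if not plan or s in TERMINAL:
        yield p, u
        return
    for d, q in EFFECTS[plan[0]]:
        ns = move(s, *d)
        yield from histories(ns, plan[1:], p * q, u + reward(ns))

hs = list(histories((3, 2), ("U", "R")))
print(len(hs))                          # 7 histories
print(sum(p * u for p, u in hs))        # expected utility of (U, R)
```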
Optimal Action Sequence
• Consider the action sequence (U, R) from [3,2]
• A run produces one among 7 possible histories, each with some probability
• The utility of the sequence is the expected utility of the histories
• The optimal sequence is the one with maximal utility
• But is the optimal action sequence what we want to compute?
NO!! Except if the sequence is executed blindly (open-loop strategy)
Reactive Agent Algorithm
Repeat:
  s ← sensed state (observable state)
  If s is terminal then exit
  a ← choose action (given s)
  Perform a
Policy (Reactive/Closed-Loop Strategy)
• A policy is a complete mapping from states to actions
[Figure: the 4×3 grid with +1 at [4,3] and -1 at [4,2]; each non-terminal square is labeled with the action the policy prescribes.]

Reactive Agent Algorithm
Repeat:
  s ← sensed state
  If s is terminal then exit
  a ← π(s)
  Perform a
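A minimal sketch of this loop in Python; sense_state and perform stand for hypothetical interfaces to the robot's sensors and actuators, and the policy is any table mapping non-terminal states to actions (for the grid world, the table produced by value or policy iteration below).

```python
def run_reactive_agent(policy, sense_state, perform, terminal_states):
    """Closed-loop execution: sense, check for termination, act, repeat."""
    while True:
        s = sense_state()            # s <- sensed state
        if s in terminal_states:     # if s is terminal then exit
            return s
        perform(policy[s])           # a <- policy(s); perform a
```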
Optimal Policy
• A policy is a complete mapping from states to actions
• The optimal policy π* is the one that always yields a history (ending at a terminal state) with maximal expected utility
[Figure: the 4×3 grid showing the optimal policy as an arrow in each non-terminal square.]
Makes sense because of the Markov property.
Note that [3,2] is a "dangerous" state that the optimal policy tries to avoid.
This problem is called a Markov Decision Problem (MDP).
How to compute π*?
The trick used in target tracking (indexing states by time) can be applied …
… but would yield a large tree and a sub-optimal policy.
First-Step Analysis
Simulate one step:
  U(i) = R(i) + max_a Σ_k P(k | a.i) U(k)
  π*(i) = arg max_a Σ_k P(k | a.i) U(k)
(Principle of Maximum Expected Utility)
What is the Difference?
  π*(i) = arg max_a Σ_k P(k | a.i) U(k)
  U(i) = R(i) + max_a Σ_k P(k | a.i) U(k)
[Figures: the n-step decision tree (states indexed by time, steps 0–3) and the 4×3 grid world with terminal rewards +1 and -1.]
Value Iteration
Initialize the utility of each non-terminal state si to U0(i) = 0
For t = 0, 1, 2, …, do:
  Ut+1(i) ← R(i) + max_a Σ_k P(k | a.i) Ut(k)
[Figure: the 4×3 grid with terminal rewards +1 and -1.]
[Plot: Ut([3,1]) as a function of t, rising from 0 and converging to about 0.611 within roughly 30 iterations.]
[Figure: the grid with the converged utilities: bottom row 0.705, 0.655, 0.611, 0.388; middle row 0.762, 0.660, -1; top row 0.812, 0.868, 0.918, +1.]
Note the importance of terminal states and of the connectivity of the state-transition graph.
Not very different from indexing states by time.
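A compact Python sketch of value iteration on this grid world (reward -1/25 for non-terminal states, terminals [4,3] = +1 and [4,2] = -1, obstacle at [2,2], and the 0.8/0.1/0.1 transition model from the earlier slides; the stopping threshold is an arbitrary choice):

```python
# Value iteration: U_{t+1}(i) = R(i) + max_a sum_k P(k | a.i) U_t(k)
COLS, ROWS, OBSTACLE = 4, 3, (2, 2)
TERMINAL = {(4, 3): 1.0, (4, 2): -1.0}
STATES = [(x, y) for x in range(1, COLS + 1) for y in range(1, ROWS + 1)
          if (x, y) != OBSTACLE]
MOVES = {"U": (0, 1), "D": (0, -1), "R": (1, 0), "L": (-1, 0)}
PERP = {"U": "RL", "D": "RL", "R": "UD", "L": "UD"}   # sideways slips

def go(s, a):
    """Deterministic effect of a move; stay in place if blocked."""
    nx, ny = s[0] + MOVES[a][0], s[1] + MOVES[a][1]
    ok = 1 <= nx <= COLS and 1 <= ny <= ROWS and (nx, ny) != OBSTACLE
    return (nx, ny) if ok else s

def transitions(s, a):
    """Intended move with probability 0.8, each perpendicular move with 0.1."""
    return [(go(s, a), 0.8)] + [(go(s, p), 0.1) for p in PERP[a]]

U = {s: TERMINAL.get(s, 0.0) for s in STATES}
while True:
    new = {s: U[s] if s in TERMINAL else
              -1.0 / 25 + max(sum(p * U[k] for k, p in transitions(s, a))
                              for a in MOVES)
           for s in STATES}
    delta = max(abs(new[s] - U[s]) for s in STATES)
    U = new
    if delta < 1e-6:
        break

print(round(U[(3, 1)], 3))   # about 0.611, matching the plot on this slide
```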
Policy Iteration
Pick a policy π at random
Repeat:
  Compute the utility of each state for π by solving
    Uπ(i) = R(i) + Σ_k P(k | π(i).i) Uπ(k)
  (a set of linear equations, often a sparse system)
  Compute the policy π' given these utilities:
    π'(i) = arg max_a Σ_k P(k | a.i) Uπ(k)
  If π' = π then return π, else π ← π'
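A Python sketch of this loop on the same grid world (model repeated from the value-iteration sketch so the snippet stands alone); policy evaluation here solves the linear system with a dense numpy solve, a simplification of the sparse solve the slide alludes to.

```python
import numpy as np

COLS, ROWS, OBSTACLE = 4, 3, (2, 2)
TERMINAL = {(4, 3): 1.0, (4, 2): -1.0}
STATES = [(x, y) for x in range(1, COLS + 1) for y in range(1, ROWS + 1)
          if (x, y) != OBSTACLE]
NONTERM = [s for s in STATES if s not in TERMINAL]
MOVES = {"U": (0, 1), "D": (0, -1), "R": (1, 0), "L": (-1, 0)}
PERP = {"U": "RL", "D": "RL", "R": "UD", "L": "UD"}
REWARD = -1.0 / 25

def go(s, a):
    nx, ny = s[0] + MOVES[a][0], s[1] + MOVES[a][1]
    ok = 1 <= nx <= COLS and 1 <= ny <= ROWS and (nx, ny) != OBSTACLE
    return (nx, ny) if ok else s

def transitions(s, a):
    return [(go(s, a), 0.8)] + [(go(s, p), 0.1) for p in PERP[a]]

def evaluate(pi):
    """Policy evaluation: solve U(i) = R(i) + sum_k P(k | pi(i).i) U(k)."""
    idx = {s: r for r, s in enumerate(NONTERM)}
    A, b = np.eye(len(NONTERM)), np.full(len(NONTERM), REWARD)
    for s in NONTERM:
        for k, p in transitions(s, pi[s]):
            if k in TERMINAL:
                b[idx[s]] += p * TERMINAL[k]     # mass flowing into terminal states
            else:
                A[idx[s], idx[k]] -= p
    u = np.linalg.solve(A, b)
    U = {s: u[idx[s]] for s in NONTERM}
    U.update(TERMINAL)
    return U

pi = {s: "U" for s in NONTERM}        # arbitrary initial policy
while True:
    U = evaluate(pi)
    new_pi = {s: max(MOVES, key=lambda a, s=s: sum(p * U[k] for k, p in transitions(s, a)))
              for s in NONTERM}
    if new_pi == pi:
        break
    pi = new_pi

print(pi[(3, 1)], pi[(3, 2)])         # actions prescribed at [3,1] and [3,2]
```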
Application of First-Step Analysis: Computing the Probability of Folding (pfold) of a Protein
[Figure: a conformation of HIV integrase; from it, the protein reaches the folded state with probability pfold and the unfolded state with probability 1 – pfold.]
Computation Through Simulation
10K to 30K independent simulations
Capture the stochastic nature of molecular motion by a probabilistic roadmap
[Figure: roadmap nodes vi and vj connected by an edge with transition probability Pij.]
Edge Probabilities
Follow the Metropolis criterion:
  Pij = (1/Ni) exp(–ΔEij / kB T)   if ΔEij > 0
  Pij = 1/Ni                       otherwise
where Ni is the number of neighbors of vi, ΔEij is the energy difference between vj and vi, kB is the Boltzmann constant, and T is the temperature.
Self-transition probability:
  Pii = 1 – Σ_{j≠i} Pij
[Figure: node vi with edges to neighbors vj labeled Pij, and a self-loop labeled Pii.]
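A small Python sketch of these two formulas; the energies, temperature, and neighbor count below are placeholder values:

```python
import math

def edge_probability(E_i, E_j, n_i, kB_T):
    """Metropolis edge probability P_ij from node v_i to neighbor v_j.
    n_i is the number of neighbors of v_i; kB_T is k_B * T."""
    dE = E_j - E_i
    if dE > 0:
        return math.exp(-dE / kB_T) / n_i
    return 1.0 / n_i

def self_transition(p_out):
    """P_ii = 1 - sum of the outgoing edge probabilities."""
    return 1.0 - sum(p_out)

# Hypothetical node with three neighbors and made-up energies:
E_i, neighbor_energies, kB_T = 2.0, [1.5, 2.5, 3.0], 0.6
p_out = [edge_probability(E_i, E_j, len(neighbor_energies), kB_T)
         for E_j in neighbor_energies]
print(p_out, self_transition(p_out))
```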
F: Folded set
U: Unfolded set
First-Step Analysis
Let fi = pfold(i). After one step:
  fi = Pii fi + Pij fj + Pik fk + Pil fl + Pim fm
[Figure: node i with neighbors j, k, l, m and transition probabilities Pii, Pij, Pik, Pil, Pim; f = 1 for nodes in the folded set F.]
• One linear equation per node; the solution gives pfold for all nodes
• No explicit simulation run
• All pathways are taken into account
• Sparse linear system
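A minimal sketch of this computation with scipy on a tiny, made-up roadmap (the transition matrix, folded set F, and unfolded set U below are invented for illustration):

```python
# Sketch: first-step analysis for pfold on a (tiny, hypothetical) roadmap.
# For each interior node i: f_i = sum_j P_ij f_j, with f_j = 1 in the folded
# set F and f_j = 0 in the unfolded set U. One linear equation per node.

import numpy as np
from scipy.sparse import lil_matrix
from scipy.sparse.linalg import spsolve

# Hypothetical 4-node roadmap: rows of P sum to 1 (self-transitions included).
P = np.array([[0.5, 0.3, 0.1, 0.1],
              [0.2, 0.5, 0.2, 0.1],
              [0.1, 0.2, 0.6, 0.1],
              [0.1, 0.1, 0.2, 0.6]])
F = {3}          # folded set: pfold = 1
U = {0}          # unfolded set: pfold = 0
interior = [i for i in range(len(P)) if i not in F and i not in U]

# Build (I - P) f = b restricted to the interior nodes.
n = len(interior)
A = lil_matrix((n, n))
b = np.zeros(n)
for r, i in enumerate(interior):
    A[r, r] = 1.0 - P[i, i]
    for c, j in enumerate(interior):
        if j != i:
            A[r, c] = -P[i, j]
    b[r] = sum(P[i, j] for j in F)        # mass flowing directly into F

f = spsolve(A.tocsc(), b)
print(dict(zip(interior, f)))             # pfold for the interior nodes
```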
Partially Observable Markov Decision Problem
• Uncertainty in sensing: a sensing operation returns multiple states, with a probability distribution
Example: Target Tracking
There is uncertainty in the robot's and target's positions, and this uncertainty grows with further motion.
There is a risk that the target escapes behind the corner, requiring the robot to move appropriately.
But there is a positioning landmark nearby. Should the robot try to reduce its position uncertainty?
Summary
• Probabilistic uncertainty
• Utility function
• Optimal policy
• Maximal expected utility
• Value iteration
• Policy iteration