Joelle Pineau Robotics Institute Carnegie Mellon University Stanford University May 3, 2004

Tractable Planning for Real-World Robotics:

The promises and challenges of dealing with uncertainty

Joelle PineauRobotics Institute

Carnegie Mellon University

Stanford University

May 3, 2004

Joelle Pineau Tractable Planning for Real-World Robotics 2

Robots in unstructured environments


A vision for robotic-assisted health-care

Moving thingsaround

Moving thingsaround Supporting

inter-personalcommunication

Supportinginter-personal

communication

Calling for helpin emergencies

Calling for helpin emergencies

Monitoring Rx adherence

& safety

Monitoring Rx adherence

& safety

Providinginformation Providing

information

Reminding to eat, drink, & take meds

Reminding to eat, drink, & take meds

Providing physical

assistance

Providing physical

assistance


What is hard about planning for real-world robotics?

• Generating a plan that has high expected utility.

• Switching between different representations of the world.

• Resolving conflicts between interfering jobs.

• Accomplishing jobs in changing, partly unknown environments.

• Handling percepts which are incomplete, ambiguous, outdated,

incorrect.

* Highlights from a proposed robot grand challenge


Talk outline

• Uncertainty in plan-based robotics

• Partially Observable Markov Decision Processes (POMDPs)

• POMDP solver #1: Point-based value iteration (PBVI)

• POMDP solver #2: Policy-contingent abstraction (PolCA+)


Why use a POMDP?

• POMDPs provide a rich framework for sequential decision-making, which can model:

– Effect uncertainty

– State uncertainty

– Varying rewards across actions and goals


Robotics:

• Robust mobile robot navigation[Simmons & Koenig, 1995; + many more]

• Autonomous helicopter control[Bagnell & Schneider, 2001; Ng et al., 2003]

• Machine vision[Bandera et al., 1996; Darrell & Pentland, 1996]

• High-level robot control[Pineau et al., 2003]

• Robust dialogue management[Roy, Pineau & Thrun, 2000; Peak & Horvitz, 2000]

POMDP applications in last decade

Others:

• Machine maintenance[Puterman., 1994]

• Network troubleshooting[Thiebeaux et al., 1996]

• Circuits testing [correspondence, 2004]

• Preference elicitation[Boutilier, 2002]

• Medical diagnosis[Hauskrecht, 1997]


POMDP model

POMDP is n-tuple { S, A, Z, T, O, R }:

What goes on: st-1 st

at-1 at

S = state setA = action setZ = observation set

What we see: zt-1 zt


POMDP model



at-1 at

T: Pr(s’|s,a) = state-to-state transition probabilitiesS = state setA = action setZ = observation set



POMDP model



at-1 at

T: Pr(s’|s,a) = state-to-state transition probabilitiesO: Pr(z|s,a) = observation generation probabilities




POMDP model



at-1 at

T: Pr(s’|s,a) = state-to-state transition probabilitiesO: Pr(z|s,a) = observation generation probabilitiesR(s,a) = reward function



rt-1 rt


POMDP model



at-1 atWhat we see: zt-1 zt

What we infer: bt-1 bt

rt-1 rt

T: Pr(s’|s,a) = state-to-state transition probabilitiesO: Pr(z|s,a) = observation generation probabilitiesR(s,a) = reward function



Examples of robot beliefs

robot particles

Uniform belief Bi-modal belief


Understanding the belief state

• A belief is a probability distribution over states

Where Dim(B) = |S|-1

– E.g. Let S={s1, s2}

P(s1)

0

1





– E.g. Let S={s1, s2, s3}

P(s1)

P(s2)

0

1

1





– E.g. Let S={s1, s2, s3 , s4}

P(s1)

P(s2)

0

1

1

P(s3)


POMDP solving

Objective: Find the sequence of actions that maximizes the expected sum of rewards.

Bb

AabVbabTabRbV

'

)'()',,(),(max)(

Valuefunction

Immediatereward

Futurereward


• Represent V(b) as the upper surface of a set of vectors.– Each vector is a piece of the control policy (= action sequences).

– Dim(vector) = number of states.

• Modify / add vectors to update value fn (i.e. refine policy).

POMDP value function

P(s1)

V(b)

b

2 states


Optimal POMDP solving

• Simple problem: 2 states, 3 actions, 3 observations

V0(b)

b

Plan length #vectors 0 1

P(break-in)




P(break-in)

V1(b)

b

Plan length # vectors 0 1 1 3

Call-911

Investigate

Go-to-bed




V2(b)

b

Plan length # vectors 0 1 1 3 2 27

P(break-in)

Call-911

Investigate

Go-to-bed




V3(b)

b

Plan length # vectors 0 1 1 3 2 27 3 2187

P(break-in)

Call-911

Investigate

Go-to-bed




Plan length # vectors 0 1 1 3 2 27 3 2187 4 14,348,907V4(b)

b

P(break-in)

Call-911

Investigate

Go-to-bed


The curse of history

)A( Z1 nn O

Policy size grows exponentially with the

planning horizon:

Where Γ = policy sizen = planning horizonA = # actionsZ = # observations


How many vectors for this problem?

104 (navigation) x 103 (dialogue) states1000+ observations100+ actions


Talk outline






Exact solving assumes all beliefs are equally likely

robot particles

Uniform belief Bi-modal belief N-modal belief

INSIGHT: No sequence of actions and observations canproduce this N-modal belief.


A new algorithm: Point-based value iteration

P(s1)

V(b)

b1 b0 b2

Approach:

Select a small set of belief points

Plan for those belief points only



P(s1)

V(b)

b1 b0 b2a,z a,z

Approach:

Select a small set of belief points Use well-separated, reachable beliefs

Plan for those belief points only



P(s1)

V(b)

b1 b0 b2a,z a,z

Approach:


Plan for those belief points only Learn value and its gradient



Approach:


Plan for those belief points only Learn value and its gradient

Pick action that maximizes value: bbV

max)(

P(s1)

V(b)

bb1 b0 b2


The anytime PBVI algorithm

• Alternate between:

1. Growing the set of belief point

2. Planning for those belief points

• Terminate when you run out of time or have a good policy.


Complexity of value update

Exact Update PBVI

Time:Projection S2 A Z n S2 A Z n

Sum S A nZ S A Z n B

Size: (# vectors) A nZ B

where: S = # states n = # vectors at iteration n A = # actions B = # belief points

Z = # observations


Theoretical properties of PBVI

Theorem: For any set of belief points B and planning horizon n, the error of the PBVI algorithm is bounded by:

P(s1)

V(b)

b1 b0 b2

Where is the set of reachable beliefsB is the set of all beliefs

1'2minmax* ||'||minmax

)1(

)(|||| bb

RRVV Bbbn

Bn

Err Err


The anytime PBVI algorithm

• Alternate between:

1. Growing the set of belief point

2. Planning for those belief points

• Terminate when you run out of time or have a good policy.


PBVI’s belief selection heuristic

1. Leverage insight from policy search methods:

– Focus on reachable beliefs.

P(s1)

b ba1,z2ba2,z2ba2,z1ba1,z1

a2,z2 a1,z2

a2,z1

a1,z1





2. Focus on high probability beliefs:

– Consider all actions, but stochastic observation choice.

P(s1)

b ba1,z2ba2,z1

a1,z2

a2,z1

ba2,z2ba1,z1





2. Focus on high probability beliefs:

– Consider all actions, but stochastic observation choice.

3. Use the error bound on point-based value updates:

– Select well-separated beliefs, rather than near-by beliefs.

P(s1)

b ba1,z2ba2,z1

a1,z2

a2,z1


Classes of value function approximations

1. No belief [Littman et al., 1995]

3. Compressed belief[Poupart&Boutilier, 2002;

Roy&Gordon, 2002]

x1

x0

x2

2. Grid over belief[Lovejoy, 1991; Brafman 1997;

Hauskrecht, 2000; Zhou&Hansen, 2001]

4. Sample belief points[Poon, 2001; Pineau et al, 2003]


Performance on well-known POMDPs

Maze1

0.20

0.94

0.00

2.30

2.25

REWARD

Maze2

0.11

-

0.07

0.35

0.34

Maze3

0.26

-

0.11

0.53

0.53

Maze1

0.19

-

24hrs

12166

3448

TIME

Maze2

1.44

-

24hrs

27898

360

Maze3

0.51

-

24hrs

450

288

Maze1

-

174

-

660

470

# Belief points

Maze2

-

337

-

1840

95

Maze3

-

-

-

300

86

Method

No belief[Littman&al., 1995]

Grid[Brafman., 1997]

Compressed[Poupart&al., 2003]

Sample[Poon, 2001]

PBVI[Pineau&al., 2003]

Maze1:36 states

Maze2:92 states

Maze3: 60 states


PBVI in the Nursebot domain

Objective: Find the patient!

State space = RobotPosition PatientPosition

Observation space = RobotPosition + PatientFound

Action space = {North, South, East, West, Declare}


PBVI performance on find-the-patient domain

Patient found 17% of trials

Patient found 90% of trialsNo Belief PBVI

No Belief

PBVI


Validation of PBVI’s belief expansion heuristic

Greedy

PBVI

No BeliefRandom

Find-the-patient domain870 states, 5 actions, 30 observations


Policy assuming full observability

You find some:(25%)

You loose some:(75%)


PBVI Policy with 3141 belief points

You find some:(81%)



PBVI Policy with 643 belief points

You find some:(22%)



Highlights of the PBVI algorithm

• Algorithmic:

– New belief sampling algorithm.

– Efficient heuristic for belief point selection.

– Anytime performance.

• Experimental:

– Outperforms previous value approximation algorithms on known problems.

– Solves new larger problem (1 order of magnitude increase in problem size).

• Theoretical:

– Bounded approximation error.

[ Pineau, Gordon & Thrun, IJCAI 2003. Pineau, Gordon & Thrun, NIPS 2003. ]


Back to the grand challenge

How can we go from 103 states to real-world problems?


Talk outline






Navigation

Structured POMDPs

Many real-world decision-making problems exhibit structure inherent to the problem domain.

Cognitive support Social interaction

High-level controller

Move AskWhere

Left Right Forward Backward


Structured POMDP approaches

Factored models[Boutilier & Poole, 1996; Hansen & Feng, 2000; Guestrin et al., 2001]

– Idea: Represent state space with multi-valued state features.

– Insight: Independencies between state features can be leveraged to

overcome the curse of dimensionality.

Hierarchical POMDPs[Wiering & Schmidhuber, 1997; Theocharous et al., 2000; Hernandez-Gardiol &

Mahadevan, 2000; Pineau & Thrun, 2000]

– Idea: Exploit domain knowledge to divide one POMDP into many

smaller ones.

– Insight: Smaller action sets further help overcome the curse of history.


A hierarchy of POMDPs

Act

ExamineHealth Navigate

MoveVerifyFluids

ClarifyGoal

North South East West

VerifyMeds

subtask

abstract action

primitive action


PolCA+: Planning with a hierarchy of POMDPs

Navigate

Move ClarifyGoal

South East WestNorth

AMove = {N,S,E,W}

ACTIONSNorthSouthEastWest

ClarifyGoalVerifyFluidsVerifyMeds



Step 1: Select the action set



Navigate

Move ClarifyGoal


AMove = {N,S,E,W}

SMove = {s1,s2}

STATE FEATURESX-positionY-position

X-goalY-goal

HealthStatus


X-goalY-goal

HealthStatus






Step 2: Minimize the state set



Navigate

Move ClarifyGoal


AMove = {N,S,E,W}

SMove = {s1,s2}


X-goalY-goal

HealthStatus


X-goalY-goal

HealthStatus





PARAMETERS

{bh,Th,Oh,Rh}

PARAMETERS

{bh,Th,Oh,Rh}



Step 3: Choose parameters



Navigate

Move ClarifyGoal



X-goalY-goal

HealthStatus


X-goalY-goal

HealthStatus





PLAN

h

PLAN

h

PARAMETERS

{bh,Th,Oh,Rh}

PARAMETERS

{bh,Th,Oh,Rh}



Step 3: Choose parameters

Step 4: Plan task h

AMove = {N,S,E,W}

SMove = {s1,s2}


PolCA+ in the Nursebot domain

• Goal: A robot is deployed in a nursing home, where it provides reminders to elderly users and accompanies them to appointments.


Performance measure

-2000

2000

6000

10000

14000

0 400 800 1200

Time Steps

Cum

ulat

ive

Rew

ard

PolCA+

PolCA

QMDP

Hierarchy + Belief

Execution Steps

Hierarchy + Belief

Hierarchy + BeliefPolCA+


Comparing user performance

0.1 0.10.18

POMDP PolicyNo Belief Policy


Visit to the nursing home


Highlights of the PolCA+ algorithm

• Algorithmic:– New hierarchical approach for POMDP framework.

– POMDP-specific state and observation abstraction methods.

• Experimental:– First instance of POMDP-based high-level robot controller.

– Novel application of POMDPs to robust dialogue management.

• Theoretical:– For special case (fully observable), guarantees recursive optimality.

[ Pineau, Gordon & Thrun, UAI 2003. Pineau et al., RAS 2003. Roy, Pineau & Thrun, ACL 2001]


Future work

PBVI:

• How can we handle domains with multi-valued state features?

• Can we leverage dimensionality reduction?

• Can we find better ways to pick belief points?

PolCA+:

• Can we automatically learn hierarchies?

• How can we learn (or do without) pseudo-reward functions?


Questions?

Project information:www.cs.cmu.edu/~nursebot

Navigation software:www.cs.cmu.edu/~carmen

Papers and more:www.cs.cmu.edu/~jpineau

Collaborators: Geoffrey Gordon, Judith Matthews, Michael Montemerlo,Martha Pollack, Nicholas Roy, Sebastian Thrunh


Two types of uncertainty

Effect → Stochastic action effects

State → Partial and noisy sensor information


Example: Effect uncertainty

Startposition

Distribution over possiblenext-step positions

Startposition


Motion action


Two types of uncertainty




Example: State uncertainty



Model → Inaccurate parameterization of the environment

Agent → Unknown behaviour of other agents

Startposition



Validation of PBVI’s belief expansion heuristic

No Belief

Hallway domain60 states, 5 actions, 20 observations


0

500

1000

1500

2000

2500

3000

3500

4000

4500

NoAbs PolCA PolCA+

# S

tate

ssubInform

subMove

subContact

subRest

subAssist

subRemind

act

State space reduction

No hierarchy PolCA+


Future directions

• Improving POMDP planning

– sparser belief space sampling, ordered value updating, dimensionality reduction, continuous / hybrid domains

• Addressing two more types of uncertainty:1. Effect2. State3. Model4. Agent

• Exploring new applications

Documents

Joelle Pineau Robotics Institute Carnegie Mellon University Stanford University May 3, 2004