Integrating POMDP and RL for a Two Layer Simulated Robot Architecture

Presented by Alp Sardağ

Two Layer Architecture

The lower layer provides fast, short-horizon decisions.

The lower layer is designed to keep the robot out of trouble.

The upper layer ensures that the robot continually works toward its target task or goal.

The two-layer architecture offers reliability.

Reliability: the robot must be able to deal with failure of sensors and actuators.

Hardware failure = mission failure.

Example: robots operating out of direct human control, such as space exploration robots and office robots.

The System

It has two levels of control:

The lower level controls the actuators that move the robot around and provides a set of behaviors that can be used by the higher level of control.

The upper level, the planning system, plans a sequence of actions in order to move the robot from its current location to the goal.

The Architecture

The bottom level is implemented with reinforcement learning (RL):

RL, as an incremental learning method, is able to learn online.

RL can adapt to changes in the environment.

RL reduces programmer intervention.

The Architecture

The higher level is a POMDP planner:

The POMDP planner operates quickly once a policy is generated.

The POMDP planner can provide the reinforcement needed by the lower-level behaviors.

The Test

For testing, the Khepera robot simulator is used.

Khepera has limited sensors.

It has a well-defined environment.

The simulator can run much faster than real time.

The simulator does not require human intervention for low battery conditions and sensor failures.

Methods for Low-Level Behaviors

Subsumption.

Learning from examples.

Behavioral cloning.

Methods for Low-Level Behaviors

Neural systems tend to be robust to noise and perturbation in the environment.

GeSAM is a neural-network-based robot hand control system. GeSAM uses an adaptive neural network.

Neural networks often require long training periods and large amounts of data.

Methods for Low-Level Behaviors

RL can learn continuously.

RL provides adaptation to sensor drift and changes in actuators.

Even in many extreme cases of sensor or actuator failure, RL can adapt enough to allow the robot to accomplish the mission.

Planning at the Top

POMDP deals with uncertainty.

For Khepera, with limited sensors, determining the exact state is very difficult.

Also, the effects of actuators may not be


Planning at the Top

Some rewards are associated with the goal state.

Some rewards are associated with performing some action in a certain state.

This allows complex, compound goals to be defined.

The current POMDP solution method:

Does not scale well with the size of the state space.

Exact solutions are only feasible for very small POMDP planning problems.

Requires that the robot be given a map, which is not always feasible.

What is Gained?

By combining RL and POMDP, the system is robust to changes.

RL will learn how to use the damaged sensors and actuators.

Continuous learning has some drawbacks when using backpropagation neural networks, such as over-training.

The POMDP adapts to sensor and actuator failures by adjusting the transition probabilities.

The Simulator

Pulse encoders are not used in this work.

The simulation results can be successfully transferred to a real robot.

The sensor model includes stochastic modeling of noise and responds similarly to the real sensors.

The simulation environment includes some stochastic modeling of wheel slippage and acceleration.

Hooks are added to the simulator to allow sensor failures to be simulated.

Effector failures are simulated in the code.

RL Behaviors

Three basic behaviors: move forward, turn right, and turn left.

The robot is always moving or performing an action.

RL is responsible for dealing with obstacles and for adjusting to sensor or actuator malfunction.

RL Behaviors

The goal of each RL module is to maximize the reward given to it by the POMDP planner.

The reward is a function of how long it took to make a desired state transition.

Each behavior has its own RL module.

Only one RL module can be active at a given time.

Q-learning with table lookup is used for approximating the value function.

Fortunately, the problem is so far small enough for table lookup.
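Q-learning with table lookup, as used by the RL modules, can be sketched roughly as follows. This is a minimal illustration, not the presented system's code: the class name `TabularQLearner`, the learning rate `alpha`, the discount `gamma`, the exploration rate `epsilon`, and the string state labels are all assumptions, since the slides give none of these details.

```python
import random
from collections import defaultdict


class TabularQLearner:
    """Hypothetical table-lookup Q-learner for one low-level behavior."""

    def __init__(self, n_actions, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
        self.n_actions = n_actions
        self.alpha = alpha      # learning rate (assumed value)
        self.gamma = gamma      # discount factor (assumed value)
        self.epsilon = epsilon  # exploration probability (assumed value)
        self.rng = random.Random(seed)
        # The "table": one row of action values per state, created on first use.
        self.q = defaultdict(lambda: [0.0] * n_actions)

    def choose(self, state):
        """Epsilon-greedy action selection."""
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(self.n_actions)
        values = self.q[state]
        return values.index(max(values))

    def update(self, state, action, reward, next_state):
        """Standard one-step Q-learning update."""
        target = reward + self.gamma * max(self.q[next_state])
        self.q[state][action] += self.alpha * (target - self.q[state][action])


learner = TabularQLearner(n_actions=3, epsilon=0.0)
learner.update("s0", 1, 1.0, "s1")  # reward of 1 for action 1 in state "s0"
best = learner.choose("s0")         # greedy choice after one update
```

With one such learner per behavior, only the learner belonging to the currently active behavior would be updated at each step.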

POMDP planning

Since robots can rarely determine their state from sensor observations, completely observable MDPs (COMDPs) do not work well in many real-world robot planning tasks.

It is more appropriate to maintain a probability distribution over states and to update it using the transition and observation probabilities.
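The update of the state probability distribution can be illustrated with the standard Bayesian belief update for a POMDP. The sketch below is a generic textbook form, not code from the presented system; the function name `belief_update` and the nested-list layout of `T` and `O` are assumptions made for the example.

```python
def belief_update(belief, action, observation, T, O):
    """Return the new belief after taking `action` and seeing `observation`.

    belief[s]        : current probability of being in state s
    T[action][s][s2] : probability of moving from s to s2 under `action`
    O[action][s2][o] : probability of observing o after arriving in s2
    """
    n = len(belief)
    # Prediction step: push the belief through the transition model.
    predicted = [
        sum(belief[s] * T[action][s][s2] for s in range(n)) for s2 in range(n)
    ]
    # Correction step: weight each state by how well it explains the observation.
    unnormalized = [O[action][s2][observation] * predicted[s2] for s2 in range(n)]
    total = sum(unnormalized)
    if total == 0.0:
        # The observation is impossible under the model; fall back to the prediction.
        return predicted
    return [p / total for p in unnormalized]


# Toy model: two states, one action (index 0), two observations.
T = [[[0.8, 0.2], [0.1, 0.9]]]
O = [[[0.9, 0.1], [0.3, 0.7]]]
new_belief = belief_update([0.5, 0.5], action=0, observation=1, T=T, O=O)
```

In the toy model, observation 1 is more likely in state 1, so the updated belief shifts toward state 1.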

Sensor Grouping

Khepera has 8 sensors that report distance values between 0 and 1024.

The observations are reduced to 16:

The sensors are grouped in pairs to make 4 pseudo sensors.

Thresholding is applied to the output of the sensors.

The POMDP planner is now robust to single sensor failures.
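The reduction from 8 raw readings to 16 observations can be sketched as follows. The slides do not say which sensors are paired, how a pair is combined, or what threshold is used, so the neighbouring-pair grouping, the use of `max`, and the cut-off of 512 are assumptions chosen only to make the idea concrete.

```python
THRESHOLD = 512  # assumed cut-off; the slides do not give the value used


def observation_from_sensors(readings, threshold=THRESHOLD):
    """Map 8 raw sensor readings to one of 16 observations (a 4-bit integer)."""
    if len(readings) != 8:
        raise ValueError("expected 8 sensor readings")
    obs = 0
    for i in range(4):
        # Assumed pairing: neighbouring sensors (0,1), (2,3), (4,5), (6,7).
        # Taking the max lets either sensor of a pair trigger the pseudo sensor.
        pair_value = max(readings[2 * i], readings[2 * i + 1])
        bit = 1 if pair_value >= threshold else 0
        obs |= bit << i
    return obs


healthy = observation_from_sensors([900, 880, 0, 0, 0, 0, 600, 700])
one_dead = observation_from_sensors([0, 880, 0, 0, 0, 0, 600, 700])
```

Under these assumptions, a single dead sensor (reading 0) leaves the observation unchanged as long as its partner still responds, which is one way the grouping could give robustness to single sensor failures.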

Solving a POMDP

The Witness algorithm is used to compute the optimal policy for the POMDP.

Witness does not scale well with the size of the state space.

Environment and State Space

64 possible states for the robot:

16 discrete positions.

The robot's heading is discretized into the four compass directions.

Sensor information was reduced to 4 bits by combining the sensors in pairs and thresholding.

Solving the LP required several days on a Sun Ultra 2 workstation.
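The count of 64 states follows from 16 positions times 4 headings. One possible indexing is sketched below; the numbering scheme (position-major, headings ordered N, E, S, W) is an assumption and need not match the state numbers used in the presentation.

```python
N_POSITIONS = 16
HEADINGS = ("N", "E", "S", "W")  # assumed ordering of the four compass headings


def encode_state(position, heading):
    """Combine a position index (0-15) and a heading into a state index (0-63)."""
    if not 0 <= position < N_POSITIONS:
        raise ValueError("position out of range")
    return position * len(HEADINGS) + HEADINGS.index(heading)


def decode_state(state):
    """Inverse of encode_state: return (position, heading)."""
    position, heading_index = divmod(state, len(HEADINGS))
    return position, HEADINGS[heading_index]


all_states = {encode_state(p, h) for p in range(N_POSITIONS) for h in HEADINGS}
```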

Interface Between Layers

The POMDP planner uses the current belief state to select the low-level behavior to activate.

The implementation tracks the state with the highest probability: the most likely current state.

If the most likely current state changes to the state that the POMDP planner wants, a reward of 1 is generated; otherwise, a reward of –1 is generated.
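The reward signal passed from the planner to the active RL module can be sketched as below. The function names are hypothetical, and the slides do not say whether –1 is also issued on steps where the most likely state does not change; this sketch takes the literal reading and returns –1 in every case other than a change to the desired state.

```python
def most_likely_state(belief):
    """Index of the state with the highest probability in the belief."""
    return max(range(len(belief)), key=lambda s: belief[s])


def transition_reward(prev_belief, new_belief, desired_state):
    """+1 if the most likely state changed to the desired state, otherwise -1."""
    before = most_likely_state(prev_belief)
    after = most_likely_state(new_belief)
    if after != before and after == desired_state:
        return 1
    return -1


reached = transition_reward([0.7, 0.2, 0.1], [0.1, 0.8, 0.1], desired_state=1)
missed = transition_reward([0.7, 0.2, 0.1], [0.1, 0.2, 0.7], desired_state=1)
```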

Since RLPOMDP is adaptive, the authors expect that the overall performance should degrade gracefully as sensors and actuators gradually fail.

State 13 is the goal state.

The POMDP state transition and observation probabilities are obtained by placing the robot in each of the 64 states and taking each action ten times.

With the policy in place, the RL modules are trained in the same way.

For each system configuration (RL or hand coded), the simulation is started from every position and orientation and performance is recorded.
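Estimating transition or observation probabilities from repeated trials amounts to counting outcomes. A minimal sketch, assuming relative-frequency estimates (the slides do not say how the ten samples per state and action were turned into probabilities); the helper name and the sample data are hypothetical.

```python
from collections import Counter


def estimate_probabilities(outcomes, n_outcomes):
    """Turn a list of observed outcome indices into relative frequencies."""
    if not outcomes:
        raise ValueError("need at least one trial")
    counts = Counter(outcomes)
    return [counts[k] / len(outcomes) for k in range(n_outcomes)]


# Hypothetical data: ten trials of one action from one state,
# recording the index of the state the robot ended up in.
next_states = [6, 6, 6, 5, 6, 6, 6, 6, 5, 6]
row = estimate_probabilities(next_states, n_outcomes=64)
```

Here `row` would be one row of the transition table for a single state and action; repeating this for every state and action fills the whole model.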

Failures during the trials are used to evaluate reliability.

Average steps to goal are used to assess efficiency.
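The two measures can be computed from per-trial records as sketched below. The record format (a step count for a successful trial, `None` for a failed one) and the choice to average over successful trials only are assumptions, not details given in the presentation.

```python
def summarize_trials(steps_per_trial):
    """Return (number of failures, average steps to goal over successful trials).

    steps_per_trial holds the step count of each trial that reached the goal
    and None for each trial that failed.
    """
    successes = [s for s in steps_per_trial if s is not None]
    failures = len(steps_per_trial) - len(successes)
    average = sum(successes) / len(successes) if successes else None
    return failures, average


failures, average_steps = summarize_trials([12, None, 8, 10])
```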

Gradual Sensor Failure

Battery power is used up, dust accumulates on sensors.

Intermittent Actuator Failure

Right motor control signal failed.
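The two failure modes, gradual sensor failure and intermittent actuator failure, could be injected into a simulator along the following lines. This is purely an illustrative sketch: the function names, the `health` scaling factor, and the per-step `failure_prob` are assumptions and do not describe the actual simulator hooks.

```python
import random


def degrade_sensor(reading, health):
    """Scale a raw reading by a health factor in [0, 1] (1 = intact, 0 = dead)."""
    if not 0.0 <= health <= 1.0:
        raise ValueError("health must be between 0 and 1")
    return int(reading * health)


def intermittent_motor(command, failure_prob, rng):
    """Drop the motor command (return 0) with probability failure_prob."""
    return 0 if rng.random() < failure_prob else command


rng = random.Random(0)
dimmed = degrade_sensor(1000, 0.5)            # sensor at half sensitivity
always_ok = intermittent_motor(5, 0.0, rng)   # failure never triggers
always_bad = intermittent_motor(5, 1.0, rng)  # failure always triggers
```

Lowering `health` a little on every step would mimic dust slowly accumulating on a sensor, while a nonzero `failure_prob` applied only to the right motor would mimic an unreliable right motor control signal.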

The RLPOMDP exhibits robust behavior in the presence of sensor and actuator degradation.

Future work: scaling up the problem.

To overcome the scaling problem of table lookup in RL, neural nets can be used (learn-forget cycle).

To increase the size of the state space for the POMDP, non-optimal solution algorithms are being investigated.

New behaviors will be added.