
Octopus Arm Project
Final Presentation

Dmitry Volkinshtein & Peter Szabo

Supervised by: Yaki Engel


Contents

1. Project Goal

2. The Octopus and its Arm

3. Octopus Arm Model

4. Reinforcement Learning

5. On-Line GPTD Algorithm

6. System Implementation

7. Debug and Testing

8. Learning Process

9. Challenges and Problems

10. Results

11. Conclusions


1. Project Goal

Teach a 2D octopus arm model to reach a given point in space by novel means of reinforcement learning.


2. The Octopus and its Arm (1/2)

An octopus is a carnivorous, eight-armed sea creature with a large soft head and two rows of suckers on the underside of each arm. It usually lives on the ocean floor.


2. The Octopus and its Arm (2/2)

An octopus arm is a muscular hydrostat: an organ capable of exerting force using muscles alone, without requiring a rigid skeleton.


3. Octopus Arm Model (1/3)

The octopus arm model is the outcome of a previous project in this lab. It is a 2D model written in C.

We upgraded it to C++ and added some features for better performance and compliance with the other parts of our project.


3. Octopus Arm Model (2/3)

The model simulates the physical behavior of the arm in its natural environment, considering:

• Muscle forces.
• Internal forces (maintaining the constant volume of the arm, which is filled with fluid).
• Gravity and the vertical flotation (buoyancy) forces.
• Drag forces caused by the water.


3. Octopus Arm Model (3/3)

The model approximates the continuous octopus arm by dividing it into segments and calculating the forces on each segment separately.
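For illustration, a minimal C++ sketch of such a per-segment force loop. All types, names, and constants here are hypothetical stand-ins, not the actual simulator code:

```cpp
#include <vector>

// Hypothetical 2D vector and vertex types; the real simulator's
// interfaces are not shown in the presentation.
struct Vec2 { double x = 0, y = 0; };

struct Vertex {
    Vec2 pos;            // position in the plane
    Vec2 vel;            // velocity
    Vec2 force;          // net force accumulated this step
    double mass = 1.0;
};

// One explicit time step: accumulate the force types named on the
// previous slide per vertex, then integrate (semi-implicit Euler).
void step(std::vector<Vertex>& vertices, double dt) {
    const double g = 9.81, buoyancy = 9.0, drag = 0.5;  // made-up constants
    for (auto& v : vertices) {
        v.force = {};                          // reset the accumulator
        v.force.y += v.mass * (buoyancy - g);  // gravity vs. flotation
        v.force.x -= drag * v.vel.x;           // water drag (linear model)
        v.force.y -= drag * v.vel.y;
        // Muscle forces and the constant-volume internal forces would be
        // added here; they couple neighboring vertices and are omitted.
    }
    for (auto& v : vertices) {
        v.vel.x += dt * v.force.x / v.mass;
        v.vel.y += dt * v.force.y / v.mass;
        v.pos.x += dt * v.vel.x;
        v.pos.y += dt * v.vel.y;
    }
}

int main() {
    std::vector<Vertex> arm(22);               // 2(N+1) vertices, N = 10
    for (int i = 0; i < 1000; ++i) step(arm, 0.001);
}
```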


4. Reinforcement Learning (1/4)

Reinforcement Learning (RL)

Learning a behavior through trial-and-error interactions with a dynamic environment.

Basic RL Definitions

Agent - A learning and decision-making entity.

Environment - An entity the agent interacts with, comprising everything that cannot be changed arbitrarily by the agent.


4. Reinforcement Learning (2/4)

Basic RL Definitions (cont.)

State - The condition of the environment.

Action - A choice made by the agent, based on the state.

Reward - The input by which the agent evaluates its actions.


4. Reinforcement Learning (3/4)

The agent uses two functions to produce its next action:

Value - For a given state X, the Value function returns the expected sum of rewards when the agent starts from X and follows the current Policy.

Policy - For a given Value function, the Policy function returns the action to take when the agent is in state X.
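In standard notation (the discount factor γ does not appear on the slide; an undiscounted sum is also possible), the Value can be written as:

```latex
% Expected discounted return from state X under the current policy \pi.
V^{\pi}(X) \;=\; \mathbb{E}\!\left[\,\sum_{t=0}^{\infty} \gamma^{t}\, r_t \;\middle|\; x_0 = X,\ \pi \right],
\qquad 0 \le \gamma \le 1 .
```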


4. Reinforcement Learning (4/4)

In our case:

• The State is the position and velocity of all the vertices in the arm. An arm represented by N segments has 2(N+1) vertices and therefore 8(N+1) state dimensions (see the arithmetic check below).

• The Action is a muscle activation in the arm, chosen by the agent module from a finite, constant set of activations.

• The Environment is the arm simulation, which provides the result of the activation (the next state of the arm).

• The Reward the agent gets is state dependent, and is set according to the proximity between the arm and the goal.

• The Agent is an On-Line GPTD-based learning module.
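A quick C++ check of this arithmetic (the function name is mine; the real simulator's interface is not shown here):

```cpp
#include <cstddef>
#include <iostream>

// Each vertex carries position (x, y) and velocity (vx, vy): 4 numbers.
// Two rows of N+1 vertices give 2(N+1) vertices, hence 8(N+1) dimensions.
constexpr std::size_t stateDimensions(std::size_t nSegments) {
    return 8 * (nSegments + 1);
}

int main() {
    std::cout << stateDimensions(3)  << '\n';  // 32: the short debug arm
    std::cout << stateDimensions(10) << '\n';  // 88: as quoted later on
}
```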


5. On-Line GPTD Algorithm (1/3)

TD(λ) - An algorithm family in which temporal differences are used to estimate the value function on-line.

GPTD - Gaussian Processes for TD Learning:

Assuming the rewards form a Gaussian random process (observed with noise), so that the measured rewards are samples of that process, we can estimate the value function using Gaussian estimation and a kernel function.
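In Engel, Mannor and Meir's GPTD formulation, each reward is a noisy temporal difference of consecutive values (the zero-mean prior and i.i.d. noise written here are the standard choices, not details given on the slide):

```latex
% GPTD generative model: V is a Gaussian process with kernel k, and the
% observed rewards are noisy temporal differences of consecutive values.
r(x_t) \;=\; V(x_t) \;-\; \gamma\, V(x_{t+1}) \;+\; N_t ,
\qquad V \sim \mathcal{GP}\bigl(0,\, k(\cdot,\cdot)\bigr),
\quad N_t \sim \mathcal{N}(0, \sigma^2).
```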


5. On-Line GPTD Algorithm (2/3)

GPTD disadvantages:

• Space consumption of O(t²).
• Time consumption of O(t³).

The proposed solution:

On-Line Sparsification applied to the GPTD algorithm. Instead of keeping a large number of evaluations of a vector function, we keep a dictionary of input vectors that can span, up to an accuracy threshold, the original vector function's space.
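Concretely, the on-line sparsification of Engel et al. keeps a dictionary of samples and admits a new sample x_t only if it fails the approximate linear dependence (ALD) test in feature space; the notation below is mine:

```latex
% ALD test: distance of \phi(x_t) from the span of the current dictionary
% \{\tilde{x}_1,\dots,\tilde{x}_{m_t}\}, computed entirely through kernels.
\delta_t \;=\; \min_{a}\,\Bigl\|\, \sum_{j=1}^{m_t} a_j\, \phi(\tilde{x}_j) - \phi(x_t) \Bigr\|^2
\;=\; k(x_t, x_t) - \tilde{k}_t^{\top} \tilde{K}_t^{-1} \tilde{k}_t
\;\;>\; \nu ,
```

where $\tilde{K}_t$ is the kernel matrix of the dictionary, $[\tilde{k}_t]_j = k(\tilde{x}_j, x_t)$, $\phi$ is the kernel's feature map, and $\nu$ is the accuracy threshold.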


5. On-Line GPTD Algorithm (3/3)

Applying on-line sparsification to GPTD yields:

• Recursive update rules - no matrix inversion needed.
• Matrix dimensions that depend on m_t (the dictionary size at time t), which is generally not linear in t.

Using this, we can calculate:

• The value estimate in O(m_t) time.
• The value variance in O(m_t²) time.
• Each dictionary update in O(m_t²) time.


6. System Implementation (1/5)

[Block diagram: the Octopus Arm Simulation (the Environment), the Explorer (OPI / Actor-Critic), the Agent (On-Line GPTD) and the Rewarder, with numbered arrows showing the data flow between them.]


6. System Implementation (2/5)

The Agent
• Uses the On-Line GPTD algorithm for value estimation.
• Derives its policy by one of the following:

1. Optimistic Policy Iteration (OPI). In this case there is a choice between two methods (sketched in code below):

   1. The ε-greedy policy: take the greedy action with probability 1 − ε, and a uniformly random action with probability ε.

   2. The Softmax policy: $P(a_i) = \dfrac{e^{V(s_i)/T}}{\sum_n e^{V(s_n)/T}}$

2. Actor-Critic - By using two GPTD instances.

3. Interval Estimation - Greedy over $V(s_i) + C\,\sigma^2_{V(s_i)}$.
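A compact C++ sketch of the two OPI exploration rules (assuming a precomputed value estimate per candidate activation; the names are mine, not the project's):

```cpp
#include <cmath>
#include <random>
#include <vector>

// values[i] is the estimated Value of the state that activation i leads
// to. Returns the index of the chosen activation.
int epsilonGreedy(const std::vector<double>& values, double epsilon,
                  std::mt19937& rng) {
    std::uniform_real_distribution<double> coin(0.0, 1.0);
    if (coin(rng) < epsilon) {                      // explore uniformly
        std::uniform_int_distribution<int> pick(0, int(values.size()) - 1);
        return pick(rng);
    }
    int best = 0;                                   // otherwise act greedily
    for (int i = 1; i < int(values.size()); ++i)
        if (values[i] > values[best]) best = i;
    return best;
}

// Softmax: P(a_i) proportional to exp(V(s_i) / T), T the temperature.
int softmax(const std::vector<double>& values, double temperature,
            std::mt19937& rng) {
    std::vector<double> weights;
    for (double v : values) weights.push_back(std::exp(v / temperature));
    std::discrete_distribution<int> dist(weights.begin(), weights.end());
    return dist(rng);
}

int main() {
    std::mt19937 rng(42);
    std::vector<double> values = {0.1, 0.7, 0.3};
    int a = epsilonGreedy(values, 0.1, rng);  // usually the greedy index 1
    int b = softmax(values, 0.5, rng);        // random, biased toward 1
    (void)a; (void)b;
}
```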


6. System Implementation (3/5)

The GPTD algorithm's two versions

1. Deterministic - Calculates the value for an MDP with deterministic state transitions.
   • Works better with Actor-Critic.
   • Relatively fast convergence.

2. Stochastic - Calculates the value for an MDP with stochastic state transitions.
   • Works better with Softmax and ε-greedy.
   • Slow convergence (more parameters).


6. System Implementation (4/5)

Goal types

• The goal is a small circular zone in the 2D space.

• It can be reached in two predefined ways:

1. With the tip of the arm.

2. With any vertex along the arm.

• In both cases it can be reached with either side of the arm (dorsal or ventral).

• A speed limit can be applied, i.e. reaching the goal zone faster than the limit does not count as reaching the goal.


6. System Implementation (5/5)

The Rewarder

There are three types of rewarding (sketched in code below), where d denotes the distance between the arm and the goal:

1. Discrete: $r_{goal} = 10,\; r_{no\,goal} = -1$

2. Exponential: $r_{goal} = 10,\; r_{no\,goal}(d) = 10\,e^{-d^2}$

3. Negative exponential: $r_{goal} = 10,\; r_{no\,goal}(d) = -(1 - e^{-d^2})$

In all cases:
► Hitting an obstacle yields a reward of $-r_{goal}$ and ends the trial.
► A penalty proportional to the arm's energy consumption in the last transition is subtracted from the reward.
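A minimal C++ sketch of these three schemes plus the common rules (the exact scaling inside the exponents is an assumption reconstructed from the slide; the distance units are not given):

```cpp
#include <cmath>
#include <iostream>

constexpr double kGoalReward = 10.0;  // r_goal on the slide

// d: distance between the arm and the goal zone.
double discreteReward(bool goal) {
    return goal ? kGoalReward : -1.0;
}

double exponentialReward(bool goal, double d) {
    return goal ? kGoalReward : kGoalReward * std::exp(-d * d);
}

double negativeExponentialReward(bool goal, double d) {
    return goal ? kGoalReward : -(1.0 - std::exp(-d * d));
}

// Common rules: hitting an obstacle ends the trial with -r_goal, and a
// penalty proportional to the energy spent is subtracted from the reward.
double finalReward(double base, bool hitObstacle,
                   double energyUsed, double energyPenaltyFactor) {
    return hitObstacle ? -kGoalReward
                       : base - energyPenaltyFactor * energyUsed;
}

int main() {
    std::cout << finalReward(exponentialReward(false, 0.5), false, 2.0, 0.1)
              << '\n';  // a near miss, with some energy spent
}
```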


7. Debug and Testing

• The GPTD implementation included the non-recursive version of the algorithm (used only for debugging).

• We tested GPTD (both deterministic and stochastic) on a discrete maze and witnessed correct value convergence.

• We tested the system with a short, degenerate arm of 3 segments.

• We ran the system in Supervised Learning mode (with γ = 0) and witnessed the value converging to the reward function.


8. Learning Process (1/4)

Agent’s Learning

1. Running a large number (~10³-10⁴) of consecutive learning trials; each trial starts from a random initial state.

2. Saving the GPTD parameters every N-th trial (a driver sketch follows below).
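In outline, the learning driver could look like this C++ sketch; all interfaces here are hypothetical placeholders, not the project's actual modules:

```cpp
#include <string>

// Hypothetical placeholders standing in for the project's modules.
struct ArmSimulation {
    void resetToRandomState() {}  // placeholder: randomize the arm state
};
struct GPTDAgent {
    void runTrial(ArmSimulation&) {}            // interact + GPTD updates
    void saveParameters(const std::string&) {}  // checkpoint to disk
};

int main() {
    ArmSimulation arm;
    GPTDAgent agent;
    const int kTrials = 10000;    // ~10^3 - 10^4 trials per the slide
    const int kSaveEveryN = 100;  // checkpoint interval (N)

    for (int trial = 1; trial <= kTrials; ++trial) {
        arm.resetToRandomState();  // each trial starts from a random state
        agent.runTrial(arm);
        if (trial % kSaveEveryN == 0)
            agent.saveParameters("gptd_" + std::to_string(trial) + ".sav");
    }
}
```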


8. Learning Process (2/4)

Agent's Performance Measurement

1. Creating a pool of evenly distributed arm states.

2. Loading each N-th GPTD save.

3. Running the greedy policy from each initial state in the pool, noting whether and when the goal was reached.

4. Merging the raw data into statistics: the goal percentage for each GPTD save, and the goal times for each initial state.


8. Learning Process (3/4)

Parameters' Calibration

• Simulation time - Total time; interval between activations.
• Activations - Total number and types.
• GPTD - Stochastic vs. deterministic; general parameters; kernel function (Gaussian, polynomial).


8. Learning Process (4/4)

Parameters' Calibration (cont.)

• Exploration
  - OPI: Softmax temperature; ε-greedy descent factors.
  - Actor-Critic: trial interval between GPTD switches.
  - Interval Estimation: variance multiplying factor.
• Reward function
  - Discrete vs. exponential.
  - Energy penalty factor.
  - Obstacles.
• Goal
  - Size, location, speed limit.


9. Challenges and Problems

► A 10-segment arm has 88 dimensions; it uses a huge amount of memory and lengthens the simulation time. Solution: occupy all of the lab's available computers for months.

► Too many activations cause greedy moves to consume a lot of time. Solution: creating a small set of activations that still supports the arm's full range of movement.

► GPTD estimates the value function of an MDP; when we use OPI the policy keeps changing, so the resulting process is not an MDP. Solution: forgetfulness, or Actor-Critic.

► Starting from a random state improves learning, yet producing a random state on-line is problematic with our simulation. Solution: selecting random states from GPTD's dictionary, preferring the less-visited ones.

10. Results (1/8)

10 Segment Arm - OPI, Gaussian Kernel
[Performance plot.]

10. Results (2/8)

10 Segment Arm - OPI, Polynomial Kernel
[Performance plot.]

10. Results (3/8)

10 Segment Arm - OPI, Polynomial Kernel
[Performance plot.]

10. Results (4/8)

10 Segment Arm - OPI, Polynomial Kernel
[Performance plot; goal percentage: 92%.]

10. Results (5/8)

10 Segment Arm - OPI, Polynomial Kernel
[Performance plot; goal percentage: 96%.]

10. Results (6/8)

10 Segment Arm - OPI, Polynomial Kernel
[Performance plot; goal percentage: 90%.]

10. Results (7/8)

10 Segment Arm - Actor-Critic, Gaussian Kernel
[Performance plot; goal percentage: 96%.]

10. Results (8/8)

Summary

1. OPI yields a peak performance of 96%.

2. Actor-Critic yields a peak performance of 96%.

GOAL !!!


11. Conclusions (1/2)

1. It is evident that the arm has gone through a significant learning process. Though not to perfection, the arm clearly improved its behavior, progressing from random movements to goal-oriented movements.

2. OPI and Actor-Critic both prove to be effective learning strategies. Some further investigation is needed to establish the performance differences between the two.

3. Despite some disappointing results, we are optimistic (hence still using OPI ;-). We believe that further fine-tuning of the system parameters and longer training times will result in a more decisive convergence.


11. Conclusions (2/2)

4. The system's large number of dimensions slows down the simulation and inflates the dictionary; this is the current bottleneck of the learning process. More computing power is needed to reach reasonable learning times.

5. Based on our latest results, we believe we will establish that On-Line GPTD is a practical algorithm that successfully tackles, with moderate computing resources, the difficult problem of reinforcement learning in a continuous state space with a vast number of dimensions.


Questions?


THE END