Closed Loop Temporal Sequence Learning: Learning to act in response to sequences of sensor events


Overview of different methods

[Taxonomy diagram of learning methods:

EVALUATIVE FEEDBACK (rewards) — example-based Machine Learning:
• Dynamic Programming (Bellman equation) → REINFORCEMENT LEARNING: δ-rule, Monte Carlo control, Q-Learning, SARSA, TD(λ) (often λ = 0), TD(1), TD(0), eligibility traces
• Classical Conditioning: Rescorla/Wagner, neuronal TD models ("Critic"), neuronal TD formalism, Actor/Critic (technical & basal ganglia), neuronal reward systems (basal ganglia)
→ Anticipatory control of actions and prediction of values

NON-EVALUATIVE FEEDBACK (correlations) — correlation-based UNSUPERVISED LEARNING:
• Hebb rule, differential Hebb rule ("slow"), differential Hebb rule ("fast"), supervised learning
• Synaptic Plasticity: STDP models (biophysical & network), ISO-Learning, ISO model of STDP, ISO-Control, correlation-based control (non-evaluative)
• Biophysics of synaptic plasticity: dopamine, glutamate, STDP, LTP (LTD = anti)
→ Correlation of signals

"You are here!" marks the correlation-based (ISO) branch.]

Natural Temporal Sequences in life and in "control" situations

Real life:
• Heat radiation predicts pain when touching a hot surface.
• The sound of prey may precede its smell, which will precede its taste.

Control:
• Force precedes position change.
• An electrical (disturbance) pulse may precede a change in a controlled plant.

Psychological experiment:
• Pavlovian and/or operant conditioning: the bell precedes the food.

Early: "Bell" → Late: "Food"

What we had so far!

The (ISO) learning rule: dρᵢ/dt = μ uᵢ(t) v′(t)

[Diagram: Pavlovian temporal sequence (Pavlov, 1927). Sensor 2 carries the conditioned input ("Bell"), which precedes "Food"; the output is salivation.]

This is an open-loop system!

Closed loop: [Diagram: an adaptable neuron is embedded in the environment — sensing on the way in, behaving on the way out.]

How can we assure behavioral & learning convergence?

This is achieved by starting with a stable reflex-like action and learning to supersede it by an anticipatory action.

Robot Application

Remove before being hit!

Early: "Vision" → Late: "Bump"

The Basic Control Structure: schematic diagram of a pure reflex loop

[Block diagram: set-point X0 and disturbances enter the controller, which sends control signals to the controlled system; feedback closes the loop. Compare to an electronic closed-loop controller!]

Reflex only: this structure assures initial (behavioral) stability ("homeostasis"). Think of a thermostat!

This is a closed-loop system before learning.
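To make the thermostat analogy concrete, here is a minimal Python sketch of such a pure reflex loop (plant dynamics and all constants are invented for illustration): the controller can only react AFTER a disturbance has already moved the controlled variable away from the set-point.

```python
# Minimal sketch of a pure reflex loop (thermostat-style proportional
# feedback; plant dynamics and all constants are invented).
def simulate_reflex_loop(steps=200, setpoint=45.0, gain=0.5):
    temperature = setpoint
    peak = 0.0
    for t in range(steps):
        disturbance = 5.0 if 50 <= t < 60 else 0.0      # heat pulse
        control = gain * (setpoint - temperature)       # reflex reaction
        temperature += 0.1 * (disturbance + control)    # toy plant
        peak = max(peak, abs(temperature - setpoint))
    return peak

# The loop is stable (homeostasis), but it reacts only after the
# disturbance has already driven the temperature off the set-point.
print(f"peak deviation: {simulate_reflex_loop():.2f} deg C")
```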

Robot Application: let's look at the reflex first.

Initially built-in behavior: a retraction reaction whenever an obstacle is touched (bump → retraction reflex).

Learning goal: correlate the vision signals with the touch signals and navigate without collisions.

This is an open-loop system → a closed-loop system before learning → a closed-loop system during learning: Closed Loop Temporal Sequence (TS) Learning.

T represents the temporal delay between vision and bump. Let's see how learning continues.

Closed Loop TS Learning

Anatomically this loop still exists, but ideally it should never be active again!

This is the system after learning. What has the learning done?

Elimination of the late reflex input corresponds mathematically to the condition X0 = 0. This was the major condition of convergence for the ISO/ICO rules (the other one was T = 0).

What is interesting is that the learning-induced behavior self-generates this condition. This is non-trivial, as it assures that the synaptic weight AND the behavior stabilize at exactly the same moment in time.
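The following is a minimal closed-loop sketch of this self-termination property (the plant, the signals, and all constants are invented, not taken from the robot): the early input x1 ("vision") precedes the reflex input x0 ("bump") by T steps, and the anticipatory action ρ1·u1 weakens the bump. Once the bump is fully avoided, x0 = 0 and hence u0′ = 0, so the ICO update dρ1/dt = μ u1 u0′ stops by itself — weight and behavior stabilize at the same moment.

```python
# Minimal closed-loop ICO sketch (hypothetical dynamics and parameters).
mu, T = 2.0, 5            # learning rate and vision->bump delay (invented)
rho1 = 0.0                # plastic weight of the early ("vision") input
for trial in range(20):
    u1, u0_prev, bump = 0.0, 0.0, 0.0
    for t in range(40):
        x1 = 1.0 if t == 10 else 0.0             # early sensor event
        u1 = 0.8 * u1 + x1                       # low-pass filtered input
        if t == 10 + T:                          # moment of potential bump
            bump = max(0.0, 1.0 - rho1 * u1)     # anticipation weakens it
        x0 = bump if t == 10 + T else 0.0        # late reflex input
        du0 = x0 - u0_prev                       # crude derivative of u0
        u0_prev = x0
        rho1 += mu * u1 * max(du0, 0.0)          # ICO: drho1 = mu*u1*u0'
    print(f"trial {trial:2d}: rho1 = {rho1:.3f}, residual bump = {bump:.3f}")
```

The printed residual bump shrinks towards 0 exactly as ρ1 freezes: the condition X0 = 0 is generated by the behavior itself.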


This is the system after learning. What has happened in engineering terms?

The inner pathway has now become a pure feed-forward path.

Formally (calculations in the Laplace domain; delta-pulse disturbance D):

X_0 = P_0 [V + D e^{-sT}]

X_1 = P_1 (D + P_{01} X_0 H_0) / (1 - P_1 P_{01} H V)

X_0 = e^{-sT} D

Demanding that the reflex input vanishes gives

H V = -P_1^{-1} e^{-sT} / (1 - P_{01} e^{-sT})    (a pure phase shift)

The learner has learned the inverse transfer function of the world and can therefore compensate the disturbance at the summation node!

[Block diagram: the disturbance D enters through P_1; the learned path contributes -e^{-sT} D against the delayed disturbance e^{-sT} D, so the summation node outputs 0.]

The outer path has, through learning, built a forward model of the inner path.

This is some kind of "holy grail" for an engineer, called: "Model-Free Feed-Forward Compensation".

For example: if you want to keep the temperature in an outdoor container constant, you can employ a thermostat.

You may, however, also first measure how the sun warms the liquid in the container, relating outside temperature to inside temperature (a calibration curve, "Eichkurve").

From this curve you can extract a control signal for cooling the liquid and react BEFORE the liquid inside starts warming up.

As you react BEFORE, this is called Feed-Forward Compensation (FFC). As you have used a curve, you have performed Model-Based FFC. Our robot learns the model, hence we do Model-Free FFC.
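A small sketch of the difference (toy heat dynamics; every number is invented): the feedback branch is the thermostat, while the feed-forward branch uses a calibration gain relating the outside disturbance to the required cooling, acting before the inside temperature has moved.

```python
# Sketch contrasting feedback-only control with feed-forward
# compensation (toy heat dynamics; all constants are invented).
def run(use_ffc, steps=300, setpoint=45.0, k_fb=0.3, k_ffc=0.08):
    inside, worst = setpoint, 0.0
    for t in range(steps):
        sun = 10.0 if 100 <= t < 200 else 0.0    # measurable disturbance
        cooling = k_fb * (setpoint - inside)     # thermostat (feedback)
        if use_ffc:
            cooling -= k_ffc * sun               # calibration-curve term
        inside += 0.05 * sun + 0.6 * cooling     # toy plant
        worst = max(worst, abs(inside - setpoint))
    return worst

print("feedback only:", round(run(False), 2))  # clear excursion
print("with FFC     :", round(run(True), 2))   # mostly cancelled
```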

Example of ISO instability as compared to ICO stability!

Remember the auto-correlation instability of ISO?

ISO: dρ₁/dt = μ u₁ v′
ICO: dρ₁/dt = μ u₁ u₀′
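A side-by-side sketch of the two rules on a low-pass filtered pulse pair (open loop; μ, the filter constant, and the pulse times are invented). The crucial difference: the ISO increment contains the output v, which itself depends on ρ1, so the update feeds back on itself through the auto-correlation term u1·u1′; the ICO increment depends only on the cross-correlation of u1 with the reflex input u0.

```python
import numpy as np

# Side-by-side sketch of the ISO and ICO updates on a pulse pair
# (open loop; mu, the filter constant, and pulse times are invented).
def learn(rule, mu=0.1, trials=100, T=5):
    rho0, rho1 = 1.0, 0.0
    for _ in range(trials):
        u1, u0 = np.zeros(60), np.zeros(60)
        for t in range(1, 60):
            u1[t] = 0.9 * u1[t-1] + (1.0 if t == 10 else 0.0)
            u0[t] = 0.9 * u0[t-1] + (1.0 if t == 10 + T else 0.0)
        v = rho0 * u0 + rho1 * u1            # output of the neuron
        dv = np.diff(v, prepend=0.0)
        du0 = np.diff(u0, prepend=0.0)
        if rule == "ISO":
            rho1 += mu * np.sum(u1 * dv)     # contains a u1*u1' self-term
        else:
            rho1 += mu * np.sum(u1 * du0)    # pure u1-u0 cross-correlation
    return rho1

# ISO's increment depends on rho1 itself, so the weight grows
# exponentially (auto-correlation instability); ICO's increment is
# rho1-independent and, in the closed loop, vanishes once u0 = 0.
print("ISO:", learn("ISO"))
print("ICO:", learn("ICO"))
```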

Temperature compensation before & after ICO learning

[Figure: temperature [°C] (41-51) vs. time (up to 17000 s); setpoint 45 °C; disturbances: a 40 s heat pulse and added cold water; control parameters shown alongside.]

Old ISO-Learning

[Figure: weights (-0.2 to 0.1) and temperature (42-50 °C) vs. time (0-800 s). Parameters: setpoint 45; pulse 12 s; learning rate 7.5e-7; gain 150; 15 filters; water flow 750 ml in 60 s; water temperature 17 °C; control as in "ico gut08". The control run shows that the setpoint is extremely hard to reach; learning also helps a lot here.]

New ICO-Learning

[Figure: weights and temperature vs. time for two runs. Run 1: setpoint 60; pulse 10 s; learning rate 4e-8; gain 250; water flow 400 ml/min; water temperature 17 °C; full compensation. Run 2: setpoint 70; pulse 20 s; learning rate 4e-7; gain 250; water flow 750 ml in 60 s; water temperature 17 °C; ONE-SHOT learning with over-compensation.]

Statistical analysis:

• Conventional differential Hebbian learning (ISO): one-shot learning in some instances; highly unstable.
• Input correlation learning (ICO): one-shot learning common; stable in a wide domain; stable learning possible with an about 10-100-fold higher learning rate.

Two more applications: RunBot (with different convergence conditions!)

Walking is a very difficult problem as such and requires a good, multi-layered control structure, on top of which learning can then take place.

An excursion into Multi-Layered Neural Network Control

Some statements:

• Animals are behaving agents. They receive inputs from MANY sensors and have to control the reaction of MANY muscles.

• This is a formidable problem, which our brain solves through many control layers.

• Attempts to achieve this with robots have so far not been very successful, as network control is somewhat fuzzy and unpredictable.

Adaptive Control during Walking: Three Loops

[Diagram: central control (brain) — terrain control → spinal cord (oscillations) — step control → motor neurons & sensors (reflex generation) — muscle length, ground contact → muscles & skeleton (biomechanics).]

Passive Walking Properties

Muscles & skeleton (biomechanics): self-stabilization by passive biomechanical properties (pendulum dynamics).

[Figure: panels (A)-(D), phase-plane plots of the passive walking motion.]

RunBot's Network of the Lowest Loop: Leg Control (Reflexive Control, cf. Cruse)

Motor neurons & sensors (reflex generation) acting on muscles & skeleton (biomechanics); feedback via muscle length and ground contact.

[Circuit diagram: stretch receptors AL/AR for the anterior angle of the left/right hip and ground-contact sensor neurons GL/GR drive extensor and flexor sensor neurons (ES, FS), which drive extensor and flexor motor neurons (EM, FM) of the left and right hip and knee motors through excitatory and inhibitory synapses (weights roughly between -30 and +15).]
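As a flavor of the building blocks of this network, here is a sketch of one antagonistic sensor-to-motor-neuron pair (a sigmoid rate neuron; the weights and thresholds here are invented, not the ones from the diagram): ground contact excites the extensor and inhibits the flexor, so each leg switches reflexively between stance and swing.

```python
import math

# Minimal sketch of one antagonistic reflex pair (numbers invented).
def rate(x):                           # sigmoid rate function
    return 1.0 / (1.0 + math.exp(-x))

def reflex_pair(ground_contact):
    g = 1.0 if ground_contact else 0.0
    extensor = rate(10.0 * g - 5.0)    # excited by ground contact
    flexor   = rate(-10.0 * g + 5.0)   # inhibited by ground contact
    return extensor, flexor

print(reflex_pair(True))    # stance: extensor active
print(reflex_pair(False))   # swing:  flexor active
```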

Instantaneous Parameter Switching

[Figure: (A) walking snapshots (1)-(13); (B) joint angles over time (1-8 s).]

Body (UBC) Control of RunBot

[Circuit: the body motor M is driven by an extensor/flexor neuron pair (E, F) with weights -1 and 1; UBC = upper body component.]

Long-loop reflex of one of the upper loops (central control, brain: terrain control)

Before or during a fall: leaning forward of trunk and arms.

Long-loop reflex of the uppermost loop: forward-leaning UBC vs. backward-leaning UBC.

The Leg Control as Target of the Learning

[The same leg-control circuit as above (AL/AR, GL/GR, ES/FS, EM/FM, hip and knee motors with excitatory and inhibitory synapses); learning now modifies synapses acting on this reflexive network.]

Learning in RunBot

[Circuit: learner neurons L1-L6 receive the infrared signal (IR) and the sensor signal AS, and project through changeable synapses onto target neurons of the leg control (left/right hips and knees: S, E, F neurons) and of the body control (UBC). Legend: changeable synapse; excitatory synapse; inhibitory synapse; inhibitory synapse (shunting, divisive); fixed weights ±1.]

[Diagram: a single learner neuron L6. The early input u1 (IR signal) enters through the plastic weight ρ6; the late input u0 (AS signal) enters through the fixed weight ρ60 = 1. The weight changes as dρ6/dt = μ u1 d(u0)/dt.]

Differential Hebbian learning related to spike-timing dependent plasticity.
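A sketch of a single such learner synapse following the diagram (signal shapes and constants are invented): the IR signal sees the slope a few steps before the acceleration signal registers it, and ρ6 grows with the product of the filtered early input and the derivative of the late input.

```python
# Sketch of a single RunBot learner synapse (illustrative signals).
# Early input u1: filtered infrared (IR) signal, seeing the slope first.
# Late input u0:  filtered acceleration signal (AS), felt on the slope;
# its fixed weight is rho60 = 1. Differential Hebbian update:
#     drho6/dt = mu * u1 * du0/dt

mu, rho6 = 0.05, 0.0
u1 = u0 = u0_prev = 0.0
for t in range(100):
    ir = 1.0 if 20 <= t < 25 else 0.0       # slope seen early
    accel = 1.0 if 30 <= t < 35 else 0.0    # slope felt later
    u1 = 0.9 * u1 + ir                      # low-pass filters
    u0 = 0.9 * u0 + accel
    rho6 += mu * u1 * (u0 - u0_prev)        # differential Hebbian step
    u0_prev = u0
print(f"learned weight rho6 = {rho6:.3f}")
```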

RunBot: Learning to climb a slope

Spektrum der Wissenschaft, Sept. 06

Human walking versus machine walking

ASIMO by Honda

Change of Gait when Walking Upwards

[Figure: RunBot's joints. Panels (A)-(D): left/right hip joint angles (60-150 deg) and left/right knee joint angles (80-200 deg) vs. time; gait diagram (stance/swing for left and right leg) across lower floor, ramp, and upper floor. Before learning RunBot falls at the ramp; after learning it manages the climb.]

[Figure: panels (E)-(H). Accelerometer sensor signal, infrared sensor signal, and upper-body component (deg) vs. time (4-28 s); the learning-relevant weights ρ1-ρ6 grow and then stabilize. At first the reaction comes too late; after learning it comes early enough.]

X0 = 0: the weights stabilize together with the behaviour.

Abbreviations: AL (AR): stretch receptor for the anterior angle of the left (right) hip; GL (GR): sensor neuron for ground contact of the left (right) foot; EI (FI): extensor (flexor) reflex interneuron; EM (FM): extensor (flexor) reflex motor neuron; ES (FS): extensor (flexor) reflex sensor neuron.

A different Learning Control Circuitry

[Circuit: the same hip/knee network (AL, AR, GL, GR, ES, FS, EM, FM), now with learning-control neurons L and interneurons I inserted before the motor neurons M. The early input x1 enters with plastic weight ρ1; the late reflex input x0 enters with a fixed weight; V is the learner output. Legend: sensor neuron or receptor; learning-control neuron L; interneuron I; motor neuron; inhibitory synapse (subtractive); inhibitory synapse (shunting, divisive); excitatory synapse.]

Motor Neuron Signal Structure (spike rate): Stability of STDP

[Diagram: the motor neuron (ES or EM) receives x0 through filter h with fixed weight ρ0 and x1 through filter h with plastic weight ρ1, producing output V; the weights change according to an STDP-like rule. Learning reverses the pulse sequence of x0 and x1 between left and right leg, which keeps the STDP-like learning stable.]


The actual experiment. Note: here our convergence condition is T = 0 (oscillatory!) and NOT, as usual, X0 = 0.

Speed Change by Learning

[Figure: panels (A)-(D). Speed (cm/s), gain, threshold (deg), and the weight ρ1 vs. time t (s). Note the oscillations of the weight: T ≈ 0!]

Stable learning is possible with an about 10-100-fold higher learning rate.

ISO: dρ₁/dt = μ u₁ v′
ICO: dρ₁/dt = μ u₁ u₀′

Driving School: learn to follow a track and develop receptive fields (RFs) in a closed-loop behavioral context.

ISO learning for temporal sequence learning (Porr and Wörgötter, 2003). Differential Hebbian learning rule:

dρᵢ/dt = μ uᵢ(t) v′(t)    (filtered input × derivative of the output)

Central features:
• Filtering of all inputs with low-pass filters (membrane property)
• Reproduces the asymmetric weight-change curve

The same differential Hebbian rule can be used for STDP, with T the interval between the paired inputs.
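To see how the low-pass filter ("membrane property") yields the asymmetric weight-change curve, here is a sketch that emulates an STDP pairing experiment with the rule above (filter constant, pulse times, and μ are invented): it computes the total weight change as a function of the pairing interval T.

```python
import numpy as np

# Sketch: asymmetric weight-change curve from the differential Hebbian
# rule with low-pass filtered inputs (all constants invented).
def lowpass(x, a=0.9):
    u = np.zeros_like(x)
    for t in range(1, len(x)):
        u[t] = a * u[t-1] + x[t]
    return u

def weight_change(T, n=200, mu=1.0):
    x1, x0 = np.zeros(n), np.zeros(n)
    x1[100] = 1.0                    # "presynaptic" pulse (early input)
    x0[100 + T] = 1.0                # "postsynaptic" pulse, shifted by T
    u1, u0 = lowpass(x1), lowpass(x0)
    v = u0                           # output dominated by the x0 pathway
    dv = np.diff(v, prepend=0.0)
    return mu * np.sum(u1 * dv)      # total of drho1/dt = mu * u1 * v'

for T in (-20, -10, -2, 2, 10, 20):
    print(f"T = {T:+3d}: total weight change = {weight_change(T):+.3f}")
```

Positive intervals (pre before post) give a positive total change, negative intervals a negative one: the asymmetric STDP-like curve.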

Transfer of a computational neuroscience approach to a technical domain: modelling STDP (plasticity) and using this approach (in a simplified way) to control behavioral learning.

An ecological approach to some aspects of theoretical neuroscience.

This is a closed-loop system before learning. Adding a secondary loop makes it a closed-loop system during learning: the delay constitutes the temporal sequence of sensor events!

Pavlov in "real life"!

[Block diagram: set-point and disturbances enter the controller, which sends control signals to the controlled system; feedback closes the loop. X0 is the late (reflex) input, X1 the early (anticipatory) input.]

What has happened to the system during learning?

The primary reflex reaction has effectively been eliminated and replaced by an anticipatory action.
