Closed Loop Temporal Sequence Learning
Learning to act in response to sequences of sensor events
Overview over the different methods ("You are here!")
[Overview diagram spanning Machine Learning, Classical Conditioning and Synaptic Plasticity, organized along two axes: "Anticipatory Control of Actions and Prediction of Values" vs. "Correlation of Signals":
- EVALUATIVE FEEDBACK (Rewards), example based: REINFORCEMENT LEARNING with Dynamic Programming (Bellman Eq.), Monte Carlo Control, Q-Learning, SARSA, TD(λ) (often λ=0), TD(1), TD(0), Rescorla/Wagner, Neur. TD-Models ("Critic"), Neur. TD-formalism, Actor/Critic (technical & Basal Ganglia), Neuronal Reward Systems (Basal Ganglia), Eligibility Traces, δ-Rule (supervised L.).
- NON-EVALUATIVE FEEDBACK (Correlations), correlation based: UN-SUPERVISED LEARNING with Hebb-Rule, Differential Hebb-Rule ("fast" and "slow"), STDP-Models (biophysical & network), Biophysics of Synaptic Plasticity (Dopamine, Glutamate), STDP, LTP (LTD=anti), ISO-Learning, ISO-Model of STDP, ISO-Control, Correlation-based Control (non-evaluative).]
Natural Temporal Sequences in life and in "control" situations

Real Life
• Heat radiation predicts pain when touching a hot surface.
• The sound of a prey may precede its smell, which will precede its taste.

Control
• Force precedes position change.
• An electrical (disturbance) pulse may precede a change in a controlled plant.

Psychological Experiment
• Pavlovian and/or Operant Conditioning: Bell precedes Food.

Early: "Bell" / Late: "Food"
What we had so far: the ISO learning rule

dρi/dt = μ ui(t) v'(t)

This is an open-loop system.
Temporal Sequence
Sensor 2: conditioned input
Bell, Food, Salivation (Pavlov, 1927)
This is an Open Loop System!
Closed loop: an adaptable neuron embedded in its environment, sensing and behaving.
How do we assure behavioral & learning convergence?
This is achieved by starting with a stable reflex-like action and learning to supersede it by an anticipatory action.
Remove before being hit!
Robot Application
Early: “Vision”
Late: “Bump”
[Block diagram: Set-Point, Controller, Control Signals, Controlled System; Feedback loop; Disturbances entering the controlled system; reflex input X0.]
Reflex Only
(Compare to an electronic closed-loop controller!)
This structure assures initial (behavioral) stability ("homeostasis"). Think of a thermostat!
This is a closed-loop system before learning.
The Basic Control Structure: schematic diagram of a pure reflex loop.
Robot Application. Let's look at the reflex first.
Bump triggers the retraction reflex.
Initially built-in behavior: retraction reaction whenever an obstacle is touched.
Learning goal: correlate the vision signals with the touch signals and navigate without collisions.
This is an open-loopsystem
This is a closed-loop
system before learning
This is a closed-loop
system during learning
Closed-Loop TS Learning
The T represents the temporal delay between vision and bump.
Let's see how learning continues.
Closed-Loop TS Learning
Anatomically this loop still exists, but ideally it should never be active again!
This is the system after learning.
What has the learning done?
Elimination of the late reflex input corresponds mathematically to the condition X0 = 0.
This was the major condition of convergence for the ISO/ICO rules (the other one was T = 0).
What is interesting here is that the learning-induced behavior self-generates this condition. This is non-trivial, as it assures that the synaptic weight AND the behavior stabilize at exactly the same moment in time.
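This self-generated convergence can be illustrated with a minimal trial-based sketch (all numbers and names are invented, and the weight update is a much-simplified stand-in for the ICO rule): the early vision signal x1 arrives T steps before the bump x0; the weight on x1 grows only while bumps still occur, so the weight and the behavior stabilize at the same trial.

```python
mu = 0.5            # learning rate (illustrative)
rho = 0.0           # plastic weight on the early vision signal x1
threshold = 1.0     # drive needed to trigger an anticipatory retraction
T = 5               # delay (steps) between vision (x1) and bump (x0)

bumps = []
for trial in range(10):
    bumped = False
    for t in range(20):
        x1 = 1.0 if t == 3 else 0.0         # early signal: obstacle seen
        if x1 and rho * x1 >= threshold:
            break                            # anticipatory retraction: no bump
        x0 = 1.0 if t == 3 + T else 0.0      # late signal: reflex input (bump)
        if x0:
            bumped = True
            rho += mu * 1.0 * x0             # correlate x1's trace (here ~1) with x0
            break                            # reflex retraction
    bumps.append(bumped)

# x0 = 0 after learning, so the weight freezes at the very trial the bumps stop
print(bumps, rho)
```

Once rho crosses the threshold the anticipatory action removes the bump, x0 stays 0, and the update term vanishes: weight and behavior converge together.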
This is the system after learning.
What has happened in engineering terms?
The inner pathway has now become a pure feed-forward path.
Formally (calculations in the Laplace domain, with a delta-pulse disturbance D):

X0 = P0 [V + D e^{-sT}]

X1 = P1 (D + P01 X0 H0) / (1 - P1 P01 HV)

X0 = e^{-sT} D

HV = -P1^{-1} e^{-sT} / (1 - P01 e^{-sT})   (phase shift)

The learner has learned the inverse transfer function of the world and can therefore compensate the disturbance at the summation node!
[Diagram: the disturbance D reaches the summation node through the world path e^{-sT} and through the learned path -e^{-sT} via the plant P1; the two contributions cancel to 0.]
The outer path has, through learning, built a forward model of the inner path.
This is some kind of "holy grail" for an engineer, called "Model-Free Feed-Forward Compensation".
For example: if you want to keep the temperature in an outdoor container constant, you can employ a thermostat.
You may, however, also first measure how the sun warms the liquid in the container, relating outside temperature to inside temperature (a calibration curve, "Eichkurve").
From this curve you can extract a control signal for cooling the liquid and react BEFORE the liquid inside starts warming up.
As you react BEFORE, this is called Feed-Forward Compensation (FFC).
As you have used a curve, you have performed Model-Based FFC.
Our robot learns the model, hence we do Model-Free FFC.
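The effect of such compensation can be sketched with a toy discrete-time model (all numbers and the first-order plant are invented; the compensator here uses a known gain and delay, i.e. the model-based case, whereas the robot would learn these): pure feedback reacts only after the temperature has deviated, while a feed-forward term driven by the early outside measurement cancels the disturbance before the liquid warms up.

```python
import numpy as np

# Toy first-order model: the sun's heat reaches the liquid d steps after
# the outside temperature (the early signal) rises.
d, k_sun, setpoint, steps = 3, 0.2, 45.0, 40
sun = np.zeros(steps); sun[10:25] = 1.0     # a sunny period (disturbance)

def run(feedforward):
    T, Kp, trace = setpoint, 0.5, []
    for k in range(steps):
        u = Kp * (setpoint - T)             # feedback: the plain thermostat
        if feedforward and k >= d:
            u -= k_sun * sun[k - d]         # compensation derived from the early
                                            # signal, timed to meet the arriving heat
        heat = k_sun * sun[k - d] if k >= d else 0.0
        T = T + heat + u
        trace.append(T)
    return trace

dev_fb = max(abs(t - setpoint) for t in run(False))
dev_ff = max(abs(t - setpoint) for t in run(True))
print(dev_fb, dev_ff)  # feed-forward holds the setpoint exactly in this toy model
```

The feedback-only run lags behind the disturbance and settles off the setpoint; the feed-forward run cancels it term by term.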
Example of ISO instability as compared to ICO stability!
Remember the auto-correlation instability of ISO?

ISO: dρ1/dt = μ u1 v'
ICO: dρ1/dt = μ u1 u0'
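The two rules can be compared in a minimal numerical sketch (all signals and parameters are invented). In ISO the update contains ρ1's own contribution to the output v, so even with the reflex input u0 silent the weight keeps drifting (the auto-correlation instability); in ICO the update is a pure input correlation and stops exactly when u0 is silent.

```python
import math

mu = 0.5
u1 = [math.sin(2 * math.pi * k / 50) for k in range(2000)]  # ongoing early input
u0 = [0.0] * 2000                                           # reflex input silent

def run(rule):
    rho, v_prev = 0.1, 0.0
    for k in range(1, 2000):
        v = rho * u1[k] + u0[k]         # output: learned pathway + reflex pathway
        dv = v - v_prev
        du0 = u0[k] - u0[k - 1]
        if rule == "ISO":
            rho += mu * u1[k] * dv      # dρ1/dt = μ u1 v': contains ρ1's own output
        else:                           # "ICO"
            rho += mu * u1[k] * du0     # dρ1/dt = μ u1 u0': pure input correlation
        v_prev = v
    return rho

print(run("ISO"), run("ICO"))  # ISO drifts upward; ICO stays exactly at 0.1
```

In discrete time the ISO update accumulates a strictly positive quadratic term from the weight's own output, hence the drift even without any u0 event; the ICO weight is frozen because u0' = 0 throughout.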
[Figure: Temperature compensation before & after ICO-Learning. Disturbance: heat pulse (40 sec); setpoint level 45 °C; cold water added; temperature axis 41-51 °C; total duration 17000 sec; control parameters shown.]
Old ISO-Learning
[Plot: weight traces and temperature (42-50 °C) over 800 sec. Parameters: setpoint 45; pulse 12 sec; learn rate 7.5E-07; gain 150; n filters 15; flow of water 750 ml in 60 sec; water temp 17 deg; control same as in ico gut08. The control runs show that the setpoint is extremely hard to reach; learning also helps a lot here.]
New ICO-Learning
[Plots: weight traces and temperatures (55-64 °C over 3500 sec; 66-76 °C over 2500 sec).
Run 1: setpoint 60; pulse 10 sec; learn rate 4E-08; gain 250; flow of water 400 ml/min; water temp 17 deg.
Run 2: setpoint 70; pulse 20 sec; learn rate 4E-07; gain 250; flow of water 750 ml in 60 sec; water temp 17 deg. Comment: ONE-SHOT learning.
Labels: full compensation; one-shot learning; over-compensation.]
Statistical Analysis: conventional differential Hebbian learning (ISO)
• Highly unstable
• One-shot learning (some instances)

Statistical Analysis: input correlation learning (ICO)
• Stable in a wide domain
• One-shot learning (common)
• Stable learning possible with an about 10-100-fold higher learning rate
Two more applications: RunBot, with different convergence conditions!
Walking as such is a very difficult problem and requires a good, multi-layered control structure; only on top of this can learning take place.
An excursion into Multi-Layered Neural Network Control
Some statements:
• Animals are behaving agents. They receive inputs from MANY sensors and have to control the reaction of MANY muscles.
• This is a terrific problem, solved by our brain through many control layers.
• Attempts to achieve this with robots have so far not been very successful, as network control is somewhat fuzzy and unpredictable.
Adaptive Control during Walking: Three Loops
• Central control (brain): terrain control
• Spinal cord (oscillations): step control
• Motor neurons & sensors (reflex generation): muscle length, ground contact
• Muscles & skeleton (biomechanics)
Muscles & skeleton (biomechanics): self-stabilization by passive biomechanical properties (pendulum).
[Figure: panels (A)-(D), passive walking properties.]
Motor neurons & sensors (reflex generation)
Muscles & skeleton (biomechanics)
Muscle length
RunBot’s Network of the lowest loop
[Network diagram: left/right hip and knee motors driven by extensor/flexor motor neurons (EM, FM) and extensor/flexor sensor neurons (ES, FS); stretch receptors for the anterior hip angles (AL, AR); ground-contact sensor neurons (GL, GR); excitatory and inhibitory synapses with fixed weights (e.g. ±1, ±10, ±15, -30).]
Leg Control of RunBotReflexive Control (cf. Cruse)
Instantaneous Parameter Switching
[Figure: panels (A), (B); gait phases (1)-(13) over time 1-8 s.]
Body (UBC) Control of RunBot
[Diagram: body motor neuron with extensor/flexor inputs (weights -1, 1) driving the upper body component (UBC).]
Central control (brain): terrain control.
Long-loop reflex of one of the upper loops.
Before or during a fall: leaning forward of trunk and arms.
Central control (brain): terrain control.
Long-loop reflex of the uppermost loop:
Forward-leaning UBC
Backward-leaning UBC
The Leg Control as Target of the Learning
[Diagram: learner neurons L1-L6 with changeable synapses, driven by the IR sensor and AS signals, projecting to the target neurons of the leg control (left/right hips and knees, extensor/flexor) and the body control (UBC); excitatory synapses, inhibitory synapses, and inhibitory synapses (shunting, divisive); fixed weights ±1.]
Learning in RunBot
[Diagram: learner neuron L6 with plastic weight ρ6 (ρ60 = 1); inputs u0 (AS) and u1 (IR) are correlated via a multiplication node and a derivative d/dt.]
Differential Hebbian learning related to spike-timing-dependent plasticity.
RunBot: Learning to climb a slope
Spektrum der Wissenschaft, Sept. 06
Human walking versus machine walking
ASIMO by Honda
[Plots (A)-(D): left/right hip and knee joint angles (deg, 60-200) vs. time; lower floor, ramp, upper floor; "falls" vs. "manages". Gait diagram: RunBot's joints, left/right leg, stance/swing over time, phases (1)-(3).]
Change of Gait when walking upwards
[Plots (E)-(H): accelerometer sensor signal, infrared sensor signal, upper-body component (deg, -40 to 120) and synaptic weights ρ1-ρ6 (0-15) vs. time (4-28 s).]
Too late vs. early enough.
X0 = 0: the weights stabilize together with the behaviour.
Learning-relevant parameters: AL (AR) stretch receptor for the anterior angle of the left (right) hip; GL (GR) sensor neuron for ground contact of the left (right) foot; EI (FI) extensor (flexor) reflex interneuron; EM (FM) extensor (flexor) reflex motor neuron; ES (FS) extensor (flexor) reflex sensor neuron.
A different Learning Control Circuitry
[Diagram: the leg-control network (hip/knee motors, EM/FM, ES/FS, AL/AR, GL/GR) with learning-control neurons L and interneurons I; sensor neurons/receptors; motor neurons; excitatory synapses, inhibitory synapses (subtractive), and inhibitory synapses (shunting, divisive).]
Motor Neuron Signal Structure (spike rate)
[Diagram: inputs x0 (e.g. ES or EM) and x1 are filtered (h), weighted (ρ0, ρ1) and summed to the output V driving motor neuron M; shown for the left and right leg.]
Learning reverses the pulse sequence.
Stability of STDP.
The actual experiment. Note: here our convergence condition has been T = 0 (oscillatory!) and NOT, as usual, X0 = 0.
[Plots (A)-(D) vs. t (s): speed (cm/s, 50-90), gain and threshold (deg, 90-115), and weight ρ1 (1.8-3).]
Speed Change by Learning
Note the oscillations! T ≈ 0!
Stable learning possible with an about 10-100-fold higher learning rate.

ISO: dρ1/dt = μ u1 v'
ICO: dρ1/dt = μ u1 u0'
Driving School: learn to follow a track and develop receptive fields (RFs) in a closed-loop behavioral context.
Central features:
• Filtering of all inputs with low-pass filters (membrane property)
• Reproduces the asymmetric weight-change curve
ISO learning for temporal sequence learning (Porr and Wörgötter, 2003):

dρi/dt = μ ui(t) v'(t)

with ui the filtered input and v' the derivative of the output.
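How this rule yields the asymmetric (STDP-like) weight-change curve can be sketched in a toy calculation (filter constants and pulse times are invented, and before learning the output v is approximated by the filtered reflex input u0 alone): the net weight change is positive when the input pulse precedes the reflex pulse (T > 0) and negative when it follows (T < 0), decaying with the interval.

```python
import numpy as np

def lowpass(x, tau):
    """Leaky integrator as a crude stand-in for the membrane filter h."""
    out, acc = np.zeros_like(x), 0.0
    for i, s in enumerate(x):
        acc += (s - acc) / tau
        out[i] = acc
    return out

def weight_change(T, steps=400, tau=20.0, mu=1.0):
    """Total weight change for an input pulse x1 at t=100 and a reflex
    pulse x0 at t=100+T, integrating d(rho1)/dt = mu * u1 * v'."""
    x1 = np.zeros(steps); x1[100] = 1.0
    x0 = np.zeros(steps); x0[100 + T] = 1.0
    u1, u0 = lowpass(x1, tau), lowpass(x0, tau)
    v = u0                            # before learning, v ~ filtered reflex input
    dv = np.diff(v, prepend=0.0)
    return mu * float(np.sum(u1 * dv))

for T in (-40, -10, 10, 40):
    print(T, weight_change(T))        # > 0 for T > 0, < 0 for T < 0
```

The sign flip at T = 0 is the causality detection at the heart of differential Hebbian learning, mirroring the asymmetric STDP curve.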
Differential Hebbian learning used for STDP (pre/post interval T):

dρi/dt = μ ui(t) v'(t)
Transfer of a computational neuroscience approach to a technical domain.
Modelling STDP (plasticity) and using this approach (in a simplified way) to control behavioral learning.
An ecological approach to some aspects of theoretical neuroscience.
This is a closed-loop
system before learning
This is a closed-loop
system during learning
Adding a secondary loop
This delay constitutes the temporal sequence of sensor events!
Pavlov in "real life"!
[Block diagram: Set-Point, Controller, Control Signals, Controlled System; Feedback; Disturbances; inputs X0 and X1 (early vs. late).]
What has happened to the system during learning?
The primary reflex reaction has effectively been eliminated and replaced by an anticipatory action.