Structure of a Neuron - uni-goettingen.de… · Structure of a Neuron: 2 Chemical synapse: Learning = Change of Synaptic Strength Neurotransmitter . Receptors . 3 . Machine Learning

1

Structure of a Neuron:

[Figure labels, levels of organization: Molecules, Synapses, Neurons, Local Nets, Areas, Systems, CNS]

At the dendrites, the incoming signals (incoming currents) arrive.

At the soma, the currents are finally integrated.

At the axon hillock, action potentials are generated if the potential crosses the membrane threshold.

The axon transmits (transports) the action potential to distant sites.

At the synapses, the outgoing signals are transmitted onto the dendrites of the target neurons.

2

Chemical synapse: Learning = Change of Synaptic Strength

Neurotransmitter Receptors

3

Overview over different methods

[Figure: overview diagram of learning methods. Recoverable labels:]

Column headings: Machine Learning; Classical Conditioning; Synaptic Plasticity

Branches: REINFORCEMENT LEARNING (example based), EVALUATIVE FEEDBACK (Rewards); UN-SUPERVISED LEARNING (correlation based), NON-EVALUATIVE FEEDBACK (Correlations)

Methods and models: Dynamic Prog. (Bellman Eq.); δ-Rule; supervised L.; Monte Carlo Control; Q-Learning; SARSA; TD(λ), often λ = 0; TD(1); TD(0); Rescorla/Wagner; Neur. TD-Models ("Critic"); Neur. TD-formalism; Actor/Critic (technical & Basal Gangl.); Eligibility Traces; Hebb-Rule; Differential Hebb-Rule ("slow"); Differential Hebb-Rule ("fast"); STDP-Models (biophysical & network); ISO-Learning; ISO-Model of STDP; ISO-Control; Correlation based Control (non-evaluative); STDP; LTP (LTD = anti)

Further labels: Anticipatory Control of Actions and Prediction of Values; Correlation of Signals; Neuronal Reward Systems (Basal Ganglia); Biophys. of Syn. Plasticity (Dopamine, Glutamate)

4

Different Types/Classes of Learning

Unsupervised Learning (non-evaluative feedback)
• Trial and Error Learning.
• No Error Signal.
• No influence from a Teacher, Correlation evaluation only.

Reinforcement Learning (evaluative feedback)
• (Classical & Instrumental) Conditioning, Reward-based Learning.
• "Good-Bad" Error Signals.
• Teacher defines what is good and what is bad.

Supervised Learning (evaluative error-signal feedback)
• Teaching, Coaching, Imitation Learning, Learning from examples and more.
• Rigorous Error Signals.
• Direct influence from a teacher/teaching signal.

5

An unsupervised learning rule:

Basic Hebb-Rule: $\frac{d\omega_i}{dt} = \mu\, u_i\, v, \quad \mu \ll 1$

For Learning: One input, one output.

A supervised learning rule (Delta Rule):

$\omega_i \rightarrow \omega_i - \mu \nabla_i E$

No input, no output, one error-function derivative, where the error function compares input examples with output examples.

A reinforcement learning rule (TD-learning):

$w_i \rightarrow w_i + \mu\,[r(t+1) + \gamma\, v(t+1) - v(t)]\, \bar{u}(t)$

One input, one output, one reward.
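To make the difference between the three rules concrete, the following is a minimal sketch that writes each one as a single update step. It reads the rules as dω_i/dt = µ u_i v (Hebb), ω_i → ω_i − µ ∂E/∂ω_i (Delta rule) and w_i → w_i + µ[r(t+1) + γ v(t+1) − v(t)] ū_i(t) (TD-learning). The function names, the Euler step for the Hebb rule, the default parameter values, and the toy numbers are assumptions made only for illustration.

```python
def hebb_step(w, u, v, mu=0.01, dt=1.0):
    """Unsupervised: one Euler step of dw_i/dt = mu * u_i * v."""
    return [wi + dt * mu * ui * v for wi, ui in zip(w, u)]


def delta_step(w, grad_E, mu=0.01):
    """Supervised: w_i -> w_i - mu * dE/dw_i (the gradient is supplied by the caller)."""
    return [wi - mu * gi for wi, gi in zip(w, grad_E)]


def td_step(w, u_bar, r_next, v_next, v_now, mu=0.01, gamma=0.9):
    """Reinforcement: w_i -> w_i + mu * [r(t+1) + gamma*v(t+1) - v(t)] * u_bar_i(t)."""
    td_error = r_next + gamma * v_next - v_now
    return [wi + mu * td_error * ui for wi, ui in zip(w, u_bar)]


# Toy values (assumed): each rule needs different information to update the weights.
w_hebb = hebb_step([0.0, 0.0], u=[1.0, 0.5], v=2.0)
w_delta = delta_step([1.0, 1.0], grad_E=[2.0, -2.0])
w_td = td_step([0.0, 0.0], u_bar=[1.0, 0.0], r_next=1.0, v_next=0.0, v_now=0.0)
```

The Hebb step only needs input and output, the Delta step only an error gradient, and the TD step additionally needs a reward and two successive value estimates.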

6

Self-organizing maps: unsupervised learning

[Figure labels: input, map]

Neighborhood relationships are usually preserved (+)

Absolute structure depends on the initial conditions and cannot be predicted (-)

7

[Slide 5 repeated: the three learning rules (basic Hebb rule, Delta rule, TD-learning)]

8

I. Pavlov

Classical Conditioning

9

[Slide 5 repeated: the three learning rules (basic Hebb rule, Delta rule, TD-learning)]

10

Supervised Learning: Example OCR

11

The influence of the type of learning on speed and autonomy of the learner

• Correlation-based learning: no teacher
• Reinforcement learning: indirect influence
• Reinforcement learning: direct influence
• Supervised learning: teacher
• Programming

[Figure axes: Learning Speed, Autonomy]

12

Hebbian learning

[Figure: cells A and B; activity of A and B over time t]

"When an axon of cell A excites cell B and repeatedly or persistently takes part in firing it, some growth processes or metabolic change takes place in one or both cells so that A's efficiency ... is increased."

Donald Hebb (1949)

13

Overview over different methods

[Figure: overview diagram of learning methods, same as on slide 3]

14

Hebbian Learning

Hebbian learning correlates inputs with outputs by the basic Hebb rule:

$\frac{d\omega_1}{dt} = \mu\, v\, u_1, \quad \mu \ll 1$

[Figure labels: input u_1, weight ω_1, output v]

Vector notation, cell activity: $v = \mathbf{w} \cdot \mathbf{u}$

This is a dot product, where w is the weight vector and u the input vector. Strictly, we need to assume that weight changes are slow; otherwise this turns into a differential equation.

15

Single input: $\frac{d\omega_1}{dt} = \mu\, v\, u_1, \quad \mu \ll 1$

Many inputs: $\frac{d\mathbf{w}}{dt} = \mu\, v\, \mathbf{u}, \quad \mu \ll 1$

As v is a single output, it is a scalar.

Averaging inputs: $\frac{d\mathbf{w}}{dt} = \mu\, \langle v\, \mathbf{u} \rangle, \quad \mu \ll 1$

We can just average over all input patterns and approximate the weight change by this. Remember, this assumes that weight changes are slow.

If we replace v with w · u we can write:

$\frac{d\mathbf{w}}{dt} = \mu\, Q \cdot \mathbf{w}$, where $Q = \langle \mathbf{u}\mathbf{u} \rangle$ is the input correlation matrix.

Note: Hebb yields an unstable (always growing) weight vector!
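The instability of the averaged rule dw/dt = µ Q · w can be checked numerically. The sketch below integrates the rule with simple Euler steps for a two-input neuron. The two toy input patterns, the starting weights, and the step size are assumptions chosen only for illustration.

```python
import math

# Two toy input patterns (assumed); Q = <u u> is their correlation matrix.
patterns = [(1.0, 0.8), (0.8, 1.0)]
n = len(patterns)
Q = [[sum(p[i] * p[j] for p in patterns) / n for j in range(2)] for i in range(2)]


def norm(w):
    """Euclidean length of the weight vector."""
    return math.sqrt(sum(x * x for x in w))


mu, dt = 0.01, 1.0
w = [0.1, -0.05]          # arbitrary small starting weights (assumed)
norms = [norm(w)]
for _ in range(2000):
    # Euler step of dw/dt = mu * Q . w
    dw = [mu * sum(Q[i][j] * w[j] for j in range(2)) for i in range(2)]
    w = [w[i] + dt * dw[i] for i in range(2)]
    norms.append(norm(w))

# Direction of the final weight vector (unit length).
direction = [x / norm(w) for x in w]
```

In this toy case the length of w grows at every step, while its direction settles along the eigenvector of Q with the largest eigenvalue (here the diagonal direction, where both weights are equal).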

16

Synaptic plasticity evoked artificially

Examples of long-term potentiation (LTP) and long-term depression (LTD). LTP was first demonstrated by Bliss and Lømo in 1973. Since then it has been induced in many different ways, usually in slice preparations. LTD was robustly shown by Dudek and Bear in 1992, in hippocampal slices.

20

LTP will lead to new synaptic contacts

21

Conventional LTP = Hebbian Learning

Symmetrical weight-change curve

[Figure: synaptic change (%) as a function of the timing of the pre- and postsynaptic events, t_Pre and t_Post]

The temporal order of input and output does not play any role.

23

Spike timing dependent plasticity - STDP

Markram et al., 1997

24

Spike Timing Dependent Plasticity: Temporal Hebbian Learning

Weight-change curve (Bi & Poo, 2001)

[Figure: synaptic change (%) as a function of the timing of the pre- and postsynaptic spikes, t_Pre and t_Post]

Pre precedes Post: long-term potentiation

Pre follows Post: long-term depression

25

Back to the math. We had:

Single input: $\frac{d\omega_1}{dt} = \mu\, v\, u_1, \quad \mu \ll 1$

Many inputs: $\frac{d\mathbf{w}}{dt} = \mu\, v\, \mathbf{u}, \quad \mu \ll 1$

Averaging inputs: $\frac{d\mathbf{w}}{dt} = \mu\, \langle v\, \mathbf{u} \rangle, \quad \mu \ll 1$

With v = w · u: $\frac{d\mathbf{w}}{dt} = \mu\, Q \cdot \mathbf{w}$, where $Q = \langle \mathbf{u}\mathbf{u} \rangle$ is the input correlation matrix.

Note: Hebb yields an unstable (always growing) weight vector!

26

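Covariance Rule(s)

Normally firing rates are only positive and plain Hebb would yield only LTP. Hence we introduce a threshold to also get LTD.

Output threshold: $\frac{d\mathbf{w}}{dt} = \mu\,(v - \Theta)\,\mathbf{u}, \quad \mu \ll 1$

Input vector threshold: $\frac{d\mathbf{w}}{dt} = \mu\, v\,(\mathbf{u} - \boldsymbol{\Theta}), \quad \mu \ll 1$

Often the threshold is set to the average activity over some reference time period (training period):

$\Theta = \langle v \rangle$ or $\boldsymbol{\Theta} = \langle \mathbf{u} \rangle$

Together with v = w · u we get:

$\frac{d\mathbf{w}}{dt} = \mu\, C \cdot \mathbf{w}$, where C is the covariance matrix of the input (http://en.wikipedia.org/wiki/Covariance_matrix):

$C = \langle (\mathbf{u} - \langle \mathbf{u} \rangle)(\mathbf{u} - \langle \mathbf{u} \rangle) \rangle = \langle \mathbf{u}\mathbf{u} \rangle - \langle \mathbf{u} \rangle \langle \mathbf{u} \rangle = \langle (\mathbf{u} - \langle \mathbf{u} \rangle)\,\mathbf{u} \rangle$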
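The three expressions for the covariance matrix, <(u − <u>)(u − <u>)>, <uu> − <u><u>, and <(u − <u>) u>, can be compared numerically on a small data set. The following sketch does this for four two-dimensional input patterns. The patterns and helper names are assumptions chosen only for illustration.

```python
# Toy input patterns (assumed), two inputs each.
patterns = [(1.0, 2.0), (3.0, 1.0), (2.0, 6.0), (0.0, 3.0)]
n, d = len(patterns), 2

# <u>: average input vector.
mean = [sum(p[i] for p in patterns) / n for i in range(d)]


def avg_outer(a_list, b_list):
    """Average outer product <a b> over all patterns."""
    return [[sum(a[i] * b[j] for a, b in zip(a_list, b_list)) / n
             for j in range(d)] for i in range(d)]


centered = [tuple(p[i] - mean[i] for i in range(d)) for p in patterns]

C1 = avg_outer(centered, centered)                      # <(u-<u>)(u-<u>)>
Q = avg_outer(patterns, patterns)                       # <u u>
C2 = [[Q[i][j] - mean[i] * mean[j] for j in range(d)]   # <u u> - <u><u>
      for i in range(d)]
C3 = avg_outer(centered, patterns)                      # <(u-<u>) u>
```

All three matrices agree up to rounding, which is why either the input-threshold or the output-threshold form leads to the same averaged rule dw/dt = µ C · w.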

27

The covariance rule can produce LTP without (!) post-synaptic output. This is biologically unrealistic and the BCM rule (Bienenstock, Cooper, Munro) takes care of this.

BCM-Rule

$\frac{d\mathbf{w}}{dt} = \mu\, v\, \mathbf{u}\,(v - \Theta), \quad \mu \ll 1$

As such this rule is again unstable, but BCM introduces a sliding threshold:

$\frac{d\Theta}{dt} = \nu\,(v^2 - \Theta), \quad \nu < 1$

Note that the rate of threshold change ν should be faster than the weight changes (µ), but slower than the presentation of the individual input patterns. This way the weight growth will be over-damped relative to the (weight-induced) activity increase.
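The stabilizing effect of the sliding threshold can be illustrated with a single constant input. The sketch below integrates dw/dt = µ v u (v − Θ) with Euler steps, once with a fixed threshold and once with dΘ/dt = ν (v² − Θ). The single-input setup, the starting values, the rates µ = 0.01 and ν = 0.1, and the cut-off at 10^6 are assumptions made for illustration.

```python
def run_bcm(sliding, steps=5000, mu=0.01, nu=0.1, u=1.0):
    """Integrate the BCM rule for one constant input u (toy setup, assumed).

    sliding=False keeps the threshold fixed; sliding=True lets it follow v**2.
    Returns the final weight and threshold.
    """
    w, theta = 0.5, 0.25
    for _ in range(steps):
        v = w * u                                # output of the linear neuron
        w += mu * v * u * (v - theta)            # dw/dt = mu * v * u * (v - theta)
        if sliding:
            theta += nu * (v * v - theta)        # dtheta/dt = nu * (v^2 - theta)
        if w > 1e6:                              # stop once the weight has clearly diverged
            break
    return w, theta


w_fixed, theta_fixed = run_bcm(sliding=False)
w_slide, theta_slide = run_bcm(sliding=True)
```

With the fixed threshold the weight runs away, while with the faster sliding threshold (ν > µ) weight and threshold settle at a finite fixed point in this toy setup.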

28

Problem: Hebbian learning can lead to unlimited weight growth.

Solution: Weight normalization
a) subtractive (subtract the mean change of all weights from each individual weight).
b) multiplicative (multiply each weight by a gradually decreasing factor).

Evidence for weight normalization: Reduced weight increase as soon as weights are already big (Bi and Poo, 1998, J. Neurosci.)
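The two normalization schemes can be sketched as a single Hebbian step followed by a correction. In the sketch below, the subtractive variant removes the mean weight change, which keeps the sum of the weights constant. The multiplicative variant is implemented in one common way, by rescaling the weight vector back to its previous length; this particular choice of factor, the function names, and the toy numbers are assumptions.

```python
import math


def hebb_dw(w, u, mu=0.1):
    """Plain Hebbian change mu * v * u with v = w . u."""
    v = sum(wi * ui for wi, ui in zip(w, u))
    return [mu * v * ui for ui in u]


def subtractive_step(w, u, mu=0.1):
    """Subtract the mean change of all weights from each individual change."""
    dw = hebb_dw(w, u, mu)
    mean_dw = sum(dw) / len(dw)
    return [wi + di - mean_dw for wi, di in zip(w, dw)]


def multiplicative_step(w, u, mu=0.1):
    """Apply the Hebbian change, then scale all weights by a common factor
    so that the length of the weight vector stays unchanged (assumed variant)."""
    dw = hebb_dw(w, u, mu)
    w_new = [wi + di for wi, di in zip(w, dw)]
    old_len = math.sqrt(sum(x * x for x in w))
    new_len = math.sqrt(sum(x * x for x in w_new))
    return [x * old_len / new_len for x in w_new]


w0 = [0.5, 0.2, 0.3]
u0 = [1.0, 0.0, 2.0]
w_sub = subtractive_step(w0, u0)
w_mul = multiplicative_step(w0, u0)
```

The subtractive step can push individual weights down even if their input was silent, whereas the multiplicative step shrinks all weights in proportion to their size.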

29

Examples of Applications
• Kohonen (1984): speech recognition, a map of phonemes in the Finnish language
• Goodhill (1993) proposed a model for the development of retinotopy and ocular dominance, based on Kohonen maps (SOM)
• Angéniol et al. (1988): travelling salesman problem (an optimization problem)
• Kohonen (1990): learning vector quantization (pattern classification problem)
• Ritter & Kohonen (1989): semantic maps

[Figure panels: OD, ORI]

30

Differential Hebbian Learning of Sequences

Learning to act in response to sequences of sensor events

31

Overview over different methods

[Figure: overview diagram of learning methods, same as on slide 3]

You are here!

32

I. Pavlov

History of the Concept of Temporally Asymmetrical Learning: Classical Conditioning

34

I. Pavlov

History of the Concept of Temporally Asymmetrical Learning: Classical Conditioning

Correlating two stimuli which are shifted with respect to each other in time.

Pavlov's Dog: "Bell comes earlier than Food"

This requires the system to remember the stimuli.

Eligibility Trace: A synapse remains "eligible" for modification for some time after it was active (Hull 1938, at that time still an abstract concept).

35

Classical Conditioning: Eligibility Traces

[Figure labels: Unconditioned Stimulus (Food); Conditioned Stimulus (Bell); weights ω_0 = 1 and ω_1; summation Σ; Response; Stimulus Trace E; multiplication X; weight change Δω_1]

The first stimulus needs to be "remembered" in the system.

36

I. Pavlov

History of the Concept of Temporally Asymmetrical Learning: Classical Conditioning

Eligibility Traces

Note: There are vastly different time-scales for (Pavlov's) behavioural experiments (typically up to 4 seconds) as compared to STDP at neurons (typically 40-60 milliseconds at most).

37

Defining the Trace

In general there are many ways to do this, but usually one chooses a trace that looks biologically realistic and allows for some analytical calculations, too.

$h(t) = \begin{cases} 0 & t < 0 \\ h_k(t) & t \ge 0 \end{cases}$

EPSP-like functions:

α-function: $h_k(t) = t\, e^{-at}$

Double exp.: $h_k(t) = \frac{1}{\eta}\left(e^{-at} - e^{-bt}\right)$

This one is the easiest to handle analytically and is thus often used.

Damped sine wave: $h_k(t) = \frac{1}{b} \sin(bt)\, e^{-at}$

Shows an oscillation.
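The three candidate trace functions can be written down directly. The sketch below reads them as the α-function t·e^(−at), the double exponential (1/η)(e^(−at) − e^(−bt)), and the damped sine (1/b)·sin(bt)·e^(−at), each set to zero for t < 0. The function names and the default parameter values are assumptions chosen only for illustration.

```python
import math


def alpha_kernel(t, a=1.0):
    """Alpha function t * exp(-a t) for t >= 0, zero before."""
    return t * math.exp(-a * t) if t >= 0 else 0.0


def double_exp_kernel(t, a=0.5, b=2.0, eta=1.0):
    """Double exponential (exp(-a t) - exp(-b t)) / eta for t >= 0, zero before."""
    return (math.exp(-a * t) - math.exp(-b * t)) / eta if t >= 0 else 0.0


def damped_sine_kernel(t, a=0.5, b=2.0):
    """Damped sine sin(b t) * exp(-a t) / b for t >= 0, zero before."""
    return math.sin(b * t) * math.exp(-a * t) / b if t >= 0 else 0.0


# Sample each trace on a coarse time grid (in arbitrary time units).
times = [0.25 * k for k in range(-4, 41)]
traces = {
    "alpha": [alpha_kernel(t) for t in times],
    "double_exp": [double_exp_kernel(t) for t in times],
    "damped_sine": [damped_sine_kernel(t) for t in times],
}
```

With these parameters the α-function and the double exponential stay non-negative and form a single hump, while the damped sine changes sign, which is the oscillation mentioned above.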

38

Overview over different methods

[Figure: overview diagram of learning methods, same as on slide 3]

The mathematical formulation of the learning rules is similar, but the time-scales are very different.

39

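Differential Hebb Learning Rule

$\frac{d}{dt}\omega_i(t) = \mu\, u_i(t)\, v'(t)$

Simpler notation: x = input, u = traced input

[Figure labels: early input "Bell" (x_i, traced input u_i, weight ω); late input "Food" (x_0, traced input u_0); summation Σ; output v and its derivative v'(t); multiplication x]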

40

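Convolution is used to define the traced input u; correlation is used to calculate the weight growth w.

Convolution (u):

$h(x) = \int_{-\infty}^{\infty} f(u)\, g(x - u)\, du = f(x) * g(x) = g(x) * f(x)$

Correlation (w):

$h(x) = \int_{-\infty}^{\infty} f(u)\, g(u - x)\, du = f(x) \star g(x) \neq g(x) \star f(x)$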
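The difference between the two operations is easy to see in a discrete version. The following sketch implements a discrete convolution, sum over k of f[k]·g[n − k], and a discrete cross-correlation, sum over k of f[k]·g[k − x], for short lists. The function names and the toy sequences are assumptions made for illustration.

```python
def convolve(f, g):
    """Discrete convolution h[n] = sum_k f[k] * g[n - k]."""
    h = [0.0] * (len(f) + len(g) - 1)
    for k, fk in enumerate(f):
        for m, gm in enumerate(g):
            h[k + m] += fk * gm
    return h


def correlate(f, g):
    """Discrete cross-correlation h[x] = sum_k f[k] * g[k - x].

    Returns a dict that maps each lag x to the correlation value.
    """
    lags = range(-(len(g) - 1), len(f))
    return {x: sum(f[k] * g[k - x] for k in range(len(f)) if 0 <= k - x < len(g))
            for x in lags}


f = [1.0, 2.0, 3.0]
g = [0.0, 1.0]

conv_fg, conv_gf = convolve(f, g), convolve(g, f)
corr_fg, corr_gf = correlate(f, g), correlate(g, f)
```

Swapping the arguments leaves the convolution unchanged, but mirrors the correlation in the lag x. This sign sensitivity is what allows a correlation-based rule to distinguish "input before output" from "output before input".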

41

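Differential Hebbian Learning

$\frac{d}{dt}\omega_i(t) = \mu\, u_i(t)\, v'(t)$

Here u_i is the filtered input and v' is the derivative of the output.

Output: $v(t) = \sum_i \omega_i(t)\, u_i(t)$

Produces an asymmetric weight-change curve (if the filters h produce unimodal "humps").

[Figure: weight change Δω as a function of the time difference T]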
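The asymmetric weight-change curve can be reproduced numerically for two pulse inputs. The sketch below integrates dω_1/dt = µ u_1(t) v'(t) for one pairing, with both inputs filtered by an α-function. It assumes that the second input has a fixed weight of 1 and that ω_1 is still close to zero, so that v(t) is approximately the filtered second input. The kernel choice, this approximation, the time grid, and the function names are all assumptions made for illustration.

```python
import math


def alpha_kernel(t, a=1.0):
    """Unimodal 'hump' filter t * exp(-a t) for t >= 0, zero before."""
    return t * math.exp(-a * t) if t >= 0 else 0.0


def weight_change(T, mu=1.0, dt=0.01, t_max=40.0, t_ref=10.0):
    """Total change of omega_1 for one pairing with time difference T.

    Input x1 is pulsed at t_ref, input x0 at t_ref + T. Assumption: omega_0 = 1
    and omega_1 ~ 0, so the output v(t) is approximated by the filtered x0.
    """
    n = int(t_max / dt)
    u1 = [alpha_kernel(k * dt - t_ref) for k in range(n)]
    v = [alpha_kernel(k * dt - (t_ref + T)) for k in range(n)]
    total = 0.0
    for k in range(1, n):
        dv = (v[k] - v[k - 1]) / dt          # finite-difference estimate of v'(t)
        total += mu * u1[k] * dv * dt        # Euler integration of mu * u1 * v'
    return total


# Weight change for a few time differences T (arbitrary time units).
curve = {T: weight_change(T) for T in (-2.0, -1.0, 0.0, 1.0, 2.0)}
```

In this sketch the weight grows when x1 precedes x0 (T > 0), shrinks when the order is reversed, and changes very little for T = 0.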

42

Conventional LTP

Symmetrical weight-change curve

[Figure: synaptic change (%) as a function of the timing of the pre- and postsynaptic events, t_Pre and t_Post]

The temporal order of input and output does not play any role.

43

Differential Hebbian Learning

$\frac{d}{dt}\omega_i(t) = \mu\, u_i(t)\, v'(t)$, with the output $v(t) = \sum_i \omega_i(t)\, u_i(t)$

Produces an asymmetric weight-change curve (if the filters h produce unimodal "humps").

[Figure: weight change Δω as a function of the time difference T]

44

Spike-timing-dependent plasticity (STDP): Some vague shape similarity

Weight-change curve (Bi & Poo, 2001)

[Figure: synaptic change (%) as a function of T = t_Post - t_Pre (ms)]

Pre precedes Post: long-term potentiation

Pre follows Post: long-term depression

45

Overview over different methods

[Figure: overview diagram of learning methods, same as on slide 3]

You are here!

46

The biophysical equivalent of Hebb's postulate

[Figure labels: Plastic Synapse (NMDA/AMPA); Presynaptic Signal (Glu); Postsynaptic: Source of Depolarization]

Pre-Post Correlation, but why is this needed?

47

Plasticity is mainly mediated by so-called N-methyl-D-aspartate (NMDA) channels. These channels respond to glutamate as their transmitter and they are voltage dependent:

[Figure: two channel panels with labels "in" and "out"]

48

Biophysical Model: Structure

[Figure labels: input x, NMDA synapse, output v]

Hence NMDA synapses (channels) do require a (Hebbian) correlation between pre- and post-synaptic activity!

Source of depolarization:
1) Any other drive (AMPA or NMDA)
2) Back-propagating spike

49

Local Events at the Synapse

Current sources "under" the synapse:
• Synaptic current (I_synaptic)
• Influence of a back-propagating spike (I_BP)
• Currents from all parts of the dendritic tree (I_dendritic)

[Figure labels: Σ_Local, Σ_Global, x_1, u_1, v]

51

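On "Eligibility Traces"

[Figure labels, ISO-Learning circuit: inputs x_1 and x_0, filter h, weights ω_1 and ω_0, output v, derivative v', multiplication X]

[Figure labels, biophysical counterpart: Pre-syn. Spike, BP- or D-Spike, weight ω, summation Σ]

[Plots: g_NMDA [nS] over t [ms] (0 to 80 ms); V*h]

Membrane potential:

$C \frac{dV(t)}{dt} = \sum_i (\omega_i + \Delta\omega_i)\, g_i(t)\,(E_i - V) + \frac{V_{rest} - V(t)}{R} + I_{dep}$

Here ω_i + Δω_i is the weight, g_i(t)(E_i − V) the synaptic input, and I_dep the depolarization source.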

52

Model structure

• Dendritic compartment
• Plastic synapse with NMDA channels: source of Ca2+ influx and coincidence detector
• Source of depolarization:
  1. Back-propagating spike
  2. Local dendritic spike

$\frac{dV}{dt} \sim \sum_i g_i(t)\,(E_i - V) + I_{dep}$

[Figure labels: Plastic Synapse (NMDA/AMPA, conductance g); Source of Depolarization (BP spike, dendritic spike)]

53

Plasticity Rule (Differential Hebb)

Instantaneous weight change:

$\frac{d}{dt}\omega(t) = \mu\, c_N(t)\, F'(t)$

c_N(t): presynaptic influence (glutamate effect on NMDA channels)

F'(t): postsynaptic influence

Membrane potential at the plastic (NMDA) synapse, driven by the NMDA/AMPA conductances g and the source of depolarization:

$\frac{dV}{dt} \sim \sum_i g_i(t)\,(E_i - V) + I_{dep}$

54

Pre-synaptic influence

Normalized NMDA conductance:

$c_N = \frac{e^{-t/\tau_1} - e^{-t/\tau_2}}{1 + \eta\,[Mg^{2+}]\, e^{-\gamma V}}$

NMDA channels are instrumental for LTP and LTD induction (Malenka and Nicoll, 1999; Dudek and Bear, 1992).

[Plot: g_NMDA [nS] over t [ms] (0 to 80 ms)]

Plastic (NMDA) synapse, with the NMDA/AMPA conductances g and the source of depolarization:

$\frac{dV}{dt} \sim \sum_i g_i(t)\,(E_i - V) + I_{dep}$

$\frac{d}{dt}\omega(t) = \mu\, c_N(t)\, F'(t)$
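The voltage dependence of the normalized NMDA conductance can be made tangible with a small function. The sketch below reads the conductance as c_N = (e^(−t/τ1) − e^(−t/τ2)) / (1 + η [Mg2+] e^(−γV)). All default parameter values (time constants in ms, η, magnesium concentration, γ) and the convention that c_N is zero before the presynaptic event are illustrative assumptions, not values given on the slides.

```python
import math


def nmda_c(t_ms, v_mv, tau1=40.0, tau2=0.33, eta=0.33, mg=1.0, gamma=0.06):
    """Normalized NMDA conductance c_N(t, V).

    t_ms: time since the presynaptic event in ms; v_mv: membrane potential in mV.
    All default parameter values are illustrative assumptions.
    """
    if t_ms < 0:
        return 0.0                                   # assumed: no conductance before the event
    time_course = math.exp(-t_ms / tau1) - math.exp(-t_ms / tau2)
    mg_block = 1.0 + eta * mg * math.exp(-gamma * v_mv)
    return time_course / mg_block


# Same time after the presynaptic event, two different membrane potentials.
c_rest = nmda_c(5.0, -65.0)
c_depolarized = nmda_c(5.0, 0.0)
```

At the same time after transmitter release, the conductance is much larger for the depolarized membrane, which is the coincidence-detection property that makes the channel sensitive to pre-post correlations.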

55

Depolarizing potentials in the dendritic tree

[Plots: V [mV] over t [ms] (0 to 20 ms)]

Dendritic spikes (Larkum et al., 2001; Golding et al., 2002; Häusser and Mel, 2003)

Back-propagating spikes (Stuart et al., 1997)

56

Postsynaptic influence

Plastic (NMDA) synapse, with the NMDA/AMPA conductances g and the source of depolarization:

$\frac{dV}{dt} \sim \sum_i g_i(t)\,(E_i - V) + I_{dep}$

$\frac{d}{dt}\omega(t) = \mu\, c_N(t)\, F'(t)$

For F we use a low-pass filtered ("slow") version of a back-propagating or a dendritic spike.

57

BP and D-Spikes

[Plots: V [mV] over t [ms], shown on time axes of 0 to 20 ms, 0 to 80 ms, and 0 to 150 ms]

58

Weight Change Curves

Source of Depolarization: Back-Propagating Spikes

[Plots: back-propagating spike, V [mV] over t [ms]; NMDAr activation; weight change curve, Δω over T [ms], with T = t_Post - t_Pre]

59

THINGS TO REMEMBER

The biophysical equivalent of Hebb's PRE-POST CORRELATION postulate:

[Figure labels: Plastic Synapse (NMDA/AMPA); Presynaptic Signal (Glu); Postsynaptic: Source of Depolarization]

Possible sources are:
• BP-Spike
• Dendritic Spike
• Local Depolarization

60

One word about Supervised Learning

Overview of different methods – Supervised Learning

[Figure: overview map of learning methods (machine learning, classical conditioning, synaptic plasticity), as on the overview slide at the start. The supervised learning branch is marked: "And many more".]

Supervised learning methods are mostly non-neuronal and will therefore not be discussed here.

So Far:

• Open Loop Learning

All slides so far !

CLOSED LOOP LEARNING

• Learning to Act (to produce appropriate behavior)

• Instrumental (Operant) Conditioning

All slides to come now !

[Figure: Pavlov's experiment (Pavlov, 1927). The bell (conditioned input, Sensor 2) and the food arrive in a temporal sequence; the response is salivation.]

Closed loop

[Figure: an adaptable neuron coupled to the environment (Env.) through sensing and behaving.]

Instrumental/Operant Conditioning

Overview of different methods – Closed Loop Learning

[Figure: overview map of learning methods (machine learning, classical conditioning, synaptic plasticity), as on the overview slide at the start.]

Behaviorism

“All we need to know in order to describe and explain behavior is this: actions followed by good outcomes are likely to recur, and actions followed by bad outcomes are less likely to recur.” (Skinner, 1953)

B.F. Skinner (1904-1990) invented the type of experiments called operant conditioning.

Operant behavior occurs without an observable external stimulus and operates on the organism’s environment. The behavior is instrumental in securing a stimulus. It is more representative of everyday learning.

Skinner Box

OPERANT CONDITIONING TECHNIQUES

• POSITIVE REINFORCEMENT = increasing a behavior by administering a reward

• NEGATIVE REINFORCEMENT = increasing a behavior by removing an aversive stimulus when a behavior occurs

• PUNISHMENT = decreasing a behavior by administering an aversive stimulus following a behavior OR by removing a positive stimulus

• EXTINCTION = decreasing a behavior by not rewarding it

Overview of different methods

[Figure: overview map of learning methods (machine learning, classical conditioning, synaptic plasticity), as on the overview slide at the start, with a marker: "You are here!"]

How to assure behavioral and learning convergence?

This is achieved by starting with a stable reflex-like action and learning to supersede it by an anticipatory action.

Remove before being hit!

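Reflex Only

[Figure: closed-loop control diagram. A controller with a set-point sends control signals to the controlled system, which is subject to disturbances; feedback returns to the controller through the input X0.]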

(Compare to an electronic closed loop controller!)

This structure assures initial (behavioral) stability (“homeostasis”)

Think of a Thermostat !

Robot Application

[Figure: a summation unit (Σ) with weight ω receives an early input (“Vision”) and a late input (“Bump”).]

Robot Application

Initially built-in behavior: Retraction reaction whenever an obstacle is touched.

Learning Goal: Correlate the vision signals with the touch signals and navigate without collisions.

Robot Example

[Figure: closed-loop control diagram with controller, controlled system, control signals, feedback, disturbances and set-point, now with two inputs: X1 (early) and X0 (late).]

What has happened to the system during learning?

The primary reflex re-action has effectively been eliminated and replaced by an anticipatory action

Reinforcement Learning (RL)

• Learning from rewards (and punishments).

• Learning to assess the value of states.

• Learning goal-directed behavior.

RL was developed rather independently in two different fields:

1) Dynamic Programming and Machine Learning (Bellman Equation).

2) Psychology (Classical Conditioning) and later Neuroscience (Dopamine System in the brain).

Back to Classical Conditioning (I. Pavlov)

U(C)S = Unconditioned Stimulus

U(C)R = Unconditioned Response

CS = Conditioned Stimulus

CR = Conditioned Response

Less “classical” but also Conditioning ! (Example from a car advertisement)

Learning the association CS → U(C)R

Porsche → Good Feeling

Overview of different methods – Reinforcement Learning

[Figure: overview map of learning methods (machine learning, classical conditioning, synaptic plasticity), as on the overview slide at the start, with a marker: "You are here!"]

Overview of different methods – Reinforcement Learning

[Figure: the same overview map of learning methods, with a second marker: "And later also here!"]

Notation

US = r, R = “Reward”

CS = s, u = Stimulus = “State” [1]

CR = v, V = (Strength of the) Expected Reward = “Value”

UR = --- (not required in mathematical formalisms of RL)

Weight = ω = weight used for calculating the value; e.g. v = ωu

Action = a = “Action”

Policy = π = “Policy”

[1] Note: The notion of a “state” really only makes sense as soon as there is more than one state.

A note on “Value” and “Reward Expectation”

If you are at a certain state, then you would value this state according to how much reward you can expect when moving on from this state to the end-point of your trial.

Hence: Value = Expected Reward!

More accurately: Value = Expected cumulative future discounted reward (for this, see later!).

Types of Rules

1) Rescorla-Wagner rule: allows for explaining several types of conditioning experiments.

2) TD-rule (TD-algorithm): allows measuring the value of states and allows accumulating rewards. Thereby it generalizes the Rescorla-Wagner rule.

3) The TD-algorithm can be extended to allow measuring the value of actions and thereby to control behavior, either by way of a) Q-learning or SARSA learning, or with b) Actor-Critic architectures.

Overview of different methods – Reinforcement Learning

[Figure: overview map of learning methods (machine learning, classical conditioning, synaptic plasticity), as on the overview slide at the start, with a marker: "You are here!"]

Rescorla-Wagner Rule

Pavlovian: Train: u→r. Result: u→v=max.

Extinction: Pre-Train: u→r. Train: u→●. Result: u→v=0.

Partial: Train: u→r, u→●. Result: u→v<max.

(● denotes a trial without reward.)

We define: v = ωu, with binary u (u=1 or u=0), and ω → ω + µδu with δ = r − v.

This learning rule minimizes the average squared error between the actual reward r and the prediction v, hence min ⟨(r − v)²⟩.

We realize that δ is the prediction error.

The associability between stimulus u and reward r is represented by the learning rate µ.
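The three paradigms above can be reproduced with a few lines of code. This is a minimal sketch, assuming the rule ω → ω + µδu with δ = r − v and v = ωu, a stimulus that is present (u = 1) on every trial, and a learning rate µ = 0.1 chosen only for illustration. The function name `rescorla_wagner`, the trial counts, and the random seed are hypothetical choices.

```python
import random

def rescorla_wagner(rewards, mu=0.1, w=0.0):
    """Apply w -> w + mu*delta*u with delta = r - v and v = w*u (u = 1 each trial).

    Returns the weight after every trial.
    """
    u = 1.0
    history = []
    for r in rewards:
        v = w * u            # prediction (expected reward)
        delta = r - v        # prediction error
        w += mu * delta * u
        history.append(w)
    return history

rng = random.Random(0)  # fixed seed, only to make the sketch repeatable

# Pavlovian: reward on every trial; the prediction approaches r = 1.
pavlovian = rescorla_wagner([1.0] * 100)

# Extinction: continue from the trained weight, but without reward.
extinction = rescorla_wagner([0.0] * 100, w=pavlovian[-1])

# Partial: reward on a random 50% of the trials.
partial_rewards = [1.0 if rng.random() < 0.5 else 0.0 for _ in range(2000)]
partial = rescorla_wagner(partial_rewards)
partial_mean = sum(partial[-1000:]) / 1000

if __name__ == "__main__":
    print("Pavlovian  final w:", pavlovian[-1])
    print("Extinction final w:", extinction[-1])
    print("Partial    mean  w:", partial_mean)
```

In this sketch the weight approaches the maximum for Pavlovian conditioning, decays back towards zero during extinction, and fluctuates around an intermediate value for partial reinforcement.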

[Figure: reward, expected reward, and prediction error over 120 trials for Pavlovian conditioning followed by Extinction; vertical axis from -1 to 1.]

[Figure: panels for Pavlovian, Extinction, and Partial conditioning.]

Stimulus u is paired with r=1 in 100% of the discrete “epochs” for Pavlovian and in 50% of the cases for Partial.

[Figure: Partial conditioning (50% reward) over 120 trials; vertical axis from -1 to 1.]

Rescorla-Wagner Rule, Vector Form for Multiple Stimuli

We define: v = w·u, and w → w + µδu with δ = r – v, where we minimize the squared prediction error δ².

Blocking: Pre-Train: u1→r. Train: u1+u2→r. Result: u1→v=max, u2→v=0.

For Blocking: The association formed during pre-training leads to δ=0. As ω2 starts at zero, the expected reward v=ω1u1+ω2u2 remains at r. This keeps δ=0 and the new association with u2 cannot be learned.
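Blocking can be checked directly with the vector form of the rule. The sketch below assumes w → w + µδu with δ = r − w·u, a learning rate µ = 0.1, and 200 trials per phase; the helper name `rw_step` and these numbers are illustrative assumptions.

```python
def rw_step(w, u, r, mu=0.1):
    """One vector Rescorla-Wagner update: w -> w + mu*delta*u, delta = r - w.u"""
    v = sum(wi * ui for wi, ui in zip(w, u))
    delta = r - v
    return [wi + mu * delta * ui for wi, ui in zip(w, u)]

w = [0.0, 0.0]

# Pre-training: only stimulus u1 is shown, together with the reward.
for _ in range(200):
    w = rw_step(w, [1.0, 0.0], 1.0)
w_after_pretraining = list(w)

# Training: u1 and u2 are shown together with the reward.
for _ in range(200):
    w = rw_step(w, [1.0, 1.0], 1.0)

if __name__ == "__main__":
    print("after pre-training:", w_after_pretraining)
    print("after training:    ", w)
```

Because u1 already predicts the reward after pre-training, δ is close to zero during training and the weight of u2 stays close to zero.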

Rescorla-Wagner Rule, Vector Form for Multiple Stimuli

Inhibitory: Train: u1+u2→●, u1→r. Result: u1→v=max, u2→v<0.

Inhibitory Conditioning: One stimulus is presented together with the reward, alternating with presentations of a pair of stimuli where the reward is missing. In this case the second stimulus actually predicts the ABSENCE of the reward (negative v). Trials in which the first stimulus is presented together with the reward lead to ω1>0. In trials where both stimuli are present the net prediction will be v=ω1u1+ω2u2 = 0. As u1,2=1 (or zero) and ω1>0, we get ω2<0 and, consequently, v(u2)<0.
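The same vector rule also yields the negative weight for the second stimulus. This is a minimal sketch assuming strictly alternating trial types, µ = 0.1, and 500 repetitions; the helper `rw_step` and these values are hypothetical.

```python
def rw_step(w, u, r, mu=0.1):
    """One vector Rescorla-Wagner update: w -> w + mu*delta*u, delta = r - w.u"""
    v = sum(wi * ui for wi, ui in zip(w, u))
    delta = r - v
    return [wi + mu * delta * ui for wi, ui in zip(w, u)]

w = [0.0, 0.0]
for _ in range(500):
    w = rw_step(w, [1.0, 0.0], 1.0)   # u1 alone, reward present
    w = rw_step(w, [1.0, 1.0], 0.0)   # u1 + u2, reward missing

v_u2_alone = w[1] * 1.0               # prediction when only u2 is shown

if __name__ == "__main__":
    print("w1, w2:", w)
    print("v(u2): ", v_u2_alone)
```

In this sketch ω1 settles near +1 and ω2 near −1, so the pair predicts no reward and u2 alone carries a negative prediction.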

Rescorla-Wagner Rule, Vector Form for Multiple Stimuli

Overshadow: Train: u1+u2→r. Result: u1→v<max, u2→v<max.

Overshadowing: Always presenting two stimuli together with the reward leads to a “sharing” of the reward prediction between them. We get v= ω1u1+ω2u2 = r. Using different learning rates µ for the two stimuli makes ω1 and ω2 grow at different rates, which represents the often observed difference in salience between the two stimuli.
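The sharing of the prediction can be illustrated by giving each stimulus its own learning rate. The sketch below is an assumption-laden example: the per-stimulus rates 0.2 and 0.05, the trial count, and the helper name `rw_step_salience` are chosen only for illustration.

```python
def rw_step_salience(w, u, r, mus):
    """Rescorla-Wagner update with one learning rate per stimulus."""
    v = sum(wi * ui for wi, ui in zip(w, u))
    delta = r - v
    return [wi + mu_i * delta * ui for wi, ui, mu_i in zip(w, u, mus)]

w = [0.0, 0.0]
for _ in range(100):
    # Both stimuli are always shown together with the reward.
    w = rw_step_salience(w, [1.0, 1.0], 1.0, mus=[0.2, 0.05])

if __name__ == "__main__":
    print("w1, w2:", w, "sum:", w[0] + w[1])
```

Here the two weights add up to the reward, but the more salient stimulus (larger learning rate) takes the larger share, in the ratio of the two learning rates.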

Rescorla-Wagner Rule, Vector Form for Multiple Stimuli

Secondary: Pre-Train: u1→r. Train: u2→u1. Result: u2→v=max.

Secondary Conditioning reflects the “replacement” of one stimulus by a new one for the prediction of a reward. As we have seen, the Rescorla-Wagner Rule is very simple but still able to represent many of the basic findings of diverse conditioning experiments. Secondary conditioning, however, CANNOT be captured.

Predicting Future Reward

The Rescorla-Wagner Rule cannot deal with the sequential order of stimuli (required to deal with Secondary Conditioning). As a consequence, it treats this case similarly to Inhibitory Conditioning, which leads to a negative ω2.

Animals can, to some degree, predict such sequences and form the correct associations. For this we need algorithms that keep track of time. Here we do this by way of states that are subsequently visited and evaluated.

Prediction and Control

The goal of RL is two-fold:

1) To predict the value of states (exploring the state space following a policy): the Prediction Problem.

2) To change the policy towards finding the optimal policy: the Control Problem.

Terminology (again):

• State

• Action

• Reward

• Value

• Policy

Markov Decision Problems (MDPs)

[Figure: a tree of states numbered 1 to 16, connected by actions (a1, a2, ..., a14, a15) that yield rewards (r1, r2, ...) and ending in terminal states.]

If the future of the system depends always only on the current state and action then the system is said to be “Markovian”.

What does an RL-agent do?

An RL-agent explores the state space trying to accumulate as much reward as possible. It follows a behavioral policy performing actions (which usually will lead the agent from one state to the next).

For the Prediction Problem: It updates the value of each given state by assessing how much future (!) reward can be obtained when moving onwards from this state (State Space). It does not change the policy; rather, it evaluates it (Policy Evaluation).

For the Control Problem: It updates the value of each given action at a given state by assessing how much future reward can be obtained when performing this action at that state and all following actions at the following states moving onwards (State-Action Space, which is larger than the State Space).

Guess: Will we have to evaluate ALL states and actions onwards?

Policy: p(N) = 0.5, p(S) = 0.125, p(W) = 0.25, p(E) = 0.125

[Figure: grid world with possible start locations (x) and rewarded locations (R, reward R=1). Initially the value is 0.0 everywhere. Policy evaluation then gives the values of the states (values shown include 0.9, 0.8, and 0.1, etc.).]

What does an RL-agent do?

Exploration – Exploitation Dilemma: The agent wants to get as much cumulative reward (also often called return) as possible. For this it should always perform the most rewarding action, “exploiting” its (learned) knowledge of the state space. This way it might, however, miss an action which leads (a bit further on) to a much more rewarding path. Hence the agent must also “explore” into unknown parts of the state space. The agent must, thus, balance its policy to include exploitation and exploration.

Policies

1) Greedy Policy: The agent always exploits and selects the most rewarding action. This is sub-optimal as the agent never finds better new paths.

Policies

2) ε-Greedy Policy: With a small probability ε the agent will choose a non-optimal action. (*) All non-optimal actions are chosen with equal probability. This can take very long as it is not known how big ε should be. One can also “anneal” the system by gradually lowering ε to become more and more greedy.

3) Softmax Policy: ε-greedy can be problematic because of (*). Softmax ranks the actions according to their values and chooses roughly following the ranking, using for example:

$$P(a) = \frac{\exp(Q_a / T)}{\sum_{b=1}^{n} \exp(Q_b / T)}$$

where $Q_a$ is the value of the action a currently being evaluated and T is a temperature parameter. For large T all actions have approximately equal probability of being selected.
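The two stochastic policies can be compared by writing out their selection probabilities. This is a minimal sketch: the function names `epsilon_greedy_probs` and `softmax_probs` and the example action values are hypothetical, the ε-greedy variant follows the description above (probability 1 − ε for the best action, the remainder shared equally among all other actions), and at least two actions are assumed.

```python
import math

def epsilon_greedy_probs(q_values, epsilon):
    """Best action with probability 1 - epsilon, each other action with epsilon / (n - 1).

    Assumes at least two actions.
    """
    n = len(q_values)
    best = max(range(n), key=lambda a: q_values[a])
    return [1.0 - epsilon if a == best else epsilon / (n - 1) for a in range(n)]

def softmax_probs(q_values, temperature):
    """P(a) = exp(Q_a / T) / sum_b exp(Q_b / T)."""
    q_max = max(q_values)  # subtracting the maximum avoids overflow; the result is unchanged
    weights = [math.exp((q - q_max) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]

q = [1.0, 3.0, 2.0]                      # illustrative action values
eg = epsilon_greedy_probs(q, 0.1)
sm_cold = softmax_probs(q, 0.5)          # low temperature: close to greedy
sm_hot = softmax_probs(q, 1000.0)        # high temperature: close to uniform

if __name__ == "__main__":
    print("epsilon-greedy:", eg)
    print("softmax T=0.5: ", sm_cold)
    print("softmax T=1000:", sm_hot)
```

Unlike ε-greedy, softmax gives the second-best action a higher probability than the worst one, and the temperature T controls how strongly the ranking is followed.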

Overview of different methods – Reinforcement Learning

[Figure: overview map of learning methods (machine learning, classical conditioning, synaptic plasticity), as on the overview slide at the start, with a marker: "You are here!"]

Temporal Difference (TD) Learning

Back to the question: To get the value of a given state, will we have to evaluate ALL states and actions onwards?

There is no unique answer to this! Different methods exist which assign the value of a state by using different numbers of (weighted) values of subsequent states. We will discuss a few but concentrate on the most commonly used TD-algorithm(s).

Towards TD-learning – Pictorial View

In the following slides we will treat “Policy evaluation”: We define some given policy and want to evaluate the state space. For the moment we are still not interested in evaluating actions or in improving policies.

Formalising RL: Policy Evaluation with the goal of finding the optimal value function of the state space

We consider a sequence $s_t, r_{t+1}, s_{t+1}, r_{t+2}, \ldots, r_T, s_T$. Note that rewards occur downstream (in the future) from a visited state. Thus, $r_{t+1}$ is the next future reward which can be reached starting from state $s_t$. The complete return $R_t$ to be expected in the future from state $s_t$ is, thus, given by:

$$R_t = r_{t+1} + \gamma\, r_{t+2} + \gamma^2 r_{t+3} + \ldots + \gamma^{T-t-1} r_T = \sum_{k=0}^{T-t-1} \gamma^k r_{t+k+1}$$

where γ≤1 is a discount factor. This accounts for the fact that rewards in the far future should be valued less. Reinforcement learning assumes that the value of a state V(s) is directly equivalent to the expected return at this state:

$$V(s) = E_\pi\{R_t \mid s_t = s\}$$

where π denotes the (here unspecified) action policy to be followed.

Thus, the value of state $s_t$ can be iteratively updated with:

$$V(s_t) \to V(s_t) + \alpha\,[R_t - V(s_t)]$$

We use α as a step-size parameter, which is not of great importance here, though, and can be held constant. Note that if $V(s_t)$ correctly predicts the expected complete return $R_t$, the update will be zero and we have found the final value. This method is called constant-α Monte Carlo update. It requires waiting until a sequence has reached its terminal state (see some slides before!) before the update can commence. For long sequences this may be problematic. Thus, one should try to use an incremental procedure instead. We define a different update rule with:

$$V(s_t) \to V(s_t) + \alpha\,[r_{t+1} + \gamma V(s_{t+1}) - V(s_t)]$$

The bracketed term contains the difference between the values of two temporally successive states. This is why it is called TD (temporal difference) learning.

The elegant trick is to assume that, if the process converges, the value of the next state $V(s_{t+1})$ should be an accurate estimate of the expected return downstream of this state (i.e., downstream of $s_{t+1}$). Thus, we would hope that the following holds:

$$R_t = r_{t+1} + \gamma V(s_{t+1})$$

Indeed, proofs exist that under certain boundary conditions this procedure, known as TD(0), converges to the optimal value function for all states.
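The difference between waiting for the complete return and updating incrementally can be sketched on a toy problem. The example below assumes a hypothetical deterministic chain of four states that is always traversed from left to right, with a single reward of 1 on the last step, and uses γ = 0.9 and α = 0.1 for illustration. The Monte Carlo update moves V(s_t) towards the full discounted return R_t, whereas TD(0) moves it towards r_{t+1} + γV(s_{t+1}). All function names are illustrative.

```python
def discounted_return(rewards, gamma):
    """R_t = r_{t+1} + gamma * r_{t+2} + gamma^2 * r_{t+3} + ..."""
    ret = 0.0
    for r in reversed(rewards):
        ret = r + gamma * ret
    return ret

def run_episode(n_states):
    """Hypothetical chain: visit states 0..n-1 in order, reward 1 on the last step."""
    states = list(range(n_states))
    rewards = [0.0] * (n_states - 1) + [1.0]   # rewards[t] plays the role of r_{t+1}
    return states, rewards

def mc_update(V, states, rewards, alpha, gamma):
    """Constant-alpha Monte Carlo: needs the finished episode."""
    for t, s in enumerate(states):
        R_t = discounted_return(rewards[t:], gamma)
        V[s] += alpha * (R_t - V[s])

def td0_update(V, states, rewards, alpha, gamma):
    """TD(0): uses only the next reward and the value of the next state."""
    for t, s in enumerate(states):
        v_next = V[states[t + 1]] if t + 1 < len(states) else 0.0  # terminal value is 0
        V[s] += alpha * (rewards[t] + gamma * v_next - V[s])

def evaluate(update, n_states=4, episodes=2000, alpha=0.1, gamma=0.9):
    V = [0.0] * n_states
    for _ in range(episodes):
        states, rewards = run_episode(n_states)
        update(V, states, rewards, alpha, gamma)
    return V

V_mc = evaluate(mc_update)
V_td = evaluate(td0_update)

if __name__ == "__main__":
    print("Monte Carlo:", V_mc)
    print("TD(0):      ", V_td)
```

On this chain both procedures settle on the same values, namely the discounted distance to the reward (γ³, γ², γ, 1), but TD(0) never needs to wait for the end of the sequence before it updates.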

Reinforcement Learning – Relations to Brain Function I

[Figure: overview map of learning methods (machine learning, classical conditioning, synaptic plasticity), as on the overview slide at the start, with a marker: "You are here!"]

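How to implement TD in a Neuronal Way

We had defined: (first lecture!)

[Figure: neuronal circuit with input u1 (x1), trace E, weight ω, summation Σ, output v and its derivative v', reward r, and error signal δ.]

Now we have:

$$w_i \to w_i + \mu\,[r(t+1) + \gamma\, v(t+1) - v(t)]\,\bar{u}(t)$$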

How to implement TD in a Neuronal Way

[Figure: network diagram. Labels: inputs X0, X1, …, Xn, delays (n−i)τ, output v(t), derivative v′, reward, δ.]

Serial-compound representations X1, …, Xn for defining an eligibility trace.

[Figure: circuit diagram. Labels: inputs X0 and X1, weight w, reward r, delays τ, v(t), v(t−τ), δ.]

Note: v(t+1) − v(t) is acausal (it needs the future!). Make it "causal" by using delays.
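A minimal sketch of the "causal" version of the error signal: instead of v(t+1) − v(t), it uses the present value and a delayed copy, v(t) − v(t−τ). The function name `causal_td_error`, the choice γ = 1, the treatment of values before the start of the record as zero, and the example sequences are all assumptions made for illustration.

```python
def causal_td_error(v, r, delay=1, gamma=1.0):
    """delta[t] = r[t] + gamma * v[t] - v[t - delay], using only present and past values."""
    deltas = []
    for t in range(len(v)):
        v_past = v[t - delay] if t >= delay else 0.0   # nothing known before the start
        deltas.append(r[t] + gamma * v[t] - v_past)
    return deltas


# Hypothetical prediction signal that ramps up and a reward at the last step.
v = [0.0, 0.5, 1.0, 1.0, 0.0]
r = [0.0, 0.0, 0.0, 0.0, 1.0]
deltas = causal_td_error(v, r)
print(deltas)  # [0.0, 0.5, 0.5, 0.0, 0.0]
```

In this example the reward at the last step is fully predicted by the delayed value, so the error there is zero.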


Reinforcement Learning – Relations to Brain Function II

[Overview map of the lecture, repeated from the earlier slides: Machine Learning, Classical Conditioning and Synaptic Plasticity, linking reinforcement learning (evaluative feedback, rewards) with un-supervised learning (non-evaluative feedback, correlations). A "You are here!" marker shows the current topic.]


TD-learning & Brain Function

DA-responses in the basal ganglia: pars compacta of the substantia nigra and the medially adjoining ventral tegmental area (VTA).

[Figure: DA-neuron recordings in three conditions (time scale 1.0 s).
Novelty response: no prediction, reward occurs (no CS, r).
After learning: predicted reward occurs (CS, r).
After learning: predicted reward does not occur (CS).]

This neuron is supposed to represent the δ-error of TD-learning, which has moved forward as expected.

Omission of reward leads to inhibition, as also predicted by the TD-rule.
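The three conditions can be reproduced qualitatively with a small simulation. The sketch below assumes a serial-compound (delay-line) stimulus representation, γ = 1, a causal TD error δ(t) = r(t) + v(t) − v(t−1), and hypothetical trial timing (`T_CS`, `T_REWARD`) and learning rate; none of these numbers come from the recordings.

```python
def run_trial(w, t_cs, t_reward, n_steps, reward=1.0, mu=0.3, learn=True):
    """Run one trial and return the list of TD errors delta(t).

    w[i] is the weight of the delay-line unit that is active i steps after the CS.
    """
    def v(t):
        i = t - t_cs                       # index of the currently active unit
        return w[i] if 0 <= i < len(w) else 0.0

    deltas = [0.0] * n_steps
    for t in range(1, n_steps):
        r = reward if t == t_reward else 0.0
        deltas[t] = r + v(t) - v(t - 1)    # causal TD error with gamma = 1
        i_prev = t - 1 - t_cs              # unit that was active one step ago
        if learn and 0 <= i_prev < len(w):
            w[i_prev] += mu * deltas[t]
    return deltas


N_STEPS, T_CS, T_REWARD = 20, 5, 15        # hypothetical trial layout
w = [0.0] * (N_STEPS - T_CS)

novelty = run_trial(w, T_CS, T_REWARD, N_STEPS, learn=False)   # before learning
for _ in range(300):                                           # training trials
    run_trial(w, T_CS, T_REWARD, N_STEPS)
learned = run_trial(w, T_CS, T_REWARD, N_STEPS, learn=False)   # reward delivered
omitted = run_trial(w, T_CS, T_REWARD, N_STEPS, reward=0.0, learn=False)

print("novelty:  delta at reward =", round(novelty[T_REWARD], 3))
print("learned:  delta at CS =", round(learned[T_CS], 3),
      " delta at reward =", round(learned[T_REWARD], 3))
print("omission: delta at expected reward =", round(omitted[T_REWARD], 3))
```

Before learning the error appears at the reward; after learning it appears at the CS and vanishes at the reward; if the reward is then omitted, the error becomes negative at the time the reward was expected.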


TD-learning & Brain Function

[Figure: single-neuron recording labelled "Reward Expectation", with Tr and r marked; time scale 1.5 s.]

This neuron is supposed to represent the reward expectation signal v. It has extended forward (almost) to the CS (here called Tr), as expected from the TD-rule. Such neurons are found in the striatum, orbitofrontal cortex and amygdala.

[Figure: "Reward Expectation (Population Response)", with Tr and r marked; time scale 1.0 s.]

This is even more clearly visible in the population response of 68 striatal neurons.


Reinforcement Learning – The Control Problem

So far we have concentrated on evaluating an unchanging policy. Now comes the question of how to actually improve a policy π in order to find the optimal policy.

Abbreviation for policy: π

We will discuss:
1) Actor-Critic Architectures

But not:
2) SARSA Learning
3) Q-Learning


Reinforcement Learning – Control Problem I

[Overview map of the lecture, repeated from the earlier slides: Machine Learning, Classical Conditioning and Synaptic Plasticity, linking reinforcement learning (evaluative feedback, rewards) with un-supervised learning (non-evaluative feedback, correlations). A "You are here!" marker shows the current topic.]


Control Loops

[Figure: block diagram of a feedback loop. Labels: Controller, Controlled System, Control Signals, Feedback, Disturbances, Set-Point, X0.]

A basic feedback-loop controller (Reflex) as in the slide before.
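For readers who prefer code to block diagrams, a feedback loop of this kind can be sketched as below. The proportional control law, the additive disturbance, the gain value and the name `run_feedback_loop` are all assumptions chosen for illustration; the slide itself does not specify the controller.

```python
def run_feedback_loop(set_point, disturbance, gain=0.5, steps=50):
    """Simulate a simple proportional feedback controller (an assumed control law).

    Each step: measure the state (feedback), compare it with the set-point,
    send a control signal proportional to the error, and let the disturbance act.
    """
    state = 0.0
    history = []
    for _ in range(steps):
        error = set_point - state              # feedback compared with the set-point
        control = gain * error                 # control signal
        state = state + control + disturbance  # controlled system plus disturbance
        history.append(state)
    return history


undisturbed = run_feedback_loop(set_point=1.0, disturbance=0.0)
disturbed = run_feedback_loop(set_point=1.0, disturbance=0.1)
print(round(undisturbed[-1], 6), round(disturbed[-1], 6))
```

Without a disturbance the state settles at the set-point; with a constant disturbance this purely reactive controller settles at an offset of disturbance/gain, since it can only respond after an error has already occurred.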


Control Loops

[Figure: block diagram of an actor-critic loop. Labels: Actor (Controller), Environment (Controlled System), Critic, Context, Actions (Control Signals), Reinforcement Signal, Feedback, Disturbances, X0.]

An Actor-Critic Architecture: The Critic produces evaluative reinforcement feedback for the Actor by observing the consequences of its actions. The Critic's output takes the form of a TD-error, which indicates whether things have gone better or worse than expected with the preceding action. Thus, this TD-error can be used to evaluate the preceding action: if the error is positive, the tendency to select this action should be strengthened; otherwise it should be weakened.
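Written out per time step, the evaluation rule reads as follows. The state and action notation (s_t, a_t) and the explicit discount factor γ follow the common textbook convention and are assumptions here, not taken from this slide.

```latex
\delta_t = r_{t+1} + \gamma\, v(s_{t+1}) - v(s_t),
\qquad
\begin{cases}
\delta_t > 0: & \text{strengthen the tendency to select } a_t \text{ in } s_t,\\
\delta_t < 0: & \text{weaken it.}
\end{cases}
```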


Example of an Actor-Critic Procedure

Action selection here follows the Gibbs softmax method:

$$\pi(s,a) = \frac{e^{p(s,a)}}{\sum_b e^{p(s,b)}}$$

where p(s,a) are the values of the modifiable (by the Critic!) policy parameters of the Actor, indicating the tendency to select action a when being in state s.

We can now modify p for a given state-action pair at time t with:

$$p(s_t,a_t) \leftarrow p(s_t,a_t) + \beta\,\delta_t$$

where δt is the δ-error of the TD-Critic and β is a positive step-size parameter.
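A minimal sketch of this procedure on a deliberately tiny problem: a single state with two actions and one-step episodes, so the Critic's error reduces to δ = r − v. The task, the reward values, the step sizes `alpha` and `beta`, and the function names are hypothetical choices made for illustration.

```python
import math
import random


def softmax_policy(prefs):
    """Gibbs softmax: pi(a) = exp(p(a)) / sum_b exp(p(b)), computed stably."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def actor_critic_bandit(rewards, episodes=500, alpha=0.1, beta=0.5, seed=0):
    """Actor-critic on a one-state task with one-step episodes (an assumed toy task)."""
    rng = random.Random(seed)
    prefs = [0.0] * len(rewards)   # Actor: preferences p(s, a) for the single state s
    v = 0.0                        # Critic: value estimate of that state
    for _ in range(episodes):
        pi = softmax_policy(prefs)
        a = rng.choices(range(len(rewards)), weights=pi)[0]
        delta = rewards[a] - v     # TD error; the episode ends, so v(next) = 0
        v += alpha * delta         # Critic update
        prefs[a] += beta * delta   # Actor update: p(s,a) <- p(s,a) + beta * delta
    return prefs, v


prefs, v = actor_critic_bandit([1.0, 0.0])   # action 0 is rewarded, action 1 is not
policy = softmax_policy(prefs)
print([round(p, 3) for p in policy])
```

Because the rewarded action always yields a non-negative error and the unrewarded one a non-positive error, the preferences drift apart and the softmax policy comes to favour the rewarded action.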


Reinforcement Learning – Control I & Brain Function III

[Overview map of the lecture, repeated from the earlier slides: Machine Learning, Classical Conditioning and Synaptic Plasticity, linking reinforcement learning (evaluative feedback, rewards) with un-supervised learning (non-evaluative feedback, correlations). A "You are here!" marker shows the current topic.]


Actor-Critics and the Basal Ganglia

[Figure: basal ganglia circuit diagram. Labels: Cortex (C), Frontal Cortex, Striatum (S), GPe, STN, VP / SNr / GPi, Thalamus, DA-System (SNc, VTA, RRA).]

VP=ventral pallidum, SNr=substantia nigra pars reticulata, SNc=substantia nigra pars compacta, GPi=globus pallidus pars interna, GPe=globus pallidus pars externa, VTA=ventral tegmental area, RRA=retrorubral area, STN=subthalamic nucleus.

The basal ganglia are a brain structure involved in motor control. It has been suggested that they learn by way of an Actor-Critic mechanism.


Actor-Critics and the Basal Ganglia: The Critic

So-called striosomal modules of the striatum (S) fulfill the functions of the adaptive Critic. The prediction-error (δ) characteristics of the DA-neurons of the Critic are generated by:
1) Equating the reward r with excitatory input from the lateral hypothalamus.
2) Equating the term v(t) with indirect excitation at the DA-neurons, which is initiated from striatal striosomes and channelled through the subthalamic nucleus onto the DA-neurons.
3) Equating the term v(t−1) with direct, long-lasting inhibition from striatal striosomes onto the DA-neurons.

There are many problems with this simplistic view though: timing, mismatch to anatomy, etc.

[Figure: Critic circuit. Labels: C, S, STN, DA, LH, r, v(t), v(t−1), δ, with excitatory (+) and inhibitory (−) connections. C=cortex, S=striatum, STN=subthalamic nucleus, DA=dopamine system, r=reward.]

[Figure: medium-sized spiny projection neuron in the striatum ("post") with cortico-striatal input ("pre", Glu) and nigro-striatal input ("DA").]
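Read as arithmetic, the three identifications above say that the DA-neuron adds two excitatory inputs and subtracts one inhibitory input. The sketch below only restates that bookkeeping; the function and argument names are hypothetical, and the input values are idealised 0/1 signals rather than measured firing rates.

```python
def dopamine_prediction_error(lh_reward, stn_excitation, striosome_inhibition):
    """delta = r + v(t) - v(t-1) under the simplistic Critic mapping.

    lh_reward            -- r: excitation from the lateral hypothalamus
    stn_excitation       -- v(t): indirect excitation via the subthalamic nucleus
    striosome_inhibition -- v(t-1): direct, long-lasting striosomal inhibition
    """
    return lh_reward + stn_excitation - striosome_inhibition


# Idealised cases at the time of the (expected) reward:
unexpected_reward = dopamine_prediction_error(1.0, 0.0, 0.0)   # no prediction yet
predicted_reward = dopamine_prediction_error(1.0, 0.0, 1.0)    # prediction cancels reward
omitted_reward = dopamine_prediction_error(0.0, 0.0, 1.0)      # prediction, but no reward
print(unexpected_reward, predicted_reward, omitted_reward)  # 1.0 0.0 -1.0
```

The three cases correspond to excitation, no response and inhibition of the DA-neuron, respectively.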


Literature (all of this is very mathematical!)

General Theoretical Neuroscience:
"Theoretical Neuroscience", P. Dayan and L. Abbott, MIT Press (there used to be a version of this on the internet)
"Spiking Neuron Models", W. Gerstner & W.M. Kistler, Cambridge University Press (there is a version on the internet)

Neural Coding Issues:
"Spikes", F. Rieke, D. Warland, R. de Ruyter v. Steveninck, W. Bialek, MIT Press

Artificial Neural Networks:
"Konnektionismus", G. Dorffner, B.G. Teubner Verlag, Stuttgart
"Fundamentals of Artificial Neural Networks", M.H. Hassoun, MIT Press

Hodgkin Huxley Model: see above, "Spiking Neuron Models", W. Gerstner & W.M. Kistler, Cambridge University Press.

Learning and Plasticity: see above, "Spiking Neuron Models", W. Gerstner & W.M. Kistler, Cambridge University Press.

Calculating with Neurons: compiled from many different sources.

Maps: compiled from many different sources.
