
Connectionist Models: Backprop

Jerome Feldman, CS182/CogSci110/Ling109, Spring 2008

Hebb’s rule is not sufficient

What happens if the neural circuit fires perfectly, but the result is very bad for the animal, like eating something sickening? A pure invocation of Hebb’s rule would strengthen all participating connections, which can’t be good. On the other hand, it isn’t right to weaken all the active connections involved; much of the activity was just recognizing the situation. We would like to change only those connections that led to the wrong decision.

No one knows how to specify a learning rule that will change exactly the offending connections when an error occurs. Computer systems, and presumably nature as well, rely upon statistical learning rules that tend to make the right changes over time. More in later lectures.

Hebb’s rule is insufficient

Should you “punish” all the connections?

[Diagram: a circuit linking “tastebud”, “tastes rotten”, “eats food”, “gets sick”, and “drinks water”.]

Models of Learning

Hebbian – coincidence
Supervised – correction (backprop)
Recruitment – one-trial
Reinforcement learning – delayed reward
Unsupervised – similarity

Abstract Neuron

[Diagram: a linear threshold unit with inputs $i_1, i_2, \ldots, i_n$, a bias input $i_0 = 1$, weights $w_0, w_1, \ldots, w_n$, and output $y$.]

$$net = \sum_{j=0}^{n} w_j\, i_j$$

Threshold activation function:

$$y = \begin{cases} 1 & \text{if } net > 0 \\ 0 & \text{otherwise} \end{cases}$$
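To make this concrete, here is a minimal Python sketch of such a threshold unit (not from the slides); the weights and bias below are assumed values chosen so the unit computes logical AND.

```python
# A minimal sketch of a linear threshold unit (weights are assumed, for illustration).
def threshold_unit(inputs, weights, bias_weight):
    """y = 1 if w0*1 + sum_j w_j * i_j > 0, else 0."""
    net = bias_weight + sum(w * x for w, x in zip(weights, inputs))
    return 1 if net > 0 else 0

# Example: weights chosen so the unit computes logical AND of its two inputs.
AND_WEIGHTS = [1.0, 1.0]
AND_BIAS = -1.5          # fires only when both inputs are 1
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, threshold_unit([x1, x2], AND_WEIGHTS, AND_BIAS))
```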

Boolean XOR

input x1   input x2   output
0          0          0
0          1          1
1          0          1
1          1          0

[Diagram: a two-layer threshold network computing XOR. Inputs x1 and x2 feed hidden unit h1 (OR: weights 1, 1; threshold 0.5) and hidden unit h2 (AND: weights 1, 1; threshold 1.5); the output unit o (threshold 0.5) fires when h1 is active and h2 is not.]
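A small Python sketch of this construction (the output unit’s weights, +1 from h1 and -1 from h2, are an assumed standard choice; only its 0.5 threshold is legible on the slide):

```python
# Two-layer threshold network for XOR (output-unit weights are assumed, one standard choice).
def step(net, threshold):
    return 1 if net > threshold else 0

def xor_net(x1, x2):
    h1 = step(1.0 * x1 + 1.0 * x2, 0.5)    # OR unit
    h2 = step(1.0 * x1 + 1.0 * x2, 1.5)    # AND unit
    return step(1.0 * h1 - 1.0 * h2, 0.5)  # fires when h1 is active and h2 is not

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xor_net(x1, x2))
```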

Supervised Learning - Backprop

How do we train the weights of the network? Basic concepts:

Use a continuous, differentiable activation function (sigmoid)

Use the idea of gradient descent on the error surface

Extend to multiple layers

Backprop

To learn on data which is not linearly separable:

Build multiple-layer networks (hidden layer)

Use a sigmoid squashing function instead of a step function.

Tasks

Unconstrained pattern classification: credit assessment, digit classification, speech recognition

Function approximation: learning control, stock prediction

Sigmoid Squashing Function

[Diagram: a sigmoid unit with inputs $y_1, y_2, \ldots, y_n$, a bias input $y_0 = 1$, weights $w_0, w_1, \ldots, w_n$, and a single output.]

$$net = \sum_{j=0}^{n} w_j\, y_j$$

$$y = \frac{1}{1 + e^{-net}}$$
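As a concrete sketch (not from the slides), a sigmoid unit in Python; the example weights are the same 0.8, 0.6, 0.5 used in the worked example later in the deck.

```python
import math

def sigmoid(net):
    """Squashing function y = 1 / (1 + e^-net)."""
    return 1.0 / (1.0 + math.exp(-net))

def sigmoid_unit(inputs, weights, bias_weight):
    net = bias_weight + sum(w * y for w, y in zip(weights, inputs))
    return sigmoid(net)

# The output rises smoothly from 0 toward 1 as net grows.
print(sigmoid_unit([0.0, 0.0], [0.8, 0.6], 0.5))   # ~0.62
print(sigmoid_unit([1.0, 1.0], [0.8, 0.6], 0.5))   # ~0.87
```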

The Sigmoid Function

[Plot: the sigmoid $y = f(net)$, with output near 0 for large negative net and near 1 for large positive net.]

Sensitivity to input

[Plot: the same sigmoid, highlighting the region around $net = 0$ where a small change in the input produces the largest change in the output.]

Gradient Descent

Gradient descent on an error surface

Learning rule – gradient descent on the Root Mean Square (RMS) error

Learn the $w_i$’s that minimize the squared error:

$$E[\vec{w}] = \frac{1}{2} \sum_{k \in O} (t_k - o_k)^2$$

where $O$ is the set of output units.

Gradient Descent

Gradient:

$$\nabla E[\vec{w}] = \left[ \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \ldots, \frac{\partial E}{\partial w_n} \right]$$

Training rule:

$$\Delta \vec{w} = -\eta\, \nabla E[\vec{w}], \qquad \Delta w_i = -\eta\, \frac{\partial E}{\partial w_i}$$

$$E[\vec{w}] = \frac{1}{2} \sum_{k \in O} (t_k - o_k)^2$$

Gradient Descent

[Plot: the error surface over two of the weights, with the global minimum marked: this is your goal. It should really be 4-D (3 weights), but you get the idea.]
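A short Python sketch of gradient descent on this squared error for a single linear unit (the data, initial weights, and learning rate are illustrative assumptions):

```python
# Gradient descent on E = 1/2 * sum_k (t_k - o_k)^2 for a single linear unit (sketch).
eta = 0.1
weights = [0.0, 0.0, 0.0]                     # w0 (bias), w1, w2
data = [([1.0, 0.0, 1.0], 1.0),               # (inputs with a leading bias input of 1, target)
        ([1.0, 1.0, 0.0], 0.0)]

for epoch in range(100):
    grad = [0.0, 0.0, 0.0]
    for x, t in data:
        o = sum(w * xi for w, xi in zip(weights, x))
        for i in range(len(weights)):          # dE/dw_i = -(t - o) * x_i
            grad[i] += -(t - o) * x[i]
    weights = [w - eta * g for w, g in zip(weights, grad)]

print(weights)   # weights that drive the squared error toward its minimum
```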

Backpropagation Algorithm

Generalization to multiple layers and multiple output units

Backprop Details

Here we go…

[Diagram: three layers of units indexed k → j → i, with weights $w_{jk}$ from layer k to layer j and $w_{ij}$ from layer j to layer i; $y_i$ is the output of unit i and $t_i$ its target.]

$$E = \text{Error} = \tfrac{1}{2} \sum_i (t_i - y_i)^2$$

The output layer (with learning rate $\eta$):

$$W_{ij} \leftarrow W_{ij} + \Delta W_{ij}, \qquad \Delta W_{ij} = -\eta\, \frac{\partial E}{\partial W_{ij}}$$

$$\frac{\partial E}{\partial W_{ij}} = \frac{\partial E}{\partial y_i}\,\frac{\partial y_i}{\partial x_i}\,\frac{\partial x_i}{\partial W_{ij}} = -(t_i - y_i)\, f'(x_i)\, y_j$$

Nice Property of Sigmoids

The derivative of the sigmoid is just $y_i (1 - y_i)$, so

$$\Delta W_{ij} = \eta\, (t_i - y_i)\, y_i (1 - y_i)\, y_j = \eta\, \delta_i\, y_j, \qquad \delta_i = (t_i - y_i)\, y_i (1 - y_i)$$
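For completeness (not on the slide), the one-line derivation of that property: for $y = \frac{1}{1 + e^{-x}}$,

$$\frac{dy}{dx} = \frac{e^{-x}}{(1 + e^{-x})^2} = \frac{1}{1 + e^{-x}} \cdot \frac{e^{-x}}{1 + e^{-x}} = y\,(1 - y).$$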

[Diagram repeated: layers k → j → i with weights $w_{jk}$ and $w_{ij}$; $E = \text{Error} = \tfrac{1}{2} \sum_i (t_i - y_i)^2$, $t_i$: target.]

The hidden layer:

$$\Delta W_{jk} = -\eta\, \frac{\partial E}{\partial W_{jk}}, \qquad \frac{\partial E}{\partial W_{jk}} = \frac{\partial E}{\partial y_j}\,\frac{\partial y_j}{\partial x_j}\,\frac{\partial x_j}{\partial W_{jk}}$$

$$\frac{\partial E}{\partial y_j} = -\sum_i (t_i - y_i)\, f'(x_i)\, W_{ij}$$

$$\Delta W_{jk} = \eta \left[ \sum_i (t_i - y_i)\, f'(x_i)\, W_{ij} \right] f'(x_j)\, y_k = \eta \left[ \sum_i (t_i - y_i)\, y_i (1 - y_i)\, W_{ij} \right] y_j (1 - y_j)\, y_k$$

$$\Delta W_{jk} = \eta\, \delta_j\, y_k, \qquad \delta_j = \left[ \sum_i \delta_i\, W_{ij} \right] y_j (1 - y_j)$$

Let’s just do an example

[Diagram: a single sigmoid unit $f$ with output $y_0$, inputs $i_1$ and $i_2$ with weights $w_{01}$ and $w_{02}$, and a bias input $b = 1$ with weight $w_{0b}$.]

$$E = \tfrac{1}{2}(t_0 - y_0)^2$$

Training data:

i1   i2   y0
0    0    0
0    1    1
1    0    1
1    1    1

Suppose $w_{01} = 0.8$, $w_{02} = 0.6$, $w_{0b} = 0.5$, and the current input is $i_1 = 0$, $i_2 = 0$ with target $t_0 = 0$.

Forward pass: $net = 0 \cdot 0.8 + 0 \cdot 0.6 + 1 \cdot 0.5 = 0.5$, so $y_0 = 1/(1 + e^{-0.5}) = 0.6224$.

$$E = \tfrac{1}{2}(0 - 0.6224)^2 = 0.1937$$

Output-layer rule: $\Delta W_{ij} = \eta\, \delta_i\, y_j$ with $\delta_i = (t_i - y_i)\, y_i (1 - y_i)$, so

$$\delta_0 = (t_0 - y_0)\, y_0 (1 - y_0) = (0 - 0.6224)(0.6224)(1 - 0.6224) = -0.1463$$

$$\Delta W_{01} = \eta\, \delta_0\, i_1, \qquad \Delta W_{02} = \eta\, \delta_0\, i_2, \qquad \Delta W_{0b} = \eta\, \delta_0\, b$$

Suppose the learning rate $\eta = 0.5$: then $\Delta W_{0b} = 0.5 \times (-0.1463) = -0.0731$, so the new $W_{0b} = 0.5 - 0.0731 \approx 0.4268$ (and $\Delta W_{01} = \Delta W_{02} = 0$ since $i_1 = i_2 = 0$).
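A tiny Python check of the arithmetic above (the printed values should match the slide’s numbers up to rounding):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Worked example: single sigmoid unit with weights w01, w02 and bias weight w0b.
w01, w02, w0b = 0.8, 0.6, 0.5
i1, i2, b = 0.0, 0.0, 1.0
t0 = 0.0                      # target for input (0, 0)
eta = 0.5                     # learning rate

y0 = sigmoid(w01 * i1 + w02 * i2 + w0b * b)   # ~0.6224
E = 0.5 * (t0 - y0) ** 2                      # ~0.1937
delta0 = (t0 - y0) * y0 * (1 - y0)            # ~ -0.1463
w0b_new = w0b + eta * delta0 * b              # ~0.4268-0.4269 depending on rounding

print(y0, E, delta0, w0b_new)
```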

An informal account of BackProp

For each pattern in the training set:

Compute the error at the output nodes

Compute Δw for each weight in the 2nd layer

Compute delta (the generalized error expression) for the hidden units

Compute Δw for each weight in the 1st layer

After amassing Δw for all weights, change each weight a little bit, as determined by the learning rate:

$$\Delta w_{ij} = \eta\, \delta_{pi}\, o_{pj}$$

Backpropagation Algorithm

Initialize all weights to small random numbers. For each training example do:

For each hidden unit h, compute its activation:
$$y_h = \sigma\Big(\sum_i w_{ih}\, x_i\Big)$$

For each output unit k, compute its activation:
$$y_k = \sigma\Big(\sum_h w_{hk}\, y_h\Big)$$

For each output unit k, compute its error term:
$$\delta_k = y_k (1 - y_k)(t_k - y_k)$$

For each hidden unit h, compute its error term:
$$\delta_h = y_h (1 - y_h) \sum_k w_{hk}\, \delta_k$$

Update each network weight $w_{ij}$:
$$w_{ij} \leftarrow w_{ij} + \Delta w_{ij}, \qquad \text{with } \Delta w_{ij} = \eta\, \delta_j\, x_{ij}$$

Backpropagation Algorithm

[Diagram: a layered network in which “activations” flow forward from the inputs to the outputs, and “errors” (the δ terms) propagate backward from the outputs toward the inputs.]
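Putting the whole algorithm together, here is a minimal Python sketch (not from the slides) that trains a 2-2-1 sigmoid network on XOR with the update rules above; the learning rate, initialization range, and epoch count are arbitrary choices, and a run can occasionally stall in a local minimum (see the Convergence slide below).

```python
import math, random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Tiny 2-2-1 network trained on XOR with plain backprop (illustrative sketch).
random.seed(0)
eta = 0.5
w_hidden = [[random.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(2)]  # 2 inputs + bias
w_out = [random.uniform(-0.5, 0.5) for _ in range(3)]                         # 2 hidden + bias
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]

for epoch in range(20000):
    for x, t in data:
        xb = x + [1]                                                # append bias input
        h = [sigmoid(sum(w * xi for w, xi in zip(wh, xb))) for wh in w_hidden]
        hb = h + [1]
        y = sigmoid(sum(w * hi for w, hi in zip(w_out, hb)))
        delta_out = (t - y) * y * (1 - y)                           # output error term
        delta_h = [h[j] * (1 - h[j]) * w_out[j] * delta_out for j in range(2)]
        w_out = [w + eta * delta_out * hi for w, hi in zip(w_out, hb)]
        for j in range(2):
            w_hidden[j] = [w + eta * delta_h[j] * xi for w, xi in zip(w_hidden[j], xb)]

for x, t in data:
    xb = x + [1]
    hb = [sigmoid(sum(w * xi for w, xi in zip(wh, xb))) for wh in w_hidden] + [1]
    print(x, round(sigmoid(sum(w * hi for w, hi in zip(w_out, hb))), 2), "target", t)
```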

Momentum term

The speed of learning is governed by the learning rate. If the rate is low, convergence is slow; if the rate is too high, the error oscillates without reaching the minimum.

Momentum tends to smooth out small fluctuations in the weight updates:

$$\Delta w_{ij}(n) = \alpha\, \Delta w_{ij}(n-1) + \eta\, \delta_j(n)\, y_i(n), \qquad 0 \le \alpha < 1$$

The momentum term accelerates the descent in steady downhill directions, and it has a stabilizing effect in directions that oscillate in time.
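A sketch of the corresponding momentum update for a single weight in Python (the η and α values below are assumptions):

```python
# Momentum update for a single weight (illustrative sketch).
eta, alpha = 0.5, 0.9           # learning rate and momentum coefficient (assumed values)

def momentum_update(w, delta_j, y_i, delta_w_prev):
    """New weight and update using dw(n) = alpha * dw(n-1) + eta * delta_j * y_i."""
    delta_w = alpha * delta_w_prev + eta * delta_j * y_i
    return w + delta_w, delta_w

w, delta_w_prev = 0.5, 0.0
w, delta_w_prev = momentum_update(w, -0.1463, 1.0, delta_w_prev)
print(w, delta_w_prev)          # the first step matches plain backprop; later steps carry momentum
```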

Convergence

May get stuck in local minima

Weights may diverge

…but often works well in practice

Representation power:
2-layer networks: any continuous function
3-layer networks: any function

Pattern Separation and NN architecture

Overfitting and generalization

TOO MANY HIDDEN NODES TEND TO OVERFIT

Stopping criteria

Sensible stopping criteria:

Total mean squared error change: back-prop is considered to have converged when the absolute rate of change in the average squared error per epoch is sufficiently small (in the range [0.01, 0.1]).

Generalization-based criterion: after each epoch, the NN is tested for generalization. If the generalization performance is adequate, then stop. If this stopping criterion is used, the part of the training set used for testing the network’s generalization is not used for updating the weights.
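A minimal sketch of this generalization-based criterion (early stopping); `train_one_epoch` and `validation_error` are hypothetical callables standing in for the training and evaluation code sketched earlier:

```python
# Early stopping sketch: hold out part of the training set for validation and stop
# when the validation error stops improving. The two callables are hypothetical
# placeholders for the backprop training and error-measurement code above.
def train_with_early_stopping(network, train_data, val_data,
                              train_one_epoch, validation_error,
                              patience=10, max_epochs=1000):
    best_error = float("inf")
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch(network, train_data)        # weight updates use train_data only
        err = validation_error(network, val_data)   # val_data never updates the weights
        if err < best_error:
            best_error, epochs_without_improvement = err, 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            break                                   # generalization stopped improving
    return network
```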

Overfitting in ANNs

Summary

Multiple-layer feed-forward networks

Replace the step function with a sigmoid (differentiable) function

Learn weights by gradient descent on the error function

Backpropagation algorithm for learning

Avoid overfitting by early stopping

ALVINN drives 70mph on highways

Use MLP Neural Networks when …

(Vectored) real inputs, (vectored) real outputs

You’re not interested in understanding how it works

Long training times are acceptable

Short execution (prediction) times are required

Robust to noise in the dataset

Applications of FFNN

Classification, pattern recognition: FFNN can be applied to tackle non-linearly separable learning problems.

Recognizing printed or handwritten characters

Face recognition

Classification of loan applications into credit-worthy and non-credit-worthy groups

Analysis of sonar and radar signals to determine the nature of the source of a signal

Regression and forecasting: FFNN can be applied to learn non-linear functions (regression), and in particular functions whose input is a sequence of measurements over time (time series).

Extensions of Backprop Nets

Recurrent architectures

Backprop through time

Elman Nets & Jordan Nets

Updating the context as we receive input

In Jordan nets we model “forgetting” as well

The recurrent connections have fixed weights

You can train these networks using good ol’ backprop

[Diagrams: two simple recurrent networks, each with Output, Hidden, and Context + Input layers. In the Elman net the context is a copy of the hidden layer, fed back through fixed connections of weight 1; in the Jordan net the context is fed from the output and also decays through a fixed self-connection with weight α, which models “forgetting”.]
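A sketch of the Elman-style context update in Python (the layer sizes, weights, and input sequence are illustrative assumptions):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Elman-style forward pass sketch: the context is a copy of the previous hidden
# state, fed back through fixed (weight-1) connections alongside the new input.
def elman_step(x, context, w_hidden, w_out):
    full_input = x + context                      # input units plus context units
    hidden = [sigmoid(sum(w * v for w, v in zip(row, full_input))) for row in w_hidden]
    output = [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w_out]
    return output, hidden                         # the new context is a copy of the hidden state

# Example with assumed sizes: 2 inputs, 2 hidden units, 1 output.
w_hidden = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]   # 2 rows x (2 inputs + 2 context)
w_out = [[0.5, -0.5]]
context = [0.0, 0.0]
for x in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0]):            # a short input sequence
    y, context = elman_step(x, context, w_hidden, w_out)
    print(x, [round(v, 3) for v in y])
```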

Recurrent Backprop

We’ll pretend to step through the network one iteration at a time.

Backprop as usual, but average equivalent weights (e.g. all 3 highlighted edges on the right are equivalent).

[Diagram: a small recurrent network with units a, b, c and weights w1, w2, w3, w4, unrolled for 3 iterations into a feed-forward stack of copies; the copies of the same weight across iterations are the “equivalent” edges whose updates are averaged.]
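A sketch of that “unroll, backprop, then average equivalent weights” idea (the names and the per-copy gradients are illustrative placeholders; this only shows the averaging step, not a full backprop-through-time implementation):

```python
# After unrolling a recurrent net for T steps, each recurrent weight has T copies.
# Collect one gradient per copy and average them into a single update per weight.
def bptt_update(weights, per_copy_gradients, eta=0.1):
    """weights: name -> value; per_copy_gradients: name -> list of gradients (one per copy)."""
    new_weights = {}
    for name, w in weights.items():
        grads = per_copy_gradients[name]           # one gradient per unrolled copy
        avg_grad = sum(grads) / len(grads)         # average the equivalent edges
        new_weights[name] = w - eta * avg_grad
    return new_weights

weights = {"w1": 0.2, "w2": -0.1, "w3": 0.4, "w4": 0.05}
per_copy_gradients = {"w1": [0.3, 0.1, 0.2], "w2": [-0.2, 0.0, 0.1],
                      "w3": [0.05, 0.05, 0.05], "w4": [0.0, 0.1, -0.1]}
print(bptt_update(weights, per_copy_gradients))    # hypothetical numbers, for illustration
```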