Hebb’s rule is not sufficient
What happens if the neural circuit fires perfectly, but the result is very bad for the animal, like eating something sickening? A pure invocation of Hebb’s rule would strengthen all participating
connections, which can’t be good. On the other hand, it isn’t right to weaken all the active connections
involved; much of the activity was just recognizing the situation – we would like to change only those connections that led to the wrong decision.
No one knows how to specify a learning rule that will change exactly the offending connections when an error occurs. Computer systems, and presumably nature as well, rely upon statistical
learning rules that tend to make the right changes over time. More in later lectures.
Hebb’s rule is insufficient
should you “punish” all the connections?
[Figure: chain of events labeled tastebud, tastes rotten, eats food, gets sick; a separate branch is labeled drinks water.]
Models of Learning
Hebbian – coincidence
Supervised – correction (backprop)
Recruitment – one-trial
Reinforcement Learning – delayed reward
Unsupervised – similarity
Abstract Neuron
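[Figure: a unit with inputs i1 ... in plus a constant bias input i0 = 1, weights w0, w1, w2, ..., wn, and a single output y.]
$$net = \sum_{i=0}^{n} w_i \, i_i$$
Threshold Activation Function
$$y = \begin{cases} 1 & \text{if } net > 0 \\ 0 & \text{otherwise} \end{cases}$$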
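A minimal Python sketch of the threshold unit described above. The function name `threshold_unit` and the example weights for AND and OR are illustrative assumptions, not values taken from the slides.

```python
def threshold_unit(weights, inputs):
    """Return 1 if the weighted sum is positive, otherwise 0.

    weights[0] is w0, the weight on the constant bias input i0 = 1;
    weights[1:] pair up with the real inputs i1..in.
    """
    net = weights[0] + sum(w * i for w, i in zip(weights[1:], inputs))
    return 1 if net > 0 else 0


# Assumed example weights: a bias of -1.5 turns the unit into an AND gate,
# a bias of -0.5 turns it into an OR gate.
AND_WEIGHTS = [-1.5, 1.0, 1.0]
OR_WEIGHTS = [-0.5, 1.0, 1.0]

for a in (0, 1):
    for b in (0, 1):
        print(a, b,
              threshold_unit(AND_WEIGHTS, [a, b]),
              threshold_unit(OR_WEIGHTS, [a, b]))
```

Only the bias weight differs between the two gates, which is why the bias is often read as a (negated) firing threshold.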
Boolean XOR
input x1   input x2   output
   0          0         0
   0          1         1
   1          0         1
   1          1         0
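[Figure: a threshold network computing XOR. Inputs x1 and x2 feed hidden units h1 and h2, which feed the output unit o. Unit labels: OR (threshold 0.5), AND (threshold 1.5), XOR (threshold 0.5); connection weights labeled 1.]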
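A small Python sketch of how two hidden threshold units can compute XOR. The thresholds 0.5 and 1.5 come from the figure labels; the weight of -1 from the AND unit to the output is an assumption, since the sign of that connection is not recoverable from the slide text.

```python
def step(net, threshold):
    """Threshold unit: fire when the summed input exceeds the threshold."""
    return 1 if net > threshold else 0


def xor_network(x1, x2):
    h_or = step(x1 + x2, 0.5)    # OR unit: fires if at least one input is on
    h_and = step(x1 + x2, 1.5)   # AND unit: fires only if both inputs are on
    # Assumption: the AND unit inhibits the output with weight -1,
    # so the output fires for "OR but not AND".
    return step(h_or - h_and, 0.5)


for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xor_network(x1, x2))
```

No single threshold unit can produce this truth table, because XOR is not linearly separable; the hidden layer is what makes it possible.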
Supervised Learning - Backprop
How do we train the weights of the network?
Basic Concepts
Use a continuous, differentiable activation function (Sigmoid)
Use the idea of gradient descent on the error surface
Extend to multiple layers
Backprop
To learn on data which is not linearly separable:
Build multiple layer networks (hidden layer)
Use a sigmoid squashing function instead of a step function.
Tasks
Unconstrained pattern classification
  Credit assessment
  Digit classification
  Speech recognition
Function approximation
  Learning control
  Stock prediction
Sigmoid Squashing Function
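[Figure: a unit with inputs y1 ... yn plus a constant bias input y0 = 1, weights w0, w1, w2, ..., wn, and a single output y.]
$$net = \sum_{i=0}^{n} w_i y_i$$
$$y = \frac{1}{1 + e^{-net}}$$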
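A short Python sketch of the sigmoid unit. The helper names (`sigmoid`, `sigmoid_unit`, `sigmoid_slope`) are illustrative; the slope formula y(1 - y) is the sigmoid derivative that the output-layer derivation later in these slides relies on.

```python
import math


def sigmoid(net):
    """Squash any real net input into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-net))


def sigmoid_unit(weights, inputs):
    """weights[0] is w0, the weight on the constant bias input y0 = 1."""
    net = weights[0] + sum(w * y for w, y in zip(weights[1:], inputs))
    return sigmoid(net)


def sigmoid_slope(y):
    """Derivative of the sigmoid, written in terms of its own output y."""
    return y * (1.0 - y)


print(sigmoid(0.0))                      # midpoint of the curve
print(sigmoid_unit([0.5, 0.8, 0.6], [0, 0]))
```

Unlike the step function, this unit is differentiable everywhere, which is what gradient descent needs.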
Learning Rule – Gradient Descent on a Root Mean Square (RMS) Error
Learn wi’s that minimize squared error
$$E[\vec{w}] = \frac{1}{2} \sum_{k \in O} (t_k - o_k)^2$$
O = output layer
Gradient Descent
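Gradient:
$$\nabla E[\vec{w}] = \left[ \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \ldots, \frac{\partial E}{\partial w_n} \right]$$
Training rule:
$$\Delta \vec{w} = -\eta \, \nabla E[\vec{w}]$$
$$\Delta w_i = -\eta \, \frac{\partial E}{\partial w_i}$$
$$E[\vec{w}] = \frac{1}{2} \sum_{k \in O} (t_k - o_k)^2$$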
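A minimal Python sketch of the training rule, stepping each weight against its partial derivative. The toy error surface, the function names, and the values of the learning rate and step count are assumptions chosen only to make the loop easy to follow.

```python
def gradient_descent(grad, w, eta=0.1, steps=200):
    """Repeatedly apply w_i <- w_i - eta * dE/dw_i."""
    for _ in range(steps):
        g = grad(w)
        w = [wi - eta * gi for wi, gi in zip(w, g)]
    return w


# Assumed toy error surface: E[w] = 1/2 * ((w0 - 1)^2 + (w1 + 2)^2),
# whose single minimum is at w = (1, -2).
def toy_gradient(w):
    return [w[0] - 1.0, w[1] + 2.0]


w_final = gradient_descent(toy_gradient, [0.0, 0.0])
print(w_final)
```

On this bowl-shaped surface there is only one minimum, so the weights settle there; the error surface of a real network can have several.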
Gradient Descent
[Figure: error surface plotted over axes i1 and i2, with the global minimum marked: this is your goal.]
It should be 4-D (3 weights), but you get the idea.
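The output layer
[Figure: a chain of units k, j, i connected by weights wjk and wij; yi is the output of unit i and ti is its target.]
$$E = \text{Error} = \frac{1}{2} \sum_i (t_i - y_i)^2$$
Each weight moves against the gradient, where η is the learning rate:
$$W_{ij} \leftarrow W_{ij} - \eta \frac{\partial E}{\partial W_{ij}}$$
$$\Delta W_{ij} = -\eta \frac{\partial E}{\partial W_{ij}}$$
$$\frac{\partial E}{\partial W_{ij}} = \frac{\partial E}{\partial y_i} \cdot \frac{\partial y_i}{\partial x_i} \cdot \frac{\partial x_i}{\partial W_{ij}} = -(t_i - y_i) \, f'(x_i) \, y_j$$
The derivative of the sigmoid is just $y_i (1 - y_i)$:
$$\Delta W_{ij} = \eta \, (t_i - y_i) \, y_i (1 - y_i) \, y_j$$
$$\Delta W_{ij} = \eta \, y_j \, \delta_i, \qquad \delta_i = (t_i - y_i) \, y_i (1 - y_i)$$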
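The hidden layer
[Figure: a chain of units k, j, i connected by weights wjk and wij; yi is the output of unit i and ti is its target.]
$$E = \text{Error} = \frac{1}{2} \sum_i (t_i - y_i)^2$$
$$\Delta W_{jk} = -\eta \frac{\partial E}{\partial W_{jk}}$$
$$\frac{\partial E}{\partial W_{jk}} = \frac{\partial E}{\partial y_j} \cdot \frac{\partial y_j}{\partial x_j} \cdot \frac{\partial x_j}{\partial W_{jk}}$$
$$\frac{\partial E}{\partial y_j} = \sum_i \frac{\partial E}{\partial y_i} \cdot \frac{\partial y_i}{\partial x_i} \cdot \frac{\partial x_i}{\partial y_j} = -\sum_i (t_i - y_i) \, f'(x_i) \, W_{ij}$$
$$\frac{\partial E}{\partial W_{jk}} = -\left[ \sum_i (t_i - y_i) \, f'(x_i) \, W_{ij} \right] f'(x_j) \, y_k$$
$$\Delta W_{jk} = \eta \left[ \sum_i (t_i - y_i) \, y_i (1 - y_i) \, W_{ij} \right] y_j (1 - y_j) \, y_k$$
$$\Delta W_{jk} = \eta \, y_k \, \delta_j, \qquad \delta_j = \left[ \sum_i (t_i - y_i) \, y_i (1 - y_i) \, W_{ij} \right] y_j (1 - y_j)$$
$$\delta_j = \left[ \sum_i W_{ij} \, \delta_i \right] y_j (1 - y_j)$$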
Let’s just do an example
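[Figure: a single sigmoid unit f with inputs i1 and i2 and a bias input b = 1, weights w01, w02 and w0b, net input x0 and output y0.]
$$E = \frac{1}{2} (t_0 - y_0)^2$$
Target function:
i1   i2   y0
0    0    0
0    1    1
1    0    1
1    1    1
Weights shown in the figure: 0.8, 0.6 and 0.5 (the bias weight w0b is 0.5).
For the input i1 = 0, i2 = 0 (target t0 = 0):
$$x_0 = 0.5, \qquad y_0 = \frac{1}{1 + e^{-0.5}} = 0.6224$$
$$E = \frac{1}{2} (0 - 0.6224)^2 = 0.1937$$
$$\Delta W_{ij} = \eta \, y_j \, \delta_i, \qquad \delta_i = (t_i - y_i) \, y_i (1 - y_i)$$
$$\delta_0 = (t_0 - y_0) \, y_0 (1 - y_0) = (0 - 0.6224) \cdot 0.6224 \cdot (1 - 0.6224) = -0.1463$$
$$\Delta W_{01} = \eta \, i_1 \, \delta_0 = 0$$
$$\Delta W_{02} = \eta \, i_2 \, \delta_0 = 0$$
$$\Delta W_{0b} = \eta \, b \, \delta_0$$
Suppose the learning rate is η = 0.5:
$$\Delta W_{0b} = 0.5 \cdot 1 \cdot (-0.1463) = -0.0731$$
so the bias weight becomes $w_{0b} + \Delta W_{0b} \approx 0.4268$.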
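The worked example can be replayed in a few lines of Python. This is a sketch under stated assumptions: the bias weight 0.5 follows from the net input of 0.5, but assigning 0.8 to w01 and 0.6 to w02 is a guess at the figure's layout (it does not affect this step, because both inputs are 0).

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# Assumed assignment of the figure's weights; only w0b matters for input (0, 0).
w01, w02, w0b = 0.8, 0.6, 0.5
i1, i2, b = 0, 0, 1      # input pattern plus constant bias input
t0 = 0                   # target for this pattern
eta = 0.5                # learning rate

x0 = w01 * i1 + w02 * i2 + w0b * b      # net input
y0 = sigmoid(x0)                        # unit output
error = 0.5 * (t0 - y0) ** 2
delta0 = (t0 - y0) * y0 * (1 - y0)      # generalized error term

dw01 = eta * i1 * delta0
dw02 = eta * i2 * delta0
dw0b = eta * b * delta0
new_w0b = w0b + dw0b

print(y0, error, delta0, dw0b, new_w0b)
```

The printed values agree with the slide's figures up to rounding in the last digit. Only the bias weight changes, because the two inputs are 0 and so contribute nothing to the error.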
An informal account of BackProp
For each pattern in the training set:
Compute the error at the output nodes
Compute Δw for each weight in the 2nd layer
Compute delta (generalized error expression) for hidden units
Compute Δw for each weight in the 1st layer
After amassing Δw for all weights, change each weight a little bit, as determined by the learning rate:
$$\Delta_p w_{ij} = \eta \, \delta_{ip} \, o_{jp}$$
Backpropagation Algorithm
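Initialize all weights to small random numbers.
For each training example do:
  For each hidden unit h: $y_h = \sigma\left( \sum_i w_{hi} \, x_i \right)$
  For each output unit k: $y_k = \sigma\left( \sum_h w_{kh} \, x_h \right)$, where $x_h = y_h$ is the output of hidden unit h
  For each output unit k: $\delta_k = y_k (1 - y_k)(t_k - y_k)$
  For each hidden unit h: $\delta_h = y_h (1 - y_h) \sum_k w_{kh} \, \delta_k$
  Update each network weight $w_{ij}$: $w_{ij} \leftarrow w_{ij} + \Delta w_{ij}$ with $\Delta w_{ij} = \eta \, \delta_j \, x_{ij}$, where $x_{ij}$ is the input that unit j receives from unit i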
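The algorithm above can be sketched in plain Python for a network with one hidden layer. Everything here is an illustrative assumption rather than the lecture's own code: the function names, the XOR training set, the 2-hidden-unit layout, the weight range of ±0.5, and the default learning rate and epoch count.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(x, w_hid, w_out):
    """x[0] must be the bias input 1. Returns (hidden outputs incl. bias, outputs)."""
    y_hid = [1.0] + [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hid]
    y_out = [sigmoid(sum(w * yh for w, yh in zip(row, y_hid))) for row in w_out]
    return y_hid, y_out


def backprop_step(x, t, w_hid, w_out, eta):
    """One online update for a single example; modifies the weights in place."""
    y_hid, y_out = forward(x, w_hid, w_out)
    # Output deltas: delta_k = y_k (1 - y_k) (t_k - y_k)
    delta_out = [yk * (1 - yk) * (tk - yk) for yk, tk in zip(y_out, t)]
    # Hidden deltas: delta_h = y_h (1 - y_h) * sum_k w_kh delta_k
    # (index h + 1 skips the bias slot at position 0 of y_hid)
    delta_hid = [
        y_hid[h + 1] * (1 - y_hid[h + 1])
        * sum(w_out[k][h + 1] * delta_out[k] for k in range(len(w_out)))
        for h in range(len(w_hid))
    ]
    # Weight updates: w <- w + eta * delta * input on that connection
    for k, row in enumerate(w_out):
        for h in range(len(row)):
            row[h] += eta * delta_out[k] * y_hid[h]
    for h, row in enumerate(w_hid):
        for i in range(len(row)):
            row[i] += eta * delta_hid[h] * x[i]


def total_error(data, w_hid, w_out):
    """E = 1/2 * sum over examples and output units of (t - y)^2."""
    return sum(
        0.5 * (tk - yk) ** 2
        for x, t in data
        for tk, yk in zip(t, forward(x, w_hid, w_out)[1])
    )


def train_xor(epochs=5000, eta=0.5, n_hidden=2, seed=0):
    """Train on XOR and return (error before training, error after training)."""
    rng = random.Random(seed)
    data = [([1.0, a, b], [float(a != b)]) for a in (0.0, 1.0) for b in (0.0, 1.0)]
    w_hid = [[rng.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(n_hidden)]
    w_out = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]]
    before = total_error(data, w_hid, w_out)
    for _ in range(epochs):
        for x, t in data:
            backprop_step(x, t, w_hid, w_out, eta)
    return before, total_error(data, w_hid, w_out)


if __name__ == "__main__":
    print(train_xor())
```

The hidden deltas are computed before any weight is changed, so they use the same output weights that produced the forward pass. Whether a given run actually reaches a low error depends on the random starting weights.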
Momentum term
The speed of learning is governed by the learning rate. If the rate is low, convergence is slow. If the rate is too high, the error oscillates without reaching the minimum.
Momentum tends to smooth small weight error fluctuations.
$$\Delta w_{ij}(n) = \alpha \, \Delta w_{ij}(n-1) + \eta \, \delta_j(n) \, y_i(n)$$
$$0 \le \alpha < 1$$
The momentum accelerates the descent in steady downhill directions.
The momentum has a stabilizing effect in directions that oscillate in time.
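Both effects can be seen in a small Python sketch of the momentum update. The function names and the values η = 0.1 and α = 0.9 are assumptions for illustration; the product δ·y is fed in as a single "gradient term" that either stays constant (steady downhill) or flips sign every step (oscillation).

```python
def momentum_delta(prev_delta, grad_term, eta, alpha):
    """Weight change: alpha * (previous change) + eta * (delta_j * y_i)."""
    return alpha * prev_delta + eta * grad_term


def run(grad_terms, eta=0.1, alpha=0.9):
    """Feed a sequence of gradient terms through the update; return the last change."""
    delta = 0.0
    for g in grad_terms:
        delta = momentum_delta(delta, g, eta, alpha)
    return delta


steady = run([1.0] * 500)                               # same sign every step
oscillating = run([(-1.0) ** n for n in range(500)])    # sign flips every step
plain = run([1.0] * 500, alpha=0.0)                     # no momentum

print(steady, abs(oscillating), plain)
```

With a constant gradient term the step size grows toward η/(1 - α), larger than the plain step η; with an alternating term it shrinks toward η/(1 + α), smaller than η.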
Convergence
May get stuck in local minima
Weights may diverge
…but often works well in practice
Representation power:
2 layer networks: can approximate any continuous function
3 layer networks: can approximate any function
Stopping criteria
Sensible stopping criteria:
Total mean squared error change: Back-prop is considered to have converged when the absolute rate of change in the average squared error per epoch is sufficiently small (in the range [0.01, 0.1]).
Generalization-based criterion: After each epoch the NN is tested for generalization. If the generalization performance is adequate, then stop. If this stopping criterion is used, the part of the training set held out for testing generalization is not used for updating the weights.
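A minimal Python sketch of the two stopping checks. The function names, the default tolerance, and the idea of passing a target error for "adequate" generalization are assumptions; the slides do not say how adequacy is judged, and "rate of change" is read here as the plain difference between consecutive epochs.

```python
def error_change_converged(epoch_errors, tolerance=0.01):
    """True when the average squared error moved less than `tolerance`
    between the last two epochs."""
    if len(epoch_errors) < 2:
        return False
    return abs(epoch_errors[-1] - epoch_errors[-2]) < tolerance


def generalization_adequate(held_out_error, target_error):
    """True when the error on held-out examples (never used for weight
    updates) is at or below an assumed target."""
    return held_out_error <= target_error


history = [0.50, 0.31, 0.20, 0.195]
print(error_change_converged(history))
print(generalization_adequate(held_out_error=0.22, target_error=0.25))
```

In a training loop, either check would be evaluated once per epoch, after the weights have been updated on the training portion of the data.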
Summary
Multiple layer feed-forward networks
Replace Step with Sigmoid (differentiable) function
Learn weights by gradient descent on error function
Backpropagation algorithm for learning
Avoid overfitting by early stopping
Use MLP Neural Networks when …
(vectored) Real inputs, (vectored) real outputs
You’re not interested in understanding how it works
Long training times acceptable
Short execution (prediction) times required
Robust to noise in the dataset
Applications of FFNN
Classification, pattern recognition:
FFNN can be applied to tackle non-linearly separable learning problems.
  Recognizing printed or handwritten characters
  Face recognition
  Classification of loan applications into credit-worthy and non-credit-worthy groups
  Analysis of sonar and radar data to determine the nature of the source of a signal
Regression and forecasting:
FFNN can be applied to learn non-linear functions (regression), and in particular functions whose input is a sequence of measurements over time (time series).
Elman Nets & Jordan Nets
Updating the context as we receive input
In Jordan nets we model “forgetting” as well
The recurrent connections have fixed weights
You can train these networks using good ol’ backprop
[Figure: two network diagrams, each with Output, Hidden, Context and Input layers; the recurrent connections carry fixed weights labeled 1 and α.]
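A small Python sketch of how the context units might be updated in the two architectures. It assumes the standard formulations, which the slides do not spell out: an Elman context copies the previous hidden activations (fixed weight 1), while a Jordan context adds the previous output to its own previous value scaled by α, which is what models "forgetting". The function names are illustrative.

```python
def elman_context(hidden_prev):
    """Elman net: context is a copy of the previous hidden activations."""
    return list(hidden_prev)


def jordan_context(context_prev, output_prev, alpha):
    """Jordan net: context decays by alpha and accumulates the previous output."""
    return [alpha * c + o for c, o in zip(context_prev, output_prev)]


# Feed a constant output of 1 into a Jordan context with alpha = 0.5.
context = [0.0]
trace = []
for _ in range(3):
    context = jordan_context(context, [1.0], alpha=0.5)
    trace.append(context[0])

print(elman_context([0.2, 0.7]))
print(trace)
```

Because these recurrent weights are fixed, only the feed-forward weights need training, which is why ordinary backprop still applies.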