
Neural Networks - lecture 5

Multi-layer neural networks

• Motivation
• Choosing the architecture
• Functioning. FORWARD algorithm
• Neural networks as universal approximators
• The Backpropagation algorithm


Multi-layer neural networks: Motivation

One-layer neural networks have limited approximation capacity. Example: XOR (the parity function) cannot be represented using one layer, but it can be represented using two layers.

Two different architectures to solve the same problem


Choosing the architecture

Let us consider an association problem (classification or approximation) characterized by:

• N input data
• M output data

The neural network will have:
• N input units
• M output units
• How many hidden units? This is a difficult problem.

Heuristic hint: use as few hidden units as possible!

Example: one hidden layer whose number of units is determined by N and M.


Architecture and notations

Feedforward network with K layers

(Figure: feedforward network with layers 0, 1, ..., K. Layer 0 is the input layer, with Y0 = X0 = X; layers 1, ..., K-1 are hidden layers; layer K is the output layer. Layer k receives the aggregated input Xk through the weight matrix Wk, applies the activation function Fk and produces the output Yk.)


Functioning

Computation of the output vector:

Y_k = F_k(X_k) = F_k(W_k * Y_{k-1}),  k = 1, ..., K

Y_K = F_K(W_K * F_{K-1}(W_{K-1} * ... F_1(W_1 * X) ... ))

FORWARD Algorithm (propagation of the input signal toward the output layer)

Y[0] := X   (X is the input signal)
FOR k := 1, K DO
  X[k] := W[k] * Y[k-1]
  Y[k] := F[k](X[k])
ENDFOR

Rmk: Y[K] is the output of the network
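As an illustration, here is a minimal NumPy sketch of the FORWARD algorithm, assuming the weight matrices W[1], ..., W[K] are stored in a Python list and, for simplicity, that every layer uses the same sigmoid activation (the layer-specific functions F[k] and the bias terms are straightforward generalizations):

import numpy as np

def sigmoid(x):
    # logistic activation, applied element-wise
    return 1.0 / (1.0 + np.exp(-x))

def forward(weights, x):
    # FORWARD algorithm: weights = [W1, ..., WK], x = input signal X
    y = x                    # Y[0] = X
    for W in weights:        # k = 1, ..., K
        x_k = W @ y          # X[k] = W[k] * Y[k-1]
        y = sigmoid(x_k)     # Y[k] = F[k](X[k])
    return y                 # Y[K] is the output of the network

# hypothetical example: 3 input units, 4 hidden units, 2 output units
rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
print(forward(weights, np.array([0.5, -1.0, 2.0])))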


A particular case

One hidden layer

Adaptive parameters: W1, W2

y_i = f2( sum_{k=0..N1} w2_{ik} * f1( sum_{j=0..N0} w1_{kj} * x_j ) ),  i = 1, ..., N2
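For this particular case the FORWARD pass reduces to two matrix-vector products. A minimal sketch, reusing the sigmoid helper from the previous example as both f1 and f2, and omitting the bias terms (assumed here to correspond to the index-0 components of the sums):

def forward_two_layer(W1, W2, x):
    # hidden layer: y_k = f1( sum_j w1_kj * x_j )
    y_hidden = sigmoid(W1 @ x)
    # output layer: y_i = f2( sum_k w2_ik * y_k )
    return sigmoid(W2 @ y_hidden)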


Neural networks – universal approximators

Theoretical result:

Any continuous function T : D ⊂ R^N -> R^M can be approximated with arbitrary accuracy by a neural network having the following architecture:

• N input units
• M output units
• "Enough" hidden units with monotonically increasing and bounded activation functions (e.g. sigmoidal functions)

The accuracy of the approximation depends on the number of hidden units.


Neural networks – universal approximators

Typical problems when solving function approximation problems with neural networks:

• Representation problem: "can the network represent the desired function?"
  - See the previous result

• Learning problem: "is it possible to find values of the adaptive parameters such that the desired function is approximated with the desired accuracy?"
  - A training set and a learning algorithm are needed

• Generalization problem: "is the neural network able to extrapolate the knowledge extracted from the training set?"
  - The training process should be carefully controlled in order to avoid overtraining and to enhance the generalization ability


Neural networks – universal approximators

Applications which can be interpreted as association (approximation) problems:

• Classification problems (association between a pattern and a class label)
  - Architecture: input size = pattern size; output size = number of classes; hidden layer size = problem dependent

• Prediction problems (estimate the next value of a time series based on a set of previous values)
  - Architecture: input size = number of previous values (predictors); output size = 1 (one-dimensional prediction); hidden layer size = problem dependent
  - Example: y(t) = T(y(t-1), y(t-2), ..., y(t-N))


Neural networks – universal approximators

• Compression problems (compress and decompress vectorial data)

(Figure: input data -> [W1] -> compressed data -> [W2] -> output data)

• Input size = output size

• Hidden layer size = input size * compression ratio

Example: for a compression ratio of 1:2 the hidden layer will have half the size of the input layer

• Training set: {(X1,X1),…,(XL,XL)}

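A minimal sketch of how such a compression network could be set up, with a hypothetical input size of 8, a 1:2 compression ratio and the NumPy imports from the earlier examples:

# 8 inputs, 4 hidden units (compression ratio 1:2), 8 outputs
rng = np.random.default_rng(1)
W1 = rng.normal(size=(4, 8))        # compression: input -> compressed data
W2 = rng.normal(size=(8, 4))        # decompression: compressed data -> output
data = rng.normal(size=(5, 8))      # 5 hypothetical training vectors X1, ..., X5
training_set = [(x, x) for x in data]   # targets equal the inputs: {(Xl, Xl)}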


Learning process

Learning is based on minimizing an error function.
• Training set: {(x1,d1), ..., (xL,dL)}
• Aim of the learning process: find W which minimizes the error function
• Minimization method: gradient method
• Error function (one hidden layer):

E(W) = 1/(2L) * sum_{l=1..L} sum_{i=1..N2} ( d_i^l - f2( sum_{k=0..N1} w2_{ik} * f1( sum_{j=0..N0} w1_{kj} * x_j^l ) ) )^2
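A minimal sketch of this error function, reusing the forward_two_layer helper above; the 1/(2L) factor and the squared differences follow the formula on the slide:

def error(W1, W2, training_set):
    # E(W) = 1/(2L) * sum_l sum_i ( d_i^l - y_i^l )^2
    L = len(training_set)
    total = 0.0
    for x, d in training_set:
        y = forward_two_layer(W1, W2, x)
        total += np.sum((d - y) ** 2)
    return total / (2 * L)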


Learning process

Gradient based adjustment:

w_{ij}(t+1) = w_{ij}(t) - eta * dE(W)/dw_{ij}

E(W) = 1/(2L) * sum_{l=1..L} sum_{i=1..N2} ( d_i^l - f2( sum_{k=0..N1} w2_{ik} * f1( sum_{j=0..N0} w1_{kj} * x_j^l ) ) )^2 = (1/L) * sum_{l=1..L} E_l(W)

(Figure: notation: x_k and y_k denote the aggregated input and the output of hidden unit k; x_i and y_i denote the aggregated input and the output of output unit i.)


Learning process
• Partial derivatives computation

E(W) = 1/(2L) * sum_{l=1..L} sum_{i=1..N2} ( d_i^l - f2( sum_{k=0..N1} w2_{ik} * f1( sum_{j=0..N0} w1_{kj} * x_j^l ) ) )^2

(Figure: x_k, y_k are the aggregated input and the output of hidden unit k; x_i, y_i are the aggregated input and the output of output unit i, computed for example l.)

∂E_l(W)/∂w2_{ik} = -( d_i^l - y_i^l ) * f2'(x_i) * y_k

∂E_l(W)/∂w1_{kj} = -sum_{i=1..N2} ( d_i^l - y_i^l ) * f2'(x_i) * w2_{ik} * f1'(x_k) * x_j^l
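A minimal NumPy sketch of these partial derivatives for a single training example (x, d), assuming sigmoid activations for both layers so that f'(x) = f(x) * (1 - f(x)) can be expressed through the unit outputs; the names delta_out and delta_hidden are introduced here for readability and are not part of the slides' notation:

def gradients(W1, W2, x, d):
    # forward pass (stores the unit outputs needed by the backward pass)
    y_hidden = sigmoid(W1 @ x)               # y_k = f1(x_k)
    y_out = sigmoid(W2 @ y_hidden)           # y_i = f2(x_i)
    # output layer deltas: -(d_i - y_i) * f2'(x_i)
    delta_out = -(d - y_out) * y_out * (1 - y_out)
    # hidden layer deltas: ( sum_i delta_out_i * w2_ik ) * f1'(x_k)
    delta_hidden = (W2.T @ delta_out) * y_hidden * (1 - y_hidden)
    # dE_l/dw2_ik = delta_out_i * y_k ;  dE_l/dw1_kj = delta_hidden_k * x_j
    grad_W2 = np.outer(delta_out, y_hidden)
    grad_W1 = np.outer(delta_hidden, x)
    return grad_W1, grad_W2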


The BackPropagation Algorithm

Main idea:

For each example in the training set:

- compute the output signal

- compute the error corresponding to the output level

- propagate the error back into the network and store the corresponding delta values for each layer

- adjust each weight by using the error signal and input signal for each layer

Computation of the output signal (FORWARD)

Computation of the error signal (BACKWARD)


The BackPropagation Algorithm

General structure

Random initialization of weights

REPEAT

FOR l=1,L DO

FORWARD stage

BACKWARD stage

weights adjustment

ENDFOR

Error (re)computation

UNTIL <stopping condition>

Rmk:
• The weights adjustment depends on the learning rate.
• The error computation requires recomputing the output signal for the new values of the weights.
• The stopping condition depends on the value of the error and on the number of epochs.
• This is the so-called serial (incremental) variant: the adjustment is applied separately for each example from the training set (a sketch of this variant is given below).

(One iteration of the REPEAT loop, i.e. one pass through all training examples, is called an epoch.)
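A minimal sketch of this serial (incremental) variant, reusing the error and gradients helpers above; the learning rate eta, the error threshold eps and the maximum number of epochs are hypothetical choices:

def backpropagation_incremental(W1, W2, training_set, eta=0.1,
                                max_epochs=1000, eps=1e-3):
    for epoch in range(max_epochs):
        for x, d in training_set:              # FORWARD + BACKWARD stages
            g1, g2 = gradients(W1, W2, x, d)
            W1 -= eta * g1                     # weights adjustment after
            W2 -= eta * g2                     # each training example
        if error(W1, W2, training_set) < eps:  # error (re)computation
            break                              # stopping condition
    return W1, W2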


The BackPropagation Algorithm

Batch variant

Random initialization of weights

REPEAT

initialize the variables which will contain the adjustments

FOR l=1,L DO

FORWARD stage

BACKWARD stage

cumulate the adjustments

ENDFOR

Apply the cumulated adjustments

Error (re)computation

UNTIL <stopping condition>

Rmk:
• The incremental variant can be sensitive to the presentation order of the training examples.
• The batch variant is not sensitive to this order and is more robust to errors in the training examples.
• It is the starting point for more elaborate variants, e.g. the momentum variant (a sketch of the batch variant is given below).

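A minimal sketch of the batch variant, with the same hypothetical eta, eps and epoch limit; the only change with respect to the incremental variant is that the adjustments are cumulated over the whole training set and applied once per epoch:

def backpropagation_batch(W1, W2, training_set, eta=0.1,
                          max_epochs=1000, eps=1e-3):
    for epoch in range(max_epochs):
        dW1 = np.zeros_like(W1)                # initialize the adjustments
        dW2 = np.zeros_like(W2)
        for x, d in training_set:              # FORWARD + BACKWARD stages
            g1, g2 = gradients(W1, W2, x, d)
            dW1 += g1                          # cumulate the adjustments
            dW2 += g2
        W1 -= eta * dW1                        # apply the cumulated adjustments
        W2 -= eta * dW2
        if error(W1, W2, training_set) < eps:  # error (re)computation
            break
    return W1, W2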


Problems of BackPropagation
• Low convergence rate (the error decreases too slowly)

• Oscillations (the error value oscillates instead of continuously decreasing)

• Local minima problems (the learning process gets stuck in a local minimum of the error function)

• Stagnation (the learning process stagnates even if it is not in a local minimum)

• Overtraining and limited generalization


Generalization capacity

The generalization capacity of a neural network depends on:

• Network architecture (e.g. number of hidden units)
  - A large number of hidden units can lead to overtraining (the network extracts not only the useful knowledge but also the noise in the data)

• The size of the training set
  - Too few examples are not enough to train the network

• The number of training epochs (accuracy on the training set)
  - Too many epochs can lead to overtraining