
Introduction to Artificial Neural Networks

Dynerowicz Seweryn

Facultés Universitaires Notre-Dame de la Paix

27 March 2007


Outline

1 Introduction

2 Fundamentals: Biological neuron, Artificial neuron, Artificial Neural Network


3 Single-layer ANN: Perceptron, Adaline, Limitations

4 Multi-layer ANN: Topology, Generalised Delta Rule, Deficiencies

5 Recurrent ANN: Jordan Network, Hopfield Network


6 Application: Smart sweepers

7 Conclusion


History

1943: W. McCulloch & W. Pitts: first model of the artificial neuron

1949: D. Hebb describes the first learning rule (Hebb's Law)

1957: F. Rosenblatt designs the Perceptron

1965: Nils J. Nilsson publishes "Learning Machines" (automated learning fundamentals)


1969: M. Minsky & S. Papert expose limitations of the Perceptron (XOR problem)

1975: first multi-layer ANN with a training algorithm (Cognitron)

1982: Hopfield networks (J. Hopfield), Self-Organizing Map (T. Kohonen)

1986: backpropagation algorithm (R. Williams, D. Rumelhart & G. Hinton)


Type of problems

ANNs can be used to solve certain types of problems:

Classification

Pattern recognition

Artificial Intelligence

Unknown function approximation

Modelling complicated functions

Stock exchange estimations

Data processing (filtering, clustering, ...)

...


A view of the biological neuron

The human nervous system is composed of about 10^11 neurons.

The synapses are the connections between axon terminals and dendrites.

The synapses are characterized by a level of effectiveness.


The impulses received at the dendrites are summed together.

If the sum is above the stimulation threshold, the nucleus emits a spike down the axon.


A view of the artificial neuron

x_i : input from unit i

w_ij : weight of the connection from unit i to unit j

θ_j : bias of unit j

ϕ : activation function

o_j : state of activation of unit j


Propagation rule

Sigma unit type:

net_j = ∑_{i=1}^{n} w_ij x_i

Sigma-pi unit type (Feldman & Ballard):

net_j = ∑_{i=1}^{n} w_ij ∏_{k=1}^{m} x_ik
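As an illustration (not part of the original slides; the function names are made up), both propagation rules take only a few lines of Python:

    # Minimal sketch of the two propagation rules (illustrative only).
    from math import prod

    def net_sigma(weights, inputs):
        """Sigma unit: weighted sum of the inputs."""
        return sum(w * x for w, x in zip(weights, inputs))

    def net_sigma_pi(weights, input_groups):
        """Sigma-pi unit: each weight multiplies the product of a group of inputs."""
        return sum(w * prod(group) for w, group in zip(weights, input_groups))

    print(net_sigma([1.0, 2.0], [0.5, 1.5]))               # 0.5 + 3.0 = 3.5
    print(net_sigma_pi([1.0, 2.0], [[0.5], [1.5, 2.0]]))   # 0.5 + 2*3.0 = 6.5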


Activation function

Threshold function (Heaviside or sgn):

ϕ(v) = 1 if v ≥ 0, 0 if v < 0

Semi-linear function:

ϕ(v) = 1 if v ≥ 1/2, v if −1/2 < v < 1/2, 0 if v ≤ −1/2

Sigmoid function:

ϕ(v) = 1 / (1 + e^{−kv})
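A small Python sketch of the three functions as reconstructed above (the middle branch of the semi-linear function follows the slide's notation; k is the sigmoid steepness):

    # Sketch of the three activation functions.
    import math

    def threshold(v):
        return 1.0 if v >= 0 else 0.0

    def semi_linear(v):
        if v >= 0.5:
            return 1.0
        if v <= -0.5:
            return 0.0
        return v          # middle branch, as reconstructed from the slide

    def sigmoid(v, k=1.0):
        return 1.0 / (1.0 + math.exp(-k * v))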


Topology

A separation can be made between two types of ANN structure:

Feed-forward: acyclic, layer-decomposable graph (Perceptron, Adaline)

Recurrent (Jordan, Hopfield, Kohonen)


What is an ANN ?

ANN : I ⇒ O

I : set of inputs

O : set of outputs

Initially, the ANN will not respond correctly to given inputs.

Because the weights and biases are not yet adapted.


How does an ANN learn?

The most interesting characteristic of an ANN is its capacity to generalize information from samples.

This generalisation occurs through the learning process.

Main idea: successive weight adjustments (gradient descent method)

Supervised learning (or Associative learning)

Unsupervised learning (or Self-organisation)


Supervised learning

Idea :

generate a population of input-output pairs

feed the ANN with the input

re-adjust the weights and biases if the ANN doesn't output what is expected

The representation of the data is imposed on the ANN


Unsupervised learning

Idea :

generate a population of input samples

feed the population to the ANN

let it extract statistical properties from the population

The representation of the data is defined by the ANN


Hebbian Learning Rule

w_ij = w_ij + δw_ij

δw_ij = γ o_i o_j

with:

w_ij : weight from unit i to unit j

γ : learning rate

o_i : state of activation of unit i

Virtually all learning rules can be considered variants of the HLR.
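As a sketch (illustrative helper name, not from the slides), one Hebbian step is simply:

    # One Hebbian update of the weight from unit i to unit j.
    def hebbian_update(w_ij, o_i, o_j, gamma=0.1):
        delta = gamma * o_i * o_j      # δw_ij = γ o_i o_j
        return w_ij + delta

    print(hebbian_update(0.5, 1.0, 1.0))   # 0.6 : co-active units strengthen the link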


Learning Rate

Defines the speed at which the ANN will learn

Usually a constant

0 < γ ≤ 1

γ → 0 : slow convergence but stable solution.

γ → 1 : fast convergence but unstable solution.


Over-fitting

Excessive training or an unsuitable training set can lead to an "over-fitted" network.

An over-fitted ANN is specialised for the set which was used to train it.

It has lost a great part of its generalisation capability.


Two issues

Representational power : the ability of an ANN to represent a desired function. Since an ANN is built from standard functions, it can only approximate the desired function, even for an optimal set of weights. Ergo, the approximation error can never be equal to 0.

Learning algorithm : given that there exists a set of optimal weights (i.e. weights which minimize the approximation error), is there a procedure to compute them?


Perceptron

Proposed by F. Rosenblatt in 1957.

A Perceptron is a single-layer ANN.

Composed of one or more output neurons, connected to all inputs.

Typically used as a linear classifier.


Simple case

Consider the following Perceptron :

1 neuron

2 inputs

1 output

Threshold-type activation function :

ϕ(v) = 1 if v > 0, −1 otherwise

We can use it as a classifier with a separation line :

w_1 x_1 + w_2 x_2 + θ = 0


x_2 = −(w_1/w_2) x_1 − θ/w_2


Perceptron learning

Learning consists of successive weight adjustments:

w_ij = w_ij + ∆w_ij

θ_j = θ_j + ∆θ_j

Problem: how do we compute ∆w_ij and ∆θ_j?


Perceptron Learning Rule

Consider a set of learning samples (x, d(x)), with :

x : input vector

d(x) : desired output

Learning Method :

1. Start with random weights for the connections;

2. Select an input vector x from the set of training samples;

3. If o ≠ d(x) (the perceptron gives an incorrect response), modify all connections w_ij according to:

∆w_ij = d_j(x) x_i

∆θ_j = d_j(x)

4. Go back to 2.
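A compact Python sketch of this loop for the single-neuron, two-input case (the names are illustrative, not from the slides):

    # Sketch of the perceptron learning rule for one neuron with two inputs.
    def train_perceptron(samples, w, theta, epochs=10):
        """samples: list of ((x1, x2), d) pairs with d in {-1, +1}."""
        for _ in range(epochs):
            for (x1, x2), d in samples:
                net = w[0] * x1 + w[1] * x2 + theta
                o = 1 if net > 0 else -1
                if o != d:                 # incorrect response: adjust
                    w[0] += d * x1         # ∆w_i = d(x) x_i
                    w[1] += d * x2
                    theta += d             # ∆θ = d(x)
        return w, theta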


Convergence Theorem

Theorem 1. If there exists a set of connection weights w∗ which is able to perform the transformation o = d(x), the perceptron learning rule will converge to some solution (which may or may not be the same as w∗) in a finite number of steps for any initial choice of the weights.


Numerical example

Initial parameters :

w_1 = 1, w_2 = 2, θ = −2

Set of samples :

Sample A : x = (0.5, 1.5) ; d(x) = 1

Sample B : x = (-0.5, 0.5) ; d(x) = -1

Sample C : x = (0.5, 0.5) ; d(x) = 1


Sample A :

net = 0.5 + 3 − 2 = 1.5 > 0 ⇒ o = 1 ✓

Sample B :

net = −0.5 + 1 − 2 = −1.5 ≤ 0 ⇒ o = −1 ✓


Sample C :

net = 0.5 + 1 − 2 = −0.5 ≤ 0 ⇒ o = −1 ✗

Updated weights and bias :

w_1 = w_1 + ∆w_1 = 1 + 0.5 = 1.5

w_2 = w_2 + ∆w_2 = 2 + 0.5 = 2.5

θ = θ + ∆θ = −2 + 1 = −1
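The arithmetic above can be checked with a few lines of Python (a standalone check, not code from the talk):

    # Check of the numerical example: initial weights (1, 2), bias -2.
    w, theta = [1.0, 2.0], -2.0
    samples = [((0.5, 1.5), 1), ((-0.5, 0.5), -1), ((0.5, 0.5), 1)]

    for (x1, x2), d in samples:
        net = w[0] * x1 + w[1] * x2 + theta
        o = 1 if net > 0 else -1
        print(net, o, "correct" if o == d else "wrong")
        if o != d:                          # only sample C triggers an update
            w[0] += d * x1
            w[1] += d * x2
            theta += d

    print(w, theta)                         # [1.5, 2.5] -1.0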


Adaptive Linear Element (Adaline)

Proposed by B. Widrow and T. Hoff in 1960.

Uses a generalised version of the Perceptron Learning Rule, known as the Delta Rule.

Focus is put on net_j instead of o_j.


Delta rule (Widrow-Hoff)

Main idea: minimize the error in the output through gradient descent.

∆w_i = γ (d^p − y^p) x_i

with

γ : learning rate

d^p : expected output for input p

y^p : obtained output for input p


Delta Rule derivation

Consider a single-layer ANN with an output unit using a linear activation function:

y = ∑_i w_i x_i + θ

The objective is to minimize the total error given by :

E = (1/2) ∑_p (d^p − y^p)^2

The idea is to adjust each weight proportionally to the negative of the derivative of the error with respect to that weight:

∆_p w_i = −γ ∂E^p/∂w_i


We can split this derivative using the chain rule:

∂E^p/∂w_i = (∂E^p/∂y^p) (∂y^p/∂w_i)

The right derivative can be rewritten as :

∂y^p/∂w_i = x_i

because of the linearity of the activation function.


The left derivative can be rewritten as :

∂E^p/∂y^p = −(d^p − y^p)

We obtain the Delta Rule :

∆_p w_i = γ (d^p − y^p) x_i
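A sketch of delta-rule training for a single linear unit (illustrative names; the bias is updated like a weight on a constant input of 1):

    # Sketch of Widrow-Hoff (delta rule) training for one linear unit.
    def train_adaline(samples, w, theta, gamma=0.1, epochs=50):
        """samples: list of (inputs, d) pairs; the unit output is linear."""
        for _ in range(epochs):
            for x, d in samples:
                y = sum(wi * xi for wi, xi in zip(w, x)) + theta
                error = d - y
                for i, xi in enumerate(x):
                    w[i] += gamma * error * xi   # ∆w_i = γ (d − y) x_i
                theta += gamma * error           # bias update
        return w, theta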


XOR Problem

If no linear separation exists, a single-layer ANN cannot classify properly.

This limitation was exposed by Minsky and Papert through the XOR problem.

It is impossible to teach a single-layer ANN to solve the XOR problem.

Solution: add hidden layers to the ANN.


Topology

A multi-layer ANN is composed of :

an input layer

one or more hidden layer(s)

an output layer

In most applications, a single hidden layer is used with sigmoid activation functions.


Generalised Delta Rule

An important assumption made for the Delta Rule was the linearity of the activation function.

In a multi-layer ANN, this assumption no longer holds.

We must find a way to generalise the Delta Rule so that it doesn't restrict the weight adaptation to the output layer.


Derivation

Consider units with a non-linear activation function:

y_k^p = ϕ(net_k^p)

where

net_k^p = ∑_i w_ik y_i^p + θ_k


The modification we should apply to each weight is given by :

∆_p w_ik = −γ ∂E^p/∂w_ik

in which E^p, the total error, is defined by:

E^p = (1/2) ∑_{o=1}^{N_o} (d_o^p − y_o^p)^2

By using the chain rule we obtain :

∆_p w_ik = −γ (∂E^p/∂net_k^p) (∂net_k^p/∂w_ik)


The right derivative can be rewritten as :

∂net_k^p/∂w_ik = y_i^p

If we define

δ_k^p = −∂E^p/∂net_k^p

we obtain an update rule which is similar to the Delta Rule :

∆_p w_ik = γ δ_k^p y_i^p

The problem now is to define this δ_k^p for the different units k in the network.


By using the chain rule, we rewrite δ_k^p:

δ_k^p = −∂E^p/∂net_k^p = −(∂E^p/∂y_k^p) (∂y_k^p/∂net_k^p)

The right derivative can be rewritten as :

∂y_k^p/∂net_k^p = ϕ′(net_k^p)

For the left derivative, we must consider two cases :

k is an output unit o ;

δ_o^p = (d_o^p − y_o^p) ϕ′(net_o^p)


k is a hidden unit h ;

∂E^p/∂y_h^p = −∑_{o=1}^{N_o} δ_o^p w_ho

We can use this to write :

δ_h^p = ϕ′(net_h^p) ∑_{o=1}^{N_o} δ_o^p w_ho


The two equations :

δ_o^p = (d_o^p − y_o^p) ϕ′(net_o^p)    (1)

δ_h^p = ϕ′(net_h^p) ∑_{o=1}^{N_o} δ_o^p w_ho    (2)

define a recursive procedure which can be used to adjust the weights of the network. It constitutes the Generalised Delta Rule for a feed-forward network of non-linear units.
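A compact sketch of equations (1) and (2) for one hidden layer of sigmoid units (all names are illustrative and biases are omitted for brevity; this is not the code from the talk):

    # Sketch of the Generalised Delta Rule (backpropagation) for one hidden layer.
    import math

    def sigmoid(v):
        return 1.0 / (1.0 + math.exp(-v))

    def backprop_step(x, d, W_ih, W_ho, gamma=0.5):
        """One update for input x and desired output d (lists of floats).
        W_ih[i][h] and W_ho[h][o] are the input->hidden and hidden->output weights."""
        # Forward pass
        y_h = [sigmoid(sum(x[i] * W_ih[i][h] for i in range(len(x))))
               for h in range(len(W_ho))]
        y_o = [sigmoid(sum(y_h[h] * W_ho[h][o] for h in range(len(y_h))))
               for o in range(len(d))]
        # Equation (1): output deltas; for the sigmoid, ϕ'(net) = y (1 - y)
        delta_o = [(d[o] - y_o[o]) * y_o[o] * (1 - y_o[o]) for o in range(len(d))]
        # Equation (2): hidden deltas, back-propagated through the output weights
        delta_h = [y_h[h] * (1 - y_h[h]) *
                   sum(delta_o[o] * W_ho[h][o] for o in range(len(d)))
                   for h in range(len(y_h))]
        # Weight updates: ∆w = γ δ y
        for h in range(len(y_h)):
            for o in range(len(d)):
                W_ho[h][o] += gamma * delta_o[o] * y_h[h]
        for i in range(len(x)):
            for h in range(len(y_h)):
                W_ih[i][h] += gamma * delta_h[h] * x[i]
        return y_o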


Learning rate and Momentum

In order to have fast convergence with a stable solution, a momentum term is added to the variation of the weight.

∆w_jk(t + 1) = γ δ_k^p y_j^p + α ∆w_jk(t)

Instability of the solution is countered because the change in the weights depends on the previous change. It is possible to increase the learning rate γ without causing oscillation in the solution.
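As a sketch (illustrative names), the momentum update only requires remembering the previous change per weight:

    # Sketch of one momentum update for a single weight.
    def momentum_update(w, prev_dw, delta_k, y_j, gamma=0.5, alpha=0.9):
        dw = gamma * delta_k * y_j + alpha * prev_dw   # ∆w(t+1) = γ δ y + α ∆w(t)
        return w + dw, dw                              # new weight, new "previous change"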


[Figure: a) γ → 0, b) γ → 1, c) γ → 1 with a momentum term added]


Deficiencies

Network paralysis : as the network is trained, the weights can increase to very high values (either positive or negative), and so does net_j. Because of the sigmoid function, the activation will then be very close to zero or very close to one, where the derivative of the sigmoid is almost zero. In that case, the back-propagation algorithm may come to a standstill.

Local minima : because of the shape of the error function for a complex network, the gradient method can find itself trapped in a local minimum. Some (probabilistic) methods can avoid this problem but are very slow. It is also possible to increase the number of hidden units, without going beyond a certain threshold.


Jordan Network

Proposed by Jordan in 1986.

Activation values of the output units are fed back into the input layer through so-called state units.

The weights from the output units to these state units are fixed to +1.

Thus, the learning rules which apply to multi-layer ANNs can be used to train Jordan Networks.
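A minimal sketch of one time step of such a network (forward_net stands for any feed-forward ANN and is an assumption, not something defined in the slides):

    # One time step of a Jordan network: the previous outputs are copied into
    # state units and presented as extra inputs (fixed +1 feedback connections).
    def jordan_step(x, state, forward_net):
        y = forward_net(x + state)   # inputs plus the copied previous outputs
        return y, list(y)            # new output, new state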


Hopfield Network

Proposed by J. Hopfield in 1982.

Consists of a fully interconnected network of N neurons which are both input and output.

Updates are made asynchronously and independently.

Activation values are binary (+1 / -1).

Can be used as an associative memory or for optimisation problems (travelling salesman problem).
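A small sketch of the associative-memory use (Hebbian storage of ±1 patterns and asynchronous updates; the helper names are illustrative):

    # Sketch of a Hopfield network used as an associative memory.
    import random

    def store(patterns, n):
        """Hebbian weights for ±1 patterns of length n (zero diagonal)."""
        W = [[0.0] * n for _ in range(n)]
        for p in patterns:
            for i in range(n):
                for j in range(n):
                    if i != j:
                        W[i][j] += p[i] * p[j] / len(patterns)
        return W

    def recall(W, state, steps=200):
        """Asynchronous, independent updates of randomly chosen units."""
        s = list(state)
        n = len(s)
        for _ in range(steps):
            i = random.randrange(n)
            s[i] = 1 if sum(W[i][j] * s[j] for j in range(n)) >= 0 else -1
        return s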


[Figure: Hopfield "Auto-Associator"]


Smart sweepers

Objective: train minesweepers to pick up mines in a 2-dimensional field.

Parameters of the network :

Topology : Feed-forward multi-layer ANN.

4 input units

6 hidden units on one layer

2 output units

Activation function : sigmoid function.

Learning rule : genetic algorithm.


Input is composed of two vectors :

Vector defining the direction of the closest mine.

Vector defining the direction towards which the minesweeper is pointing.

Output is composed of two components: the speeds of the left and right tracks.
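A sketch of how such a 4-6-2 network could map the two direction vectors to track speeds (names and weight shapes are assumptions; in the talk the weights are evolved by the genetic algorithm rather than trained):

    # Sketch of the sweeper controller: 4 inputs -> 6 sigmoid hidden units -> 2 outputs.
    import math

    def sigmoid(v):
        return 1.0 / (1.0 + math.exp(-v))

    def layer(inputs, weights, biases):
        return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
                for row, b in zip(weights, biases)]

    def sweeper_outputs(to_mine, heading, W_hid, b_hid, W_out, b_out):
        """to_mine, heading: 2-D vectors; returns (left_speed, right_speed)."""
        x = list(to_mine) + list(heading)       # 4 inputs
        h = layer(x, W_hid, b_hid)              # 6 hidden units
        return tuple(layer(h, W_out, b_out))    # 2 outputs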


Each minesweeper has its own set of weights.

The ANN works for a certain amount of time T. During this period of time, each mine found increases the fitness of the sweeper.

Afterwards, the genetic algorithm runs to create the new generation of weight sets.


Conclusion

Single-layer ANN :

Limited representational power: restricted to linear classification

Linearity of the system ⇒ convergence to the optimal solution (optimal weight vector)

Multi-layer ANN :

Unlimited representational power: can model non-linear problems

Non-linearity doesn't guarantee convergence to an optimal solution


The choice of a representative learning sample is essential to obtain the "expected" behavior and to avoid over-fitting.

Combination with other approaches can prove effective (genetic algorithms, for instance).


References

B. Kröse & P. van der Smagt, An Introduction to Neural Networks, eighth edition.

S. Singh, Neural Network Recognition of Hand-printed Characters.
