CS407 Neural Computation Lecture 4: Single Layer Perceptron (SLP) Classifiers Lecturer: A/Prof. M. Bennamoun


Page 1: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

CS407 Neural Computation

Lecture 4: Single Layer Perceptron (SLP)

Classifiers

Lecturer: A/Prof. M. Bennamoun

Page 2: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Outline
What’s an SLP and what is classification?

Limitation of a single perceptron.

Foundations of classification and Bayes Decision making theory

Discriminant functions, linear machine and minimum distance classification

Training and classification using the Discrete perceptron

Single-Layer Continuous perceptron Networks for linearly separable classifications

Appendix A: Unconstrained optimization techniques

Appendix B: Perceptron Convergence proof

Suggested reading and references

Page 3: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

What is a perceptron and what is a Single Layer Perceptron (SLP)?

Page 4: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Perceptron
The simplest form of a neural network: a single neuron with adjustable synaptic weights and bias.
It performs pattern classification with only two classes.

Perceptron convergence theorem:
– Patterns (vectors) are drawn from two linearly separable classes.
– During training, the perceptron algorithm converges and positions the decision surface, in the form of a hyperplane, between the two classes by adjusting the synaptic weights.

Page 5: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

What is a perceptron?

[Figure: perceptron model — input signals x1, x2, …, xm with synaptic weights wk1, wk2, …, wkm, a summing junction Σ with bias bk, an activation function ϕ(·), and output yk]

v_k = Σ_{j=1}^{m} w_kj x_j + b_k,   y_k = ϕ(v_k)

Discrete perceptron: ϕ(·) = sign(·)
Continuous perceptron: ϕ(·) = S-shaped (sigmoid)
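To make the notation concrete, here is a minimal sketch (not from the original slides) of this computation; the input, weights and bias are arbitrary illustrative numbers, and the continuous branch uses a bipolar sigmoid as one possible S-shaped choice:

```python
import numpy as np

def perceptron_output(x, w, b, discrete=True):
    """Compute y = phi(v) with v = w.x + b for a single perceptron."""
    v = np.dot(w, x) + b                      # summing junction
    if discrete:
        return np.sign(v)                     # discrete perceptron: phi = sign
    return 2.0 / (1.0 + np.exp(-v)) - 1.0     # continuous perceptron: bipolar sigmoid (S-shaped)

# Illustrative values (not from the lecture):
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.4, 0.1, -0.3])
print(perceptron_output(x, w, b=0.2))                     # +1.0 or -1.0
print(perceptron_output(x, w, b=0.2, discrete=False))     # a value between -1 and +1
```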

Page 6: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Activation Function of a perceptron

[Figure: activation functions plotted against v — the signum (hard-limiting) function switching between −1 and +1, and an S-shaped (sigmoid) curve saturating at +1]

Signum function (sign)

Discrete perceptron: ϕ(·) = sign(·)
Continuous perceptron: ϕ(v) = S-shaped (sigmoid)

Page 7: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

SLP Architecture

Single layer perceptron

Input layer Output layer

Page 8: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Where are we heading? Different Non-Linearly Separable Problems http://www.zsolutions.com/light.htm

Structure / Types of decision regions:
– Single-layer: half plane bounded by a hyperplane
– Two-layer: convex open or closed regions
– Three-layer: arbitrary (complexity limited by the number of nodes)

(For each structure, the original table also illustrates its behaviour on the exclusive-OR problem, on classes with meshed regions, and the most general region shapes it can form, using two classes A and B.)

Page 9: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Review from last lectures:

Page 10: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Implementing Logic Gates with Perceptrons http://www.cs.bham.ac.uk/~jxb/NN/l3.pdf

We can use the perceptron to implement the basic logic gates (AND, OR and NOT).

All we need to do is find the appropriate connection weights and neuron thresholds to produce the right outputs for each set of inputs.

We saw how we can construct simple networks that perform NOT, AND, and OR.

It is then a well known result from logic that we can construct any logical function from these three operations.

The resulting networks, however, will usually have a much more complex architecture than a simple Perceptron.

We generally want to avoid decomposing complex problems into simple logic gates, by finding the weights and thresholds that work directly in a Perceptron architecture.

Page 11: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Implementation of Logical NOT, AND, and OR
In each case we have inputs in_i and an output out, and need to determine the weights and thresholds. It is easy to find solutions by inspection:
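As a sketch of what "solutions by inspection" look like in practice (the particular weights and thresholds below are common textbook choices, assumed rather than taken from the original slide):

```python
def tlu(inputs, weights, theta):
    """Threshold logic unit: output 1 if the weighted sum reaches theta, else 0."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) >= theta else 0

# One possible set of weights/thresholds found by inspection (illustrative):
NOT = lambda a: tlu([a], [-1.0], theta=-0.5)       # fires only when a = 0
AND = lambda a, b: tlu([a, b], [1.0, 1.0], theta=1.5)
OR  = lambda a, b: tlu([a, b], [1.0, 1.0], theta=0.5)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "AND:", AND(a, b), "OR:", OR(a, b), "NOT a:", NOT(a))
```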

Page 12: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

The Need to Find Weights Analytically
Constructing simple networks by hand is one thing. But what about harder problems? For example, what about XOR?

How long do we keep looking for a solution? We need to be able to calculate appropriate parameters rather than looking for solutions by trial and error.

Each training pattern produces a linear inequality for the output in terms of the inputs and the network parameters. These can be used to compute the weights and thresholds.

Page 13: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Finding Weights Analytically for the AND Network

We have two weights w1 and w2 and the threshold θ, and for each training pattern we need the weighted sum w1·in1 + w2·in2 to reach θ exactly when the AND output is 1.

So the training data lead to four inequalities:
w1·0 + w2·0 − θ < 0,  w1·0 + w2·1 − θ < 0,  w1·1 + w2·0 − θ < 0,  w1·1 + w2·1 − θ ≥ 0
(for example, w1 = w2 = 1 with θ = 1.5 satisfies all four).

It is easy to see that there are an infinite number of solutions. Similarly, there are an infinite number of solutions for the NOT and OR networks.

Page 14: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Limitations of Simple Perceptrons

We can follow the same procedure for the XOR network, which gives the four inequalities:
w1·0 + w2·0 − θ < 0,  w1·0 + w2·1 − θ ≥ 0,  w1·1 + w2·0 − θ ≥ 0,  w1·1 + w2·1 − θ < 0

Clearly the second and third inequalities are incompatible with the fourth (adding the second and third gives w1 + w2 ≥ 2θ, which together with the fourth forces θ < 0, contradicting the first), so there is in fact no solution. We need more complex networks, e.g. ones that combine together many simple networks, or use different activation/thresholding/transfer functions.

It then becomes much more difficult to determine all the weights and thresholds by hand.

These weights are instead adapted using learning rules. Hence the need to consider learning rules (see previous lecture), and more complex architectures.

Page 15: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

E.g. Decision Surface of a Perceptron

[Figure: two scatter plots in the (x1, x2) plane — a linearly separable set of + and − points that a single straight line separates, and a non-linearly separable (XOR-like) arrangement that no single line can separate]

• Perceptron is able to represent some useful functions
• But functions that are not linearly separable (e.g. XOR) are not representable

Page 16: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

What is classification?

Page 17: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Classification? http://140.122.185.120

Pattern classification/recognition
- Assign the input data (a physical object, event, or phenomenon) to one of the pre-specified classes (categories)

The block diagram of the recognition and classification system

Page 18: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Classification: an example

• Automate the process of sorting incoming fish on a conveyor belt according to species (Salmon or Sea bass).

Set up a camera, take some sample images, and note the physical differences between the two types of fish:

Length, lightness, width, number & shape of fins, position of the mouth

http://webcourse.technion.ac.il/236607/Winter2002-2003/en/ho.htm — Duda & Hart, Chapter 1

Page 19: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Classification an example…

Page 20: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Classification: an example…

• Cost of misclassification: depends on application Is it better to misclassify salmon as bass or vice versa?

Put salmon in a can of bass → lose profit. Put bass in a can of salmon → lose the customer.

There is a cost associated with our decision. Make a decision to minimize a given cost.

• Feature Extraction: problem- and domain-dependent; requires knowledge of the domain. A good feature extractor would make the job of the classifier trivial.

Page 21: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Bayesian decision theory

Page 22: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Bayesian Decision Theory
http://webcourse.technion.ac.il/236607/Winter2002-2003/en/ho.html — Duda & Hart, Chapter 2

Bayesian decision theory is a fundamental statistical approach to the problem of pattern classification.

Decision making when all the probabilistic information is known. For given probabilities the decision is optimal. When new information is added, it is assimilated in optimal fashion for improvement of decisions.

Page 23: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Bayesian Decision Theory …

Fish Example: Each fish is in one of 2 states: sea bass or salmon. Let ω denote the state of nature:

ω = ω1 for sea bass
ω = ω2 for salmon

Page 24: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Bayesian Decision Theory …
The state of nature is unpredictable, so ω is a variable that must be described probabilistically. If the catch produced as much salmon as sea bass, the next fish is equally likely to be sea bass or salmon. Define:

P(ω1): a priori probability that the next fish is sea bass
P(ω2): a priori probability that the next fish is salmon

Page 25: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Bayesian Decision Theory …

If other types of fish are irrelevant: P( ω1 ) + P( ω2 ) = 1.

Prior probabilities reflect our prior knowledge (e.g. time of year, fishing area, …).

Simple decision rule:
Make a decision without seeing the fish: decide ω1 if P(ω1) > P(ω2); ω2 otherwise.
This is OK if deciding for one fish; if there are several fish, all are assigned to the same class.

Page 26: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Bayesian Decision Theory ...

In general, we will have some features and more information.
Feature: lightness measurement = x

Different fish yield different lightness readings (x is a random variable)

Page 27: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Bayesian Decision Theory ….

Define

p(x|ω1) = class-conditional probability density: the probability density function for x given that the state of nature is ω1.

The difference between p(x|ω1 ) and p(x|ω2 ) describes the difference in lightness between sea bass and salmon.

Page 28: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Class conditioned probability density: p(x|ω)

Hypothetical class-conditional probability density functions. The densities are normalized (the area under each curve is 1.0).

Page 29: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Suppose that we know the prior probabilities P(ω1) and P(ω2) and the conditional densities p(x|ω1) and p(x|ω2), and we measure the lightness of a fish: x.

What is the category of the fish?

Bayesian Decision Theory ...

Page 30: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Bayes Formula
Given
– Prior probabilities P(ωj)
– Conditional densities p(x|ωj)
– Measurement of a particular item: feature value x

Bayes formula:
P(ωj|x) = p(x|ωj) P(ωj) / p(x)
(from p(x, ωj) = p(x|ωj) P(ωj) = P(ωj|x) p(x)), where
p(x) = Σ_i p(x|ωi) P(ωi), so that Σ_i P(ωi|x) = 1

Posterior = (Likelihood × Prior) / Evidence

Page 31: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Bayes' formula ...

• p(x|ωj ) is called the likelihood of ωj with respect to x.

(the ωj category for which p(x|ωj ) is large is more "likely" to be the true category)

• p(x) is the evidence: how frequently we will measure a pattern with feature value x. It is a scale factor that guarantees that the posterior probabilities sum to 1.

Page 32: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Posterior Probability

Posterior probabilities for the particular priors P(ω1)=2/3 and P(ω2)=1/3. At every x the posteriors sum to 1.

Page 33: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Error

P(error|x) = P(ω2|x) if we decide ω1
P(error|x) = P(ω1|x) if we decide ω2

For a given x, we can minimize the probability of error by deciding ω1 if P(ω1|x) > P(ω2|x) and ω2 otherwise.

Page 34: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Bayes' Decision Rule (minimizes the probability of error)

Decide ω1 if P(ω1|x) > P(ω2|x); decide ω2 otherwise.

Equivalently: decide ω1 if p(x|ω1) P(ω1) > p(x|ω2) P(ω2); decide ω2 otherwise.

Then P(error|x) = min[P(ω1|x), P(ω2|x)].

Likelihood ratio form:
P(ω1|x) ≷ P(ω2|x)  ⇔  p(x|ω1) P(ω1) ≷ p(x|ω2) P(ω2)  ⇔  p(x|ω1) / p(x|ω2) ≷ P(ω2) / P(ω1)
(decide ω1 for >, ω2 for <; the prior ratio P(ω2)/P(ω1) acts as a threshold on the likelihood ratio)
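As an illustration of the rule (the Gaussian densities, their parameters and the priors below are assumptions made for the sketch, not values from the lecture), the posterior comparison and its likelihood-ratio form can be coded directly:

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    """Univariate Gaussian density."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Assumed (illustrative) class-conditional densities and priors:
P1, P2 = 2 / 3, 1 / 3                              # priors, as in the posterior plot example
px_w1 = lambda x: gauss_pdf(x, mu=7.0, sigma=1.5)  # omega1 (sea bass) lightness density
px_w2 = lambda x: gauss_pdf(x, mu=4.0, sigma=1.0)  # omega2 (salmon) lightness density

def decide(x):
    """Bayes rule: pick the class with the larger p(x|w)P(w) (posterior up to p(x))."""
    return 1 if px_w1(x) * P1 > px_w2(x) * P2 else 2

def decide_lr(x):
    """Equivalent likelihood-ratio form: omega1 if p(x|w1)/p(x|w2) > P(w2)/P(w1)."""
    return 1 if px_w1(x) / px_w2(x) > P2 / P1 else 2

for x in (3.0, 5.0, 8.0):
    print(x, decide(x), decide_lr(x))   # the two forms always agree
```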

Page 35: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Decision Boundaries

Classification as division of feature space into non-overlapping regions

Boundaries between these regions are known as decision surfaces or decision boundaries

X1, …, XR such that x ∈ Xk ⇔ x is assigned to ωk

Page 36: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Optimum decision boundaries
Criterion:
– minimize misclassification
– maximize correct classification

Classify x ∈ Xk if p(x|ωk) P(ωk) > p(x|ωj) P(ωj) for all j ≠ k,
i.e. if P(ωk|x) > P(ωj|x) for all j ≠ k (maximum posterior probability).

P(correct) = Σ_{k=1}^{R} P(x ∈ Xk, ωk) = Σ_{k=1}^{R} P(x ∈ Xk | ωk) P(ωk)    (here R = 2)

Page 37: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Discriminant functions

Discriminant functions determine classification by comparison of their values:

Optimum classification: based on the posterior probability P(ωk|x). Any monotone function g may be applied without changing the decision boundaries:

Classify x ∈ Xk if gk(x) > gj(x) for all j ≠ k

gk(x) = g(P(ωk|x)),  e.g. gk(x) = ln(P(ωk|x))

Page 38: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

The Two-Category Case
Use 2 discriminant functions g1 and g2, and assign x to ω1 if g1 > g2.
Alternative: define a single discriminant function g(x) = g1(x) − g2(x); decide ω1 if g(x) > 0, otherwise decide ω2.

For the two-category case:
g(x) = P(ω1|x) − P(ω2|x)
g(x) = ln[p(x|ω1)/p(x|ω2)] + ln[P(ω1)/P(ω2)]

Page 39: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Summary

Bayes approach:– Estimate class-conditioned probability density– Combine with prior class probability– Determine posterior class probability– Derive decision boundaries

Alternate approach implemented by NN– Estimate posterior probability directly– i.e. determine decision boundaries directly

Page 40: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

DISCRIMINANT FUNCTIONS

Page 41: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Discriminant Functions http://140.122.185.120

Determine the membership of x in a category by comparing the R discriminant functions g1(x), g2(x), …, gR(x):

– x is within the region Xk if gk(x) has the largest value.

Do not confuse n = the dimension of each input vector (dimension of the feature space), P = the number of input vectors, and R = the number of classes.

Page 42: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Discriminant Functions…

Page 43: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Discriminant Functions…

Page 44: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Discriminant Functions…

Page 45: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Discriminant Functions…

Page 46: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Discriminant Functions…

Page 47: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Linear Machine and Minimum Distance Classification

• Find the linear-form discriminant function for two-class classification when the class prototypes are known

• Example 3.1: Select the decision hyperplane that contains the midpoint of the line segment connecting the center points of the two classes

Page 48: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Linear Machine and Minimum Distance Classification… (dichotomizer)

• The dichotomizer's discriminant function g(x):
g(x) = (x1 − x2)^t x + ½ (‖x2‖² − ‖x1‖²)
where x1 and x2 are the class prototypes (center points).

Page 49: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Linear Machine and Minimum Distance Classification… (multiclass classification)

• The linear-form discriminant functions for multiclass classification

– There are up to R(R−1)/2 decision hyperplanes for R pairwise separable classes
(i.e. classes next to or touching one another)

Page 50: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Linear Machine and Minimum Distance Classification… (multiclass classification)
• Linear machine or minimum-distance classifier
– Assume the class prototypes are known for all classes

• Euclidean distance between input pattern x and the center of class i, xi:
‖x − xi‖ = [(x − xi)^t (x − xi)]^{1/2}
Minimizing ‖x − xi‖ is equivalent to maximizing the discriminant function gi(x) = xi^t x − ½ xi^t xi.

Page 51: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Linear Machine and Minimum Distance Classification… (multiclass classification)

Page 52: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Linear Machine and Minimum Distance Classification…

Note: to find S12 we need to compute (g1 − g2).

P1, P2, P3 are the centres of gravity of the prototype points; we need to design a minimum-distance classifier. Using the formulas from the previous slide, we get the weights wi.
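As a sketch of this construction (the prototype coordinates below are made-up placeholders, not the points of the worked example), a linear machine can be built directly from the class centers:

```python
import numpy as np

# Assumed class prototypes (centers of gravity), one row per class -- illustrative only.
prototypes = np.array([[2.0, 5.0],
                       [-1.0, -3.0],
                       [4.0, -2.0]])

def linear_machine_weights(prototypes):
    """For class i: g_i(x) = x_i^t x - 0.5 x_i^t x_i  (weights w_i = x_i, bias -0.5||x_i||^2)."""
    w = prototypes
    b = -0.5 * np.sum(prototypes ** 2, axis=1)
    return w, b

def classify(x, w, b):
    """Maximizing g_i(x) is equivalent to choosing the nearest prototype in Euclidean distance."""
    g = w @ x + b
    return int(np.argmax(g)) + 1   # class index 1..R

w, b = linear_machine_weights(prototypes)
print(classify(np.array([3.0, 4.0]), w, b))   # -> 1 (nearest to the prototype [2, 5])
```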

Page 53: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Linear Machine and Minimum Distance Classification…

• If R linear discriminant functions exist for a set of patterns such that
gi(x) > gj(x) for x ∈ Class i;  i, j = 1, 2, …, R;  j ≠ i

• then the classes are linearly separable.

Page 54: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Linear Machine and Minimum Distance Classification… Example:

Page 55: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Linear Machine and Minimum Distance Classification… Example…

Page 56: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Linear Machine and Minimum Distance Classification…

•Examples 3.1 and 3.2 have shown that the coefficients (weights) of the linear discriminant functions can be determined if the a priori information about the sets of patterns and their class membership is known

•In the next section (Discrete perceptron) we will examine neural networks that derive their weights during the learning cycle.

Page 57: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Linear Machine and Minimum Distance Classification…
• An example of linearly non-separable patterns

Page 58: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Linear Machine and Minimum Distance Classification…

Input space (x) → Image space (o), with
o1 = sgn(x1 + x2 + 1)
o2 = sgn(−x1 − x2 + 1)

x1   x2   o1   o2
−1   −1   −1    1
−1    1    1    1
 1   −1    1    1
 1    1    1   −1

The two inputs (−1, 1) and (1, −1) map to the same point (1, 1) in the image space.

Page 60: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

The Discrete Perceptron

Page 61: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Discrete Perceptron Training Algorithm

• So far, we have shown that coefficients of linear discriminant functions called weights can be determined based on a priori information about sets of patterns and their class membership.

•In what follows, we will begin to examine neural network classifiers that derive their weights during the learning cycle.

•The sample pattern vectors x1, x2, …, xp, called the training sequence, are presented to the machine along with the correct response.

Page 62: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Discrete Perceptron Training Algorithm- Geometrical Representations http://140.122.185.120

Zurada, Chapter 3

(Intersects the origin point w=0)

5 prototype patterns in this case: y1, y2, …, y5. If the dimension of the augmented pattern vector is > 3, our powers of visualization are no longer of assistance; in this case, the only recourse is to use the analytical approach.

Page 63: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Discrete Perceptron Training Algorithm- Geometrical Representations…•Devise an analytic approach based on the geometrical representations

– E.g. the decision surface for the training pattern y1 (in the weight space):

If y1 is in Class 1 (and is misclassified):  w' = w1 + c·y1
If y1 is in Class 2 (and is misclassified):  w' = w1 − c·y1

The gradient ∇_w(w^t y1) = y1 gives the direction of steepest increase, so the second case is a correction in the negative gradient direction (see previous slide).

c (> 0) is the correction increment (it is two times the learning constant ρ introduced before); c controls the size of the adjustment.

Page 64: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Discrete Perceptron Training Algorithm- Geometrical Representations…

Page 65: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Discrete Perceptron Training Algorithm- Geometrical Representations…

c = |w^t y| / (y^t y) = p / ‖y‖, where p is the distance from the current weight vector to the hyperplane of the current pattern (here y1).

Note 1: p = distance, so p > 0.
Note 2: c is not constant; it depends on the current training pattern, as expressed by the equation above.

Page 66: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Discrete Perceptron Training Algorithm- Geometrical Representations…

•For fixed correction rule: c=constant, the correction of weights is always the same fixed portion of the current training vector

– The weight can be initialised at any value

•For dynamic correction rule: c depends on the distance from the weight (i.e. the weight vector) to the decision surface in the weight space. Hence

– The initial weight should be different from 0 (if w1 = 0, then c·y = 0 and w' = w1 + c·y = 0, therefore no adjustment is possible). Here y is the current input pattern and w1 the current weight.

Page 67: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Discrete Perceptron Training Algorithm- Geometrical Representations…

•Dynamic correction rule: Using the value of c from previous slide as a reference, we devise an adjustment technique which depends on the length w2-w1

Note: λ is the ratio of the distance between the old weight vector w1 and the new w2, to the distance from w1 to the pattern hyperplane.

λ=2: Symmetrical reflection w.r.t decision plane

λ=0: No weight adjustment

Page 68: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Discrete Perceptron Training Algorithm- Geometrical Representations…•Example:

x1 = 1, x3 = 3, d1 = d3 = 1 : class 1
x2 = −0.5, x4 = −2, d2 = d4 = −1 : class 2

• The augmented input vectors are:
y1 = [1, 1]^t,  y2 = [−0.5, 1]^t,  y3 = [3, 1]^t,  y4 = [−2, 1]^t

• The decision lines w^t yi = 0, for i = 1, 2, 3, 4, are sketched in the augmented weight space as follows:

Page 69: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Discrete Perceptron Training Algorithm- Geometrical Representations…

Page 70: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Discrete Perceptron Training Algorithm- Geometrical Representations…

For c = 1 and w1 = [−2.5, 1.75]^t:

• The weight training at each step can be summarized as w' = w ± c·y, i.e.
Δw_k = (c/2) [d_k − sgn(w_k^t y_k)] y_k

• We obtain the following outputs and weight updates:

• Step 1: Pattern y1 is input
o1 = sgn([−2.5  1.75] [1  1]^t) = −1,  d1 − o1 = 2
w2 = w1 + y1 = [−1.5, 2.75]^t

Page 71: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Discrete Perceptron Training Algorithm- Geometrical Representations…

• Step 2: Pattern y2 is input
o2 = sgn([−1.5  2.75] [−0.5  1]^t) = 1,  d2 − o2 = −2
w3 = w2 − y2 = [−1, 1.75]^t

• Step 3: Pattern y3 is input
o3 = sgn([−1  1.75] [3  1]^t) = −1,  d3 − o3 = 2
w4 = w3 + y3 = [2, 2.75]^t

Page 72: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Discrete Perceptron Training Algorithm- Geometrical Representations…

• Since we have no evidence of correct classification of weight w4, the training set, consisting of the ordered sequence of patterns y1, y2, y3, y4, needs to be recycled. We thus have y5 = y1, y6 = y2, etc. (the superscript denotes the training step number).

• Steps 4, 5: w6 = w5 = w4 (no misclassification, thus no weight adjustments).

• You can check that the adjustments in steps 6 through 10 are as follows:
w7 = [2.5, 1.75]^t
w10 = w9 = w8 = w7
w11 = [3, 0.75]^t

w11 is in the solution area.
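The hand computation above can be checked with a short script. This is an illustrative sketch of the fixed-correction rule Δw_k = (c/2)[d_k − sgn(w_k^t y_k)] y_k with c = 1, cycling through the four augmented patterns until a full pass makes no correction:

```python
import numpy as np

# Augmented patterns y = [x, 1] and desired responses d, as in the example above.
Y = np.array([[1.0, 1.0], [-0.5, 1.0], [3.0, 1.0], [-2.0, 1.0]])
d = np.array([1.0, -1.0, 1.0, -1.0])

w = np.array([-2.5, 1.75])   # initial weight vector w1
c = 1.0

while True:
    corrections = 0
    for y_k, d_k in zip(Y, d):
        o = 1.0 if w @ y_k > 0 else -1.0          # TLU output sgn(w^t y)
        if o != d_k:                               # adjust only on misclassification
            w = w + 0.5 * c * (d_k - o) * y_k      # = w + c*y (class 1) or w - c*y (class 2)
            corrections += 1
    if corrections == 0:                           # a full pass with no corrections -> stop
        break

print("final weights:", w)   # matches the hand calculation: [3, 0.75]
```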

Page 73: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

The Continuous Perceptron

Page 74: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Continuous Perceptron Training Algorithm
http://140.122.185.120 — Zurada, Chapter 3

•Replace the TLU (Threshold Logic Unit) with the sigmoid activation function for two reasons:

– Gain finer control over the training procedure

– Facilitate the differential characteristics to enable computation of the error gradient

(of current error function)

The factor ½ does not affect the location of the error minimum

Page 75: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Continuous Perceptron Training Algorithm…

•The new weights is obtained by moving in the direction of the negative gradient along the multidimensional error surface

By definition of the steepest descent concept, each elementary move should be perpendicular to the current error contour.

Page 76: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Continuous Perceptron Training Algorithm…•Define the error as the squared difference between the desired output and the actual output

Since net = w^t y, we have ∂(net)/∂w_i = y_i,  i = 1, 2, …, n+1.

This gives the training rule of the continuous perceptron (equivalent to the delta training rule).
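For concreteness, here is a minimal sketch (my own illustration, not taken from the slides) of the resulting delta-rule update using the bipolar sigmoid o = 2/(1 + e^(−λ·net)) − 1 that appears in the example on the following slides; the learning constant η, the number of epochs and the reuse of the earlier 1-D patterns are arbitrary choices:

```python
import numpy as np

def bipolar_sigmoid(net, lam=1.0):
    return 2.0 / (1.0 + np.exp(-lam * net)) - 1.0

def delta_rule_step(w, y, d, eta=0.5, lam=1.0):
    """One continuous-perceptron (delta rule) update: w <- w + eta*(d - o)*f'(net)*y."""
    net = w @ y
    o = bipolar_sigmoid(net, lam)
    fprime = 0.5 * lam * (1.0 - o ** 2)     # derivative of the bipolar sigmoid
    return w + eta * (d - o) * fprime * y

# Same augmented 1-D patterns as the discrete example (illustrative reuse):
Y = np.array([[1.0, 1.0], [-0.5, 1.0], [3.0, 1.0], [-2.0, 1.0]])
d = np.array([1.0, -1.0, 1.0, -1.0])
w = np.array([-2.5, 1.75])

for _ in range(200):                         # a few hundred cyclic presentations
    for y_k, d_k in zip(Y, d):
        w = delta_rule_step(w, y_k, d_k)
print(w)   # inspect the learned weights (they should move toward separating the two classes)
```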

Page 77: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Continuous Perceptron Training Algorithm…

Page 78: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Continuous Perceptron Training Algorithm…Same as previous example (of discrete perceptron) but with a continuous activation function and using the delta rule.

Same training pattern set as discrete perceptron example

Page 79: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Continuous Perceptron Training Algorithm…

E_k = ½ [d_k − (2 / (1 + exp(−λ·net_k)) − 1)]²

For the first pattern (y1, d1 = 1):
E_1(w) = ½ [1 − (2 / (1 + exp(−λ(w1 + w2))) − 1)]²

Setting λ = 1 and reducing terms simplifies this expression to the following form:
E_1(w) = 2 / [1 + exp(w1 + w2)]²

Similarly,
E_2(w) = 2 / [1 + exp(0.5 w1 − w2)]²
E_3(w) = 2 / [1 + exp(3 w1 + w2)]²
E_4(w) = 2 / [1 + exp(2 w1 − w2)]²

These error surfaces are as shown on the previous slide.

Page 80: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Continuous Perceptron Training Algorithm…

[Figure: error surface over the (w1, w2) weight plane, with the minimum marked]

Page 81: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Multicategory SLP

Page 82: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Multi-category Single layer Perceptron nets
• Treat the last fixed component of the input pattern vector as the neuron activation threshold: T = w_{n+1}

y_{n+1} = −1 (it is irrelevant whether it is set to +1 or −1)

Page 83: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Multi-category Single layer Perceptron nets…
• R-category linear classifier using R discrete bipolar perceptrons

– Goal: the i-th TLU responds with +1 for a pattern of class i, and all other TLUs respond with −1

Page 84: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Multi-category Single layer Perceptron nets…•Example 3.5

should be (−1, −1, 1)^t

Indecision regions = regions where no class membership of an input pattern can be uniquely determined based on the response of the classifier (patterns in shaded areas are not assigned any reasonable classification. E.g. point Q for which o=[1 1 –1]t => indecisive response). However no patterns such as Q have been used for training in the example.

Page 85: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Multi-category Single layer Perceptron nets…
For c = 1 and initial weights w_1^1 = [1, −2, 0]^t, w_2^1 = [0, −1, 2]^t, w_3^1 = [1, 3, −1]^t

• Step 1: Pattern y1 = [10, 2, −1]^t is input
sgn([1  −2  0] [10  2  −1]^t) = 1
sgn([0  −1  2] [10  2  −1]^t) = −1
sgn([1  3  −1] [10  2  −1]^t) = 1  (incorrect)

Since the only incorrect response is provided by TLU3, we have
w_1^2 = w_1^1,  w_2^2 = w_2^1,  w_3^2 = w_3^1 − y1 = [−9, 1, 0]^t

Page 86: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Multi-category Single layer Perceptron nets…

• Step 2: Pattern y2 = [2, −5, −1]^t is input
sgn([1  −2  0] [2  −5  −1]^t) = 1  (incorrect)
sgn([0  −1  2] [2  −5  −1]^t) = 1
sgn([−9  1  0] [2  −5  −1]^t) = −1

Since the only incorrect response is now provided by TLU1, we have
w_1^3 = w_1^2 − y2 = [−1, 3, 1]^t,  w_2^3 = w_2^2,  w_3^3 = w_3^2

Page 87: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Multi-category Single layer Perceptron nets…

• Step 3: Pattern y3 = [−5, 5, −1]^t is input
sgn((w_1^3)^t y3) = 1  (incorrect)
sgn((w_2^3)^t y3) = −1
sgn((w_3^3)^t y3) = 1

w_1^4 = w_1^3 − y3 = [4, −2, 2]^t,  w_2^4 = w_2^3,  w_3^4 = w_3^3

One can verify that the only adjusted weights from now on are those of TLU1.

• During the second cycle:
w_1^5 = w_1^4,  w_1^6 = w_1^5 − y2 = [2, 3, 3]^t,  w_1^7 = w_1^6 − y3 = [7, −2, 4]^t

and then
w_1^8 = w_1^7,  w_1^9 = w_1^8 − y2 = [5, 3, 5]^t
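The same kind of check can be scripted for the R-category case. The sketch below uses the prototype patterns and initial weights reconstructed above and trains each of the three TLUs independently with the fixed-correction rule (treat it as illustrative):

```python
import numpy as np

# Augmented prototype patterns (one per class) and initial TLU weights, as in the example.
Y = np.array([[10.0, 2.0, -1.0],   # class 1
              [2.0, -5.0, -1.0],   # class 2
              [-5.0, 5.0, -1.0]])  # class 3
W = np.array([[1.0, -2.0, 0.0],    # w1
              [0.0, -1.0, 2.0],    # w2
              [1.0, 3.0, -1.0]])   # w3
c = 1.0

for cycle in range(10):
    changed = False
    for k, y in enumerate(Y):
        d = -np.ones(3)
        d[k] = 1.0                             # desired: +1 for the k-th TLU, -1 elsewhere
        o = np.where(W @ y > 0, 1.0, -1.0)     # responses of the R TLUs
        for i in range(3):
            if o[i] != d[i]:                   # adjust only the TLUs that respond incorrectly
                W[i] += 0.5 * c * (d[i] - o[i]) * y
                changed = True
    if not changed:
        break

print(W)   # the first row settles at [5, 3, 5], matching the hand calculation above
```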

Page 88: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Multi-category Single layer Perceptron nets…

•R-category linear classifier using R continuous bipolar perceptrons

Page 89: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Comparison between Perceptron and Bayes' Classifier
– The perceptron operates on the premise that the patterns to be classified are linearly separable (otherwise the training algorithm will oscillate), while the Bayes classifier can also work on non-separable patterns.
– The Bayes classifier minimizes the probability of misclassification, which is independent of the underlying distribution.
– The Bayes classifier is a linear classifier under the assumption of Gaussianity.
– The perceptron is non-parametric, while the Bayes classifier is parametric (its derivation is contingent on the assumption of the underlying distributions).
– The perceptron is adaptive and simple to implement; the Bayes classifier could be made adaptive, but at the expense of increased storage and more complex computations.

Page 90: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

APPENDIX A

Unconstrained Optimization Techniques

Page 91: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Unconstrained Optimization Techniques
http://ai.kaist.ac.kr/~jkim/cs679/ — Haykin, Chapter 3

Cost function E(w):
– continuously differentiable
– a measure of how to choose w of an adaptive filtering algorithm so that it behaves in an optimum manner

We want to find an optimal solution w* that minimizes E(w), i.e. ∇E(w*) = 0.
– Local iterative descent: starting with an initial guess denoted by w(0), generate a sequence of weight vectors w(1), w(2), …, such that the cost function E(w) is reduced at each iteration of the algorithm, as shown by E(w(n+1)) < E(w(n))
– Methods: Steepest Descent, Newton's, Gauss-Newton's

Page 92: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Method of Steepest Descent
Here the successive adjustments applied to w are in the direction of steepest descent, that is, in a direction opposite to grad(E(w)):

w(n+1) = w(n) − η g(n)
η: a small positive constant called the step size or learning-rate parameter
g(n): grad(E(w(n)))

The method of steepest descent converges to the optimal solution w* slowly. The learning-rate parameter η has a profound influence on its convergence behaviour: it may be overdamped, underdamped, or even unstable (divergent).
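A small illustrative sketch of the steepest-descent iteration; the quadratic cost, step size and iteration count below are arbitrary choices, not values from the lecture:

```python
import numpy as np

def steepest_descent(grad, w0, eta=0.1, n_iter=100):
    """w(n+1) = w(n) - eta * g(n), with g(n) the gradient of E at w(n)."""
    w = np.asarray(w0, dtype=float)
    for _ in range(n_iter):
        w = w - eta * grad(w)
    return w

# Example cost E(w) = 0.5 w^T A w - b^T w, whose gradient is A w - b and minimum A^{-1} b.
A = np.array([[3.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -2.0])
w_star = steepest_descent(lambda w: A @ w - b, w0=[0.0, 0.0], eta=0.2, n_iter=500)
print(w_star, np.linalg.solve(A, b))   # the iterate approaches the analytic minimum
```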

Page 93: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Newton's Method
Using a second-order Taylor series expansion of the cost function around the point w(n):

ΔE(w(n)) = E(w(n+1)) − E(w(n)) ≈ g^T(n) Δw(n) + ½ Δw^T(n) H(n) Δw(n)
where Δw(n) = w(n+1) − w(n) and H(n) is the Hessian matrix of E at w(n).

We want the Δw*(n) that minimizes ΔE(w(n)), so we differentiate ΔE(w(n)) with respect to Δw(n) and set the result to zero:
g(n) + H(n) Δw*(n) = 0
so,
Δw*(n) = −H⁻¹(n) g(n)

Page 94: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Newton’s Method…

Finally,
w(n+1) = w(n) + Δw(n) = w(n) − H⁻¹(n) g(n)

Newton's method converges quickly asymptotically and does not exhibit the zigzagging behaviour of steepest descent; however, the Hessian H(n) has to be a positive definite matrix for all n.

Page 95: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Gauss-Newton Method
The Gauss-Newton method is applicable to a cost function of the form
E(w) = ½ Σ_{i=1}^{n} e²(i)

Because the error signal e(i) is a function of w, we linearize the dependence of e(i) on w by writing
e'(i, w) = e(i) + [∂e(i)/∂w]^T (w − w(n)),  i = 1, …, n

Equivalently, by using matrix notation we may write
e'(n, w) = e(n) + J(n) (w − w(n))

Page 96: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Gauss-Newton Method…
where J(n) is the n-by-m Jacobian matrix of e(n), i.e. the matrix whose (i, k) entry is ∂e(i)/∂w_k (i = 1, …, n; k = 1, …, m).

We want the updated weight vector w(n+1) defined by
w(n+1) = arg min_w { ½ ‖e'(n, w)‖² }

A simple algebraic calculation gives
½ ‖e'(n, w)‖² = ½ ‖e(n)‖² + e^T(n) J(n) (w − w(n)) + ½ (w − w(n))^T J^T(n) J(n) (w − w(n))

Now differentiate this expression with respect to w and set the result to 0; we obtain

Page 97: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Gauss-Newton Method…

J^T(n) e(n) + J^T(n) J(n) (w − w(n)) = 0

Thus we get
w(n+1) = w(n) − [J^T(n) J(n)]⁻¹ J^T(n) e(n)

To guard against the possibility that the matrix product J^T(n) J(n) is singular, the customary practice is to use
w(n+1) = w(n) − [J^T(n) J(n) + δI]⁻¹ J^T(n) e(n)
where δ is a small positive constant. The effect of this modification is progressively reduced as the number of iterations, n, is increased.

Page 98: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Linear Least-Squares Filter
The single neuron around which it is built is linear, and the cost function consists of the sum of error squares.

Using y(i) = x^T(i) w(i) and e(i) = d(i) − y(i), the error vector is
e(n) = d(n) − X(n) w(n)

Differentiating with respect to w(n) gives ∇e(n) = −X^T(n), and correspondingly J(n) = −X(n).

From the Gauss-Newton method (eq. 3.22),
w(n+1) = [X^T(n) X(n)]⁻¹ X^T(n) d(n) = X⁺(n) d(n)

Page 99: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

LMS Algorithm
Based on the use of instantaneous values for the cost function:
E(w) = ½ e²(n)

Differentiating with respect to w:
∂E(w)/∂w = e(n) ∂e(n)/∂w

The error signal in the LMS algorithm is
e(n) = d(n) − x^T(n) w(n)
hence
∂e(n)/∂w(n) = −x(n)
so,
∂E(w)/∂w(n) = −x(n) e(n)

Page 100: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

LMS Algorithm …
Using ĝ(n) = −x(n) e(n) as an estimate for the gradient vector ∂E(w)/∂w(n), and substituting it into the steepest-descent update, the LMS algorithm follows as

ŵ(n+1) = ŵ(n) + η x(n) e(n)
– η: learning-rate parameter

The inverse of η is a measure of the memory of the LMS algorithm: when η is small, the adaptive process progresses slowly, more of the past data are remembered, and a more accurate filtering action results.
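A minimal sketch of the LMS update on synthetic data (the "true" weights, the noise level and η are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
w_true = np.array([1.5, -0.7, 0.3])          # unknown system to be identified (illustrative)
w_hat = np.zeros(3)                          # LMS estimate of the weight vector
eta = 0.05                                   # learning-rate parameter

for n in range(2000):
    x = rng.normal(size=3)                   # input vector x(n)
    d = w_true @ x + 0.01 * rng.normal()     # desired response with a little noise
    e = d - w_hat @ x                        # error signal e(n) = d(n) - x^T(n) w_hat(n)
    w_hat = w_hat + eta * x * e              # LMS weight update
print(w_hat)                                 # close to w_true
```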

Page 101: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

LMS Characteristics
The LMS algorithm produces an estimate ŵ(n) of the weight vector w(n); in doing so it sacrifices a distinctive feature:

• Steepest descent algorithm: follows a well-defined trajectory

• LMS algorithm: follows a random trajectory

– As the number of iterations goes to infinity, ŵ(n) performs a random walk

But importantly, the LMS algorithm does not require knowledge of the statistics of the environment.

Page 102: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Convergence Consideration
Two distinct quantities, η and x(n), determine convergence:
– the user supplies η, and the selection of η is important for the LMS algorithm to converge.

Convergence of the mean:
E[ŵ(n)] → w0 as n → ∞
– This is not a practical criterion.

Convergence in the mean square:
E[e²(n)] → constant as n → ∞

Convergence condition for the LMS algorithm in the mean square:
0 < η < 2 / (sum of mean-square values of the sensor inputs)

Page 103: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

APPENDIX B

Perceptron Convergence Proof

Page 104: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Perceptron Convergence Proof Haykin, Chapter 3

Consider the following perceptron:

v(n) = Σ_{i=0}^{m} w_i(n) x_i(n) = w^T(n) x(n)

w^T x > 0 for every input vector x belonging to class C1
w^T x ≤ 0 for every input vector x belonging to class C2

Page 105: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Perceptron Convergence Proof…
The algorithm for the weight adjustment of the perceptron:
– if x(n) is correctly classified, no adjustment is made to w:
w(n+1) = w(n) if w^T(n) x(n) > 0 and x(n) belongs to class C1
w(n+1) = w(n) if w^T(n) x(n) ≤ 0 and x(n) belongs to class C2
– otherwise:
w(n+1) = w(n) − η(n) x(n) if w^T(n) x(n) > 0 and x(n) belongs to class C2
w(n+1) = w(n) + η(n) x(n) if w^T(n) x(n) ≤ 0 and x(n) belongs to class C1
– the learning-rate parameter η(n) controls the adjustment applied to the weight vector.

Page 106: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Perceptron Convergence Proof
For η(n) = 1 and w(0) = 0, suppose the perceptron incorrectly classifies the vectors x(1), x(2), …, so that w^T(n) x(n) ≤ 0 while x(n) belongs to C1; then the rule gives
w(n+1) = w(n) + η(n) x(n)
But since η(n) = 1, for x(n) belonging to C1:
w(n+1) = w(n) + x(n)
Since w(0) = 0, we iteratively find
w(n+1) = x(1) + x(2) + … + x(n)    (B1)

Since the classes C1 and C2 are assumed to be linearly separable, there exists a solution w0 for which w0^T x(n) > 0 for the vectors x(1), …, x(n) belonging to the subset H1 (the subset of training vectors that belong to class C1).

Page 107: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Perceptron Convergence Proof

For a fixed solution w0, we may then define a positive number α as
α = min_{x(n) ∈ H1} w0^T x(n)    (B2)

Hence equation (B1) above implies
w0^T w(n+1) = w0^T x(1) + w0^T x(2) + … + w0^T x(n)

Using equation (B2) (since each term is greater than or equal to α), we have
w0^T w(n+1) ≥ n·α

Now we use the Cauchy-Schwarz inequality:
‖a‖² ‖b‖² ≥ (a^T b)²,  i.e.  ‖a‖² ≥ (a^T b)² / ‖b‖²  for b ≠ 0

Page 108: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Perceptron Convergence Proof
This implies that
‖w(n+1)‖² ≥ n² α² / ‖w0‖²    (B3)

Now let's follow another development route (notice the index k):
w(k+1) = w(k) + x(k),  for k = 1, …, n and x(k) ∈ H1
Taking the squared Euclidean norm of both sides, we get
‖w(k+1)‖² = ‖w(k)‖² + ‖x(k)‖² + 2 w^T(k) x(k)
But under the assumption that the perceptron incorrectly classifies an input vector x(k) belonging to the subset H1, we have w^T(k) x(k) ≤ 0, hence
‖w(k+1)‖² ≤ ‖w(k)‖² + ‖x(k)‖²

Page 109: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Perceptron Convergence Proof
Or equivalently,
‖w(k+1)‖² − ‖w(k)‖² ≤ ‖x(k)‖²,  k = 1, …, n

Adding these inequalities for k = 1, …, n, and invoking the initial condition w(0) = 0, we get the following inequality:
‖w(n+1)‖² ≤ Σ_{k=1}^{n} ‖x(k)‖² ≤ n·β    (B4)

where β is a positive number defined by
β = max_{x(k) ∈ H1} ‖x(k)‖²

Eq. B4 states that the squared Euclidean norm of w(n+1) grows at most linearly with the number of iterations n.

Page 110: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Perceptron Convergence Proof

The second result, (B4), is clearly in conflict with Eq. (B3).
• Indeed, we can state that n cannot be larger than some value nmax for which Eqs. (B3) and (B4) are both satisfied with the equality sign. That is, nmax is the solution of the equation
nmax² α² / ‖w0‖² = nmax β

• Solving for nmax given a solution w0, we find that
nmax = β ‖w0‖² / α²

We have thus proved that for η(n) = 1 for all n, and for w(0) = 0, given that a solution vector w0 exists, the rule for adapting the synaptic weights of the perceptron must terminate after at most nmax iterations.
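For concreteness, a small sketch (my own illustration, not from the slides) of how the bound can be evaluated: take the earlier 1-D example with the class-2 patterns sign-flipped so that every vector lies in H1, pick any solution w0 with w0^T x > 0 for all of them, and compute α, β and nmax:

```python
import numpy as np

# Illustrative H1: class-1 patterns plus class-2 patterns with their sign flipped,
# so a solution w0 must satisfy w0^T x > 0 for every row below (assumed example data).
H1 = np.array([[1.0, 1.0], [0.5, -1.0], [3.0, 1.0], [2.0, -1.0]])
w0 = np.array([1.0, 0.2])                      # one separating solution for these vectors

alpha = np.min(H1 @ w0)                        # eq. (B2): smallest margin on H1
beta = np.max(np.sum(H1 ** 2, axis=1))         # largest squared norm of a training vector
n_max = beta * (w0 @ w0) / alpha ** 2          # bound on the number of weight corrections
print(alpha, beta, n_max)
```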

Page 111: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

MORE READING

Page 112: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

Suggested Reading

S. Haykin, “Neural Networks”, Prentice-Hall, 1999, Chapter 3.
L. Fausett, “Fundamentals of Neural Networks”, Prentice-Hall, 1994, Chapter 2.
R. O. Duda, P. E. Hart, and D. G. Stork, “Pattern Classification”, 2nd edition, Wiley, 2001, Appendix A4, Chapter 2, and Chapter 5.
J. M. Zurada, “Introduction to Artificial Neural Systems”, West Publishing Company, 1992, Chapter 3.

Page 113: Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

References:

These lecture notes were based on the references of the previous slide, and the following references

1. Berlin Chen Lecture notes: Normal University, Taipei, Taiwan, ROC. http://140.122.185.120

2. Ehud Rivlin, IIT: http://webcourse.technion.ac.il/236607/Winter2002-2003/en/ho.html

3. Jin Hyung Kim, KAIST Computer Science Dept., CS679 Neural Network lecture notes http://ai.kaist.ac.kr/~jkim/cs679/detail.htm

4. Dr John A. Bullinaria, Course Material, Introduction to Neural Networks, http://www.cs.bham.ac.uk/~jxb/inn.html