LINEAR CLASSIFIERS

Problem: Consider a two-class task with classes ω1, ω2, and the linear discriminant

    g(x) = w^T x + w0 = w1 x1 + w2 x2 + ... + wl xl + w0 = 0

Assume x1, x2 are two points on the decision hyperplane:

    w^T x1 + w0 = w^T x2 + w0 = 0
    ⟹ w^T (x1 − x2) = 0,  ∀ x1, x2 on the hyperplane
Hence:

    w ⊥ to the hyperplane  g(x) = w^T x + w0 = 0

The distance of a point x from the hyperplane is

    z = |g(x)| / sqrt(w1^2 + w2^2)

and the distance of the hyperplane from the origin is

    d = |w0| / sqrt(w1^2 + w2^2)
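These two formulas are easy to sanity-check in a few lines of Python (a minimal sketch; the helper names and the example hyperplane are illustrative, not from the slides):

```python
import math

def g(w, w0, x):
    # Linear discriminant g(x) = w^T x + w0
    return sum(wi * xi for wi, xi in zip(w, x)) + w0

def distance(w, w0, x):
    # Distance of x from the hyperplane g(x) = 0: z = |g(x)| / ||w||
    return abs(g(w, w0, x)) / math.sqrt(sum(wi * wi for wi in w))

w, w0 = [1.0, 1.0], -0.5            # hyperplane x1 + x2 - 0.5 = 0
print(distance(w, w0, [0.0, 0.0]))  # distance from the origin: |w0| / ||w||
```

Evaluating at the origin recovers the second formula, d = |w0| / ||w||, as a special case of the first.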
The Perceptron Algorithm

Assume linearly separable classes, i.e., there exists w* such that

    w*^T x > 0  ∀ x ∈ ω1
    w*^T x < 0  ∀ x ∈ ω2

The case w*^T x + w0* falls under the above formulation, since augmenting the vectors as

    x' = [x^T, 1]^T,  w' = [w*^T, w0*]^T

gives

    w*^T x + w0* = w'^T x'
Our goal: Compute a solution, i.e., a hyperplane w, so that

    w^T x > (<) 0,  x ∈ ω1 (ω2)

• The steps
  – Define a cost function to be minimized
  – Choose an algorithm to minimize the cost function
  – The minimum corresponds to a solution
The Cost Function

    J(w) = Σ_{x∈Y} (δ_x w^T x)

• where Y is the subset of the vectors wrongly classified by w. When Y = ∅ (empty set), a solution is achieved and

    J(w) = 0

• δ_x = −1 if x ∈ Y and x ∈ ω1
• δ_x = +1 if x ∈ Y and x ∈ ω2
• J(w) ≥ 0
• J(w) is piecewise linear (WHY?)

The Algorithm
• The philosophy of gradient descent is adopted.
• Wherever valid, gradient descent gives

    w(new) = w(old) − ρ ∂J(w)/∂w |_{w = w(old)}

    ∂J(w)/∂w = Σ_{x∈Y} δ_x x

so the update becomes

    w(t+1) = w(t) − ρ_t Σ_{x∈Y} δ_x x

This is the celebrated Perceptron Algorithm.
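The batch update above can be implemented directly. The sketch below runs it on a tiny hypothetical dataset in augmented form [x1, x2, 1]; the data, learning rate, and iteration cap are all illustrative, not from the slides:

```python
def perceptron(X, labels, rho=0.1, max_iter=1000):
    # Batch perceptron: w(t+1) = w(t) - rho * sum_{x in Y} delta_x * x,
    # where Y is the set of currently misclassified vectors and
    # delta_x = -1 for omega_1 vectors, +1 for omega_2 vectors (i.e. -y).
    w = [0.0] * len(X[0])
    for _ in range(max_iter):
        grad = [0.0] * len(w)
        mistakes = 0
        for x, y in zip(X, labels):          # y = +1 (omega_1), y = -1 (omega_2)
            if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
                mistakes += 1
                for j, xj in enumerate(x):
                    grad[j] += -y * xj       # delta_x * x
        if mistakes == 0:                    # Y is empty: solution found
            break
        w = [wj - rho * gj for wj, gj in zip(w, grad)]
    return w

X = [[0.0, 1.0, 1.0], [1.0, 2.0, 1.0],   # omega_1 (augmented)
     [2.0, 0.0, 1.0], [3.0, 1.0, 1.0]]   # omega_2 (augmented)
labels = [1, 1, -1, -1]
w = perceptron(X, labels)
# On convergence, every training vector satisfies y * w^T x > 0
```

Using the label y = ±1 folds the two signs of δ_x into the single test y · w^T x ≤ 0.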
An example: a misclassified x pulls the weight vector towards it,

    w(t+1) = w(t) + ρ x

The perceptron algorithm converges in a finite number of iteration steps to a solution if the sequence ρ_t is chosen so that

    lim_{k→∞} Σ_{t=0}^{k} ρ_t = ∞

    lim_{k→∞} (Σ_{t=0}^{k} ρ_t^2) / (Σ_{t=0}^{k} ρ_t)^2 = 0

    e.g.:  ρ_t = c / t
A useful variant of the perceptron algorithm

It is a reward and punishment type of algorithm:

    w(t+1) = w(t) + ρ x_(t)   if x_(t) ∈ ω1 and w^T(t) x_(t) ≤ 0
    w(t+1) = w(t) − ρ x_(t)   if x_(t) ∈ ω2 and w^T(t) x_(t) ≥ 0
    w(t+1) = w(t)             otherwise
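This sample-by-sample scheme is equally short in Python. The sketch below uses illustrative data; with labels y = ±1, both correction branches collapse into the single test y · w^T(t) x_(t) ≤ 0:

```python
def perceptron_online(X, labels, rho=0.7, epochs=100):
    # Reward-punishment perceptron: correct w only when the
    # currently presented sample x_(t) is misclassified.
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        changed = False
        for x, y in zip(X, labels):  # y = +1 (omega_1), y = -1 (omega_2)
            if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
                w = [wi + rho * y * xi for wi, xi in zip(w, x)]
                changed = True
        if not changed:              # a full pass with no corrections: done
            break
    return w

X = [[0.0, 1.0, 1.0], [1.0, 2.0, 1.0],   # omega_1 (augmented)
     [2.0, 0.0, 1.0], [3.0, 1.0, 1.0]]   # omega_2 (augmented)
labels = [1, 1, -1, -1]
w = perceptron_online(X, labels)
```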
The perceptron

[Figure: a neuron that forms w^T x + w0 from its inputs, with synapses or synaptic weights wi's and threshold w0]

It is a learning machine that learns from the training vectors via the perceptron algorithm.

The network is called perceptron or neuron.
Example: At some stage t the perceptron algorithm results in

    w1 = 1, w2 = 1, w0 = −0.5

The corresponding hyperplane is

    x1 + x2 − 0.5 = 0

With ρ = 0.7, updating with the misclassified vectors [−0.2, 0.75]^T (from ω2) and [0.4, 0.05]^T (from ω1):

    w(t+1) = [1, 1, −0.5]^T + 0.7(−1)[−0.2, 0.75, 1]^T + 0.7(+1)[0.4, 0.05, 1]^T
           = [1.42, 0.51, −0.5]^T
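The arithmetic of this update can be checked directly (a short sketch; the ω2 vector enters with a minus sign and the ω1 vector with a plus sign, as the algorithm prescribes):

```python
rho = 0.7
w = [1.0, 1.0, -0.5]        # current estimate [w1, w2, w0]
x_a = [-0.2, 0.75, 1.0]     # misclassified vector from omega_2 -> subtract rho*x
x_b = [0.4, 0.05, 1.0]      # misclassified vector from omega_1 -> add rho*x
w_next = [wi - rho * xa + rho * xb for wi, xa, xb in zip(w, x_a, x_b)]
print(w_next)               # approximately [1.42, 0.51, -0.5]
```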
Least Squares Methods

If classes are linearly separable, the perceptron output results in ±1.

If classes are NOT linearly separable, we shall compute the weights w0, w1, ..., wl so that the difference between
• the actual output of the classifier, w^T x, and
• the desired outputs, e.g.,

    y = +1 if x ∈ ω1
    y = −1 if x ∈ ω2

is SMALL.
SMALL, in the mean square error sense, means to choose w so that the cost function

    J(w) = E[(y − x^T w)^2]

is minimum:

    ŵ = argmin_w J(w)

where y is the corresponding desired response.
Minimizing J(w) with respect to w results in:

    ∂J(w)/∂w = ∂/∂w E[(y − x^T w)^2] = −2 E[x (y − x^T w)] = 0
    ⟹ E[x x^T] ŵ = E[x y]
    ⟹ ŵ = R_x^{−1} E[x y]

where R_x is the autocorrelation matrix

    R_x = E[x x^T] = [ E[x1 x1]  ...  E[x1 xl]
                       .................
                       E[xl x1]  ...  E[xl xl] ]

and E[x y] the crosscorrelation vector

    E[x y] = [ E[x1 y], ..., E[xl y] ]^T
SMALL in the sum of error squares sense means

    J(w) = Σ_{i=1}^{N} (y_i − x_i^T w)^2

where (y_i, x_i), i = 1, ..., N, are the training pairs, that is, the input x_i and its corresponding class label y_i (±1). Then

    ∂J(w)/∂w = 0  ⟹  (Σ_{i=1}^{N} x_i x_i^T) ŵ = Σ_{i=1}^{N} x_i y_i
Pseudoinverse Matrix

Define

    X = [ x_1^T
          x_2^T
          ...
          x_N^T ]               (an N×l matrix)

    X^T = [x_1, x_2, ..., x_N]  (an l×N matrix)

    y = [y_1, ..., y_N]^T       (the corresponding desired responses)

Then

    X^T X = Σ_{i=1}^{N} x_i x_i^T,    X^T y = Σ_{i=1}^{N} x_i y_i
Thus

    (X^T X) ŵ = X^T y  ⟹  ŵ = (X^T X)^{−1} X^T y = X^+ y

where

    X^+ ≡ (X^T X)^{−1} X^T    is the Pseudoinverse of X

Assume N = l. Then X is square and (assumed) invertible, and

    X^+ = (X^T X)^{−1} X^T = X^{−1} (X^T)^{−1} X^T = X^{−1}

so ŵ = X^{−1} y.
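The identity X^+ = X^{−1} for a square invertible X is easy to verify numerically. A minimal sketch with a hand-picked 2×2 matrix (helper names are illustrative):

```python
def matmul(A, B):
    # Plain triple-loop matrix product
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(M):
    return [list(row) for row in zip(*M)]

def inv2(M):
    # Closed-form inverse of a 2x2 matrix
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

X = [[2.0, 1.0], [1.0, 1.0]]              # square and invertible
Xt = transpose(X)
X_pinv = matmul(inv2(matmul(Xt, X)), Xt)  # X^+ = (X^T X)^{-1} X^T
print(X_pinv)                             # equals X^{-1} = [[1, -1], [-1, 2]]
```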
Assume N > l. Then, in general, there is no solution to satisfy all equations simultaneously:

    X w = y :   x_1^T w = y_1
                x_2^T w = y_2
                ....
                x_N^T w = y_N     (N equations, l unknowns)

The "solution" ŵ = X^+ y corresponds to the minimum sum of error squares solution.
Example:

    ω1: [0.4, 0.5]^T, [0.6, 0.5]^T, [0.1, 0.4]^T, [0.2, 0.7]^T, [0.3, 0.3]^T
    ω2: [0.4, 0.6]^T, [0.6, 0.2]^T, [0.7, 0.4]^T, [0.8, 0.6]^T, [0.7, 0.5]^T

Augmenting each vector with a third coordinate equal to 1, and taking the desired response +1 for ω1 and −1 for ω2:

    X = [ 0.4  0.5  1          y = [  1
          0.6  0.5  1                 1
          0.1  0.4  1                 1
          0.2  0.7  1                 1
          0.3  0.3  1                 1
          0.4  0.6  1                −1
          0.6  0.2  1                −1
          0.7  0.4  1                −1
          0.8  0.6  1                −1
          0.7  0.5  1 ],            −1 ]
    X^T X = [ 2.8   2.24  4.8          X^T y = [ −1.6
              2.24  2.41  4.7                     0.1
              4.8   4.7   10  ],                  0.0 ]

    ŵ = (X^T X)^{−1} X^T y ≈ [ −3.22
                                0.24
                                1.43 ]
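These numbers can be reproduced by solving the 3×3 normal equations directly. The sketch below applies Cramer's rule to the matrices shown above:

```python
def det3(M):
    # Determinant of a 3x3 matrix by cofactor expansion
    (a, b, c), (d, e, f), (g, h, i) = M
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def replace_col(M, col, v):
    return [[v[r] if c == col else M[r][c] for c in range(3)] for r in range(3)]

A = [[2.8, 2.24, 4.8],
     [2.24, 2.41, 4.7],
     [4.8, 4.7, 10.0]]     # X^T X
b = [-1.6, 0.1, 0.0]       # X^T y

d = det3(A)
w = [det3(replace_col(A, j, b)) / d for j in range(3)]
print([round(wi, 3) for wi in w])   # approximately [-3.218, 0.241, 1.431]
```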
SUPPORT VECTOR MACHINES

The goal: Given two linearly separable classes, design the classifier

    g(x) = w^T x + w0 = 0

that leaves the maximum margin from both classes.
Margin: Each hyperplane is characterized by
• its direction in space, i.e., w
• its position in space, i.e., w0
• For EACH direction, choose the hyperplane that leaves the SAME distance from the nearest points from each class. The margin is twice this distance.
The distance of a point x̂ from a hyperplane is given by

    z = |g(x̂)| / ||w||

Scale w, w0 so that at the nearest points from each class the discriminant value is ±1:

    g(x) = 1 for the nearest point(s) in ω1,  g(x) = −1 for the nearest point(s) in ω2

Thus the margin is given by

    1/||w|| + 1/||w|| = 2/||w||

Also, the following is valid:

    w^T x + w0 ≥ 1,   ∀ x ∈ ω1
    w^T x + w0 ≤ −1,  ∀ x ∈ ω2
SVM (linear) classifier:

    g(x) = w^T x + w0

Minimize

    J(w) = (1/2) ||w||^2

subject to

    y_i (w^T x_i + w0) ≥ 1,  i = 1, 2, ..., N

    y_i = +1 for x_i ∈ ω1,  y_i = −1 for x_i ∈ ω2

This is because, by minimizing ||w||, the margin 2/||w|| is maximised.
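A tiny worked instance makes the formulation concrete. For two illustrative points, x1 = [1, 1]^T ∈ ω1 and x2 = [−1, −1]^T ∈ ω2, the max-margin solution can be derived by hand as w = [0.5, 0.5]^T, w0 = 0; the sketch below checks the constraints and the resulting margin:

```python
import math

X = [[1.0, 1.0], [-1.0, -1.0]]
y = [1, -1]
w, w0 = [0.5, 0.5], 0.0      # max-margin solution, derived by hand

# Both constraints y_i (w^T x_i + w0) >= 1 hold with equality,
# so both points are support vectors.
for xi, yi in zip(X, y):
    gi = sum(a * b for a, b in zip(w, xi)) + w0
    assert abs(yi * gi - 1.0) < 1e-12

margin = 2.0 / math.sqrt(sum(wi * wi for wi in w))
print(margin)    # 2*sqrt(2): the full distance between the two points
```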
The above is a quadratic optimization task subject to a set of linear inequality constraints. The Karush-Kuhn-Tucker conditions state that the minimizer satisfies:

• (1) ∂/∂w L(w, w0, λ) = 0
• (2) ∂/∂w0 L(w, w0, λ) = 0
• (3) λ_i ≥ 0,  i = 1, 2, ..., N
• (4) λ_i [y_i (w^T x_i + w0) − 1] = 0,  i = 1, 2, ..., N
• where L(·,·,·) is the Lagrangian

    L(w, w0, λ) = (1/2) w^T w − Σ_{i=1}^{N} λ_i [y_i (w^T x_i + w0) − 1]
The solution: from the above, it turns out that

•   w = Σ_{i=1}^{N} λ_i y_i x_i

•   Σ_{i=1}^{N} λ_i y_i = 0
Remarks:
• The Lagrange multipliers can be either zero or positive. Thus,

    w = Σ_{i=1}^{Ns} λ_i y_i x_i

  where Ns ≤ N, corresponding to the positive Lagrange multipliers.
  – From constraint (4) above, i.e.,

        λ_i [y_i (w^T x_i + w0) − 1] = 0,  i = 1, 2, ..., N

    the vectors contributing to w satisfy

        w^T x_i + w0 = ±1
– These vectors are known as SUPPORT VECTORS and are the closest vectors, from each class, to the classifier.
– Once w is computed, w0 is determined from conditions (4).
– The optimal hyperplane classifier of a support vector machine is UNIQUE.
Dual Problem Formulation
• The SVM formulation is a convex programming problem, with
  – convex cost function
  – convex region of feasible solutions
• Thus, its solution can be achieved by its dual problem, i.e.,
  – maximize  L(w, w0, λ)
  – subject to

        w = Σ_{i=1}^{N} λ_i y_i x_i

        Σ_{i=1}^{N} λ_i y_i = 0

        λ ≥ 0
• Combine the above to obtain:
  – maximize

        Σ_{i=1}^{N} λ_i − (1/2) Σ_{i,j} λ_i λ_j y_i y_j x_i^T x_j

  – subject to

        Σ_{i=1}^{N} λ_i y_i = 0

        λ ≥ 0
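For a hand-worked two-point toy problem, x1 = [1, 1]^T ∈ ω1 and x2 = [−1, −1]^T ∈ ω2 (illustrative data, with multipliers λ = (0.25, 0.25) solved by hand), the sketch below checks the dual constraints, reconstructs w, and confirms that the dual objective equals the primal optimum (1/2)||w||^2:

```python
X = [[1.0, 1.0], [-1.0, -1.0]]
y = [1, -1]
lam = [0.25, 0.25]           # Lagrange multipliers, solved by hand

# w = sum_i lambda_i y_i x_i
w = [sum(l * yi * xi[j] for l, yi, xi in zip(lam, y, X)) for j in range(2)]

# Equality constraint: sum_i lambda_i y_i = 0
assert sum(l * yi for l, yi in zip(lam, y)) == 0.0

dot = lambda a, b: sum(p * q for p, q in zip(a, b))
# Dual objective: sum_i lambda_i - 0.5 * sum_ij lambda_i lambda_j y_i y_j x_i^T x_j
dual = sum(lam) - 0.5 * sum(lam[i] * lam[j] * y[i] * y[j] * dot(X[i], X[j])
                            for i in range(2) for j in range(2))
primal = 0.5 * dot(w, w)
print(w, dual, primal)       # [0.5, 0.5], 0.25, 0.25 -- strong duality holds
```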
Remarks:
• Support vectors enter via inner products.
• Although the solution w is unique, the Lagrange multipliers ARE NOT.

Non-Separable Classes
In this case there is no hyperplane such that

    y (w^T x + w0) > 1,  ∀ x

• Recall that the margin is defined as twice the distance between the following two hyperplanes:

    w^T x + w0 = 1   and   w^T x + w0 = −1
The training vectors belong to one of three possible categories:

• A. Vectors outside the band which are correctly classified, i.e.,

        y_i (w^T x + w0) > 1

• B. Vectors inside the band which are correctly classified, i.e.,

        0 ≤ y_i (w^T x + w0) < 1

• C. Vectors misclassified, i.e.,

        y_i (w^T x + w0) < 0
All three cases above can be represented as

    y_i (w^T x + w0) ≥ 1 − ξ_i

• A.  ξ_i = 0
• B.  0 < ξ_i ≤ 1
• C.  ξ_i > 1

The ξ_i are known as slack variables.
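The three categories correspond to slack values ξ_i = max(0, 1 − y_i g(x_i)), which a short sketch can make concrete (the separator and the sample points are illustrative):

```python
w, w0 = [1.0, 1.0], -0.5     # hypothetical separator

def category(x, y):
    # Slack value xi = max(0, 1 - y * g(x)) decides the category
    g = sum(a * b for a, b in zip(w, x)) + w0
    slack = max(0.0, 1.0 - y * g)
    if slack == 0.0:
        return "A"           # outside the band, correctly classified
    if slack <= 1.0:
        return "B"           # inside the band, still correctly classified
    return "C"               # misclassified

print(category([2.0, 2.0], 1))    # A: y*g(x) = 3.5 > 1
print(category([0.5, 0.5], 1))    # B: y*g(x) = 0.5, slack 0.5
print(category([0.0, 0.0], 1))    # C: y*g(x) = -0.5, slack 1.5
```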
The goal of the optimization is now two-fold:
• Maximize the margin
• Minimize the number of patterns with ξ_i > 0

That is,

    J(w, w0, ξ) = (1/2) ||w||^2 + C Σ_{i=1}^{N} I(ξ_i)

where C is a constant and

    I(ξ_i) = 1 if ξ_i > 0,   I(ξ_i) = 0 if ξ_i = 0

• I(·) is not differentiable. In practice, we use the approximation

    J(w, w0, ξ) = (1/2) ||w||^2 + C Σ_{i=1}^{N} ξ_i

• Following a similar procedure as before, we obtain the
KKT conditions:

    (1)  w = Σ_{i=1}^{N} λ_i y_i x_i
    (2)  Σ_{i=1}^{N} λ_i y_i = 0
    (3)  C − μ_i − λ_i = 0,  i = 1, 2, ..., N
    (4)  λ_i [y_i (w^T x_i + w0) − 1 + ξ_i] = 0,  i = 1, 2, ..., N
    (5)  μ_i ξ_i = 0,  i = 1, 2, ..., N
    (6)  μ_i, λ_i ≥ 0,  i = 1, 2, ..., N
The associated dual problem:

Maximize

    Σ_{i=1}^{N} λ_i − (1/2) Σ_{i,j} λ_i λ_j y_i y_j x_i^T x_j

subject to

    0 ≤ λ_i ≤ C,  i = 1, 2, ..., N

    Σ_{i=1}^{N} λ_i y_i = 0

Remarks: The only difference from the separable class case is the presence of C in the constraints.