LINEAR CLASSIFIERS

Problem: Consider a two-class task with classes ω1, ω2, and the linear discriminant

    g(x) = w^T x + w0 = w1 x1 + w2 x2 + ... + wl xl + w0 = 0

Assume x1, x2 are two points on the decision hyperplane:

    w^T x1 + w0 = w^T x2 + w0 = 0
    ⟹ w^T (x1 − x2) = 0,  ∀ x1, x2 on the hyperplane
Hence:

    w ⊥ to the hyperplane  g(x) = w^T x + w0 = 0

The distance of a point x from the hyperplane is

    z = |g(x)| / sqrt(w1^2 + w2^2)

and the distance of the hyperplane from the origin is

    d = |w0| / sqrt(w1^2 + w2^2)
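These two formulas are easy to sanity-check in a few lines of Python (a minimal sketch; the helper names and the example hyperplane are illustrative, not from the slides):

```python
import math

def g(w, w0, x):
    # Linear discriminant g(x) = w^T x + w0
    return sum(wi * xi for wi, xi in zip(w, x)) + w0

def distance(w, w0, x):
    # Distance of x from the hyperplane g(x) = 0: z = |g(x)| / ||w||
    return abs(g(w, w0, x)) / math.sqrt(sum(wi * wi for wi in w))

w, w0 = [1.0, 1.0], -0.5            # hyperplane x1 + x2 - 0.5 = 0
print(distance(w, w0, [0.0, 0.0]))  # distance from the origin: |w0| / ||w||
```

Evaluating at the origin recovers the second formula, d = |w0| / ||w||, as a special case of the first.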
The Perceptron Algorithm

Assume linearly separable classes, i.e., there exists w* such that

    w*^T x > 0  ∀ x ∈ ω1
    w*^T x < 0  ∀ x ∈ ω2

The case w*^T x + w0* falls under the above formulation, since augmenting the vectors as

    x' = [x^T, 1]^T,  w' = [w*^T, w0*]^T

gives

    w*^T x + w0* = w'^T x'
Our goal: Compute a solution, i.e., a hyperplane w, so that

    w^T x > (<) 0,  x ∈ ω1 (ω2)

• The steps
  – Define a cost function to be minimized
  – Choose an algorithm to minimize the cost function
  – The minimum corresponds to a solution
The Cost Function

    J(w) = Σ_{x∈Y} (δ_x w^T x)

• where Y is the subset of the vectors wrongly classified by w. When Y = ∅ (empty set), a solution is achieved and

    J(w) = 0

• δ_x = −1 if x ∈ Y and x ∈ ω1
• δ_x = +1 if x ∈ Y and x ∈ ω2
• J(w) ≥ 0
• J(w) is piecewise linear (WHY?)

The Algorithm
• The philosophy of gradient descent is adopted.
• Wherever valid, gradient descent gives

    w(new) = w(old) − ρ ∂J(w)/∂w |_{w = w(old)}

    ∂J(w)/∂w = Σ_{x∈Y} δ_x x

so the update becomes

    w(t+1) = w(t) − ρ_t Σ_{x∈Y} δ_x x

This is the celebrated Perceptron Algorithm.
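The batch update above can be implemented directly. The sketch below runs it on a tiny hypothetical dataset in augmented form [x1, x2, 1]; the data, learning rate, and iteration cap are all illustrative, not from the slides:

```python
def perceptron(X, labels, rho=0.1, max_iter=1000):
    # Batch perceptron: w(t+1) = w(t) - rho * sum_{x in Y} delta_x * x,
    # where Y is the set of currently misclassified vectors and
    # delta_x = -1 for omega_1 vectors, +1 for omega_2 vectors (i.e. -y).
    w = [0.0] * len(X[0])
    for _ in range(max_iter):
        grad = [0.0] * len(w)
        mistakes = 0
        for x, y in zip(X, labels):          # y = +1 (omega_1), y = -1 (omega_2)
            if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
                mistakes += 1
                for j, xj in enumerate(x):
                    grad[j] += -y * xj       # delta_x * x
        if mistakes == 0:                    # Y is empty: solution found
            break
        w = [wj - rho * gj for wj, gj in zip(w, grad)]
    return w

X = [[0.0, 1.0, 1.0], [1.0, 2.0, 1.0],   # omega_1 (augmented)
     [2.0, 0.0, 1.0], [3.0, 1.0, 1.0]]   # omega_2 (augmented)
labels = [1, 1, -1, -1]
w = perceptron(X, labels)
# On convergence, every training vector satisfies y * w^T x > 0
```

Using the label y = ±1 folds the two signs of δ_x into the single test y · w^T x ≤ 0.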
An example: a misclassified x pulls the weight vector towards it,

    w(t+1) = w(t) + ρ x

The perceptron algorithm converges in a finite number of iteration steps to a solution if the sequence ρ_t is chosen so that

    lim_{k→∞} Σ_{t=0}^{k} ρ_t = ∞

    lim_{k→∞} (Σ_{t=0}^{k} ρ_t^2) / (Σ_{t=0}^{k} ρ_t)^2 = 0

    e.g.:  ρ_t = c / t
A useful variant of the perceptron algorithm

It is a reward and punishment type of algorithm:

    w(t+1) = w(t) + ρ x_(t)   if x_(t) ∈ ω1 and w^T(t) x_(t) ≤ 0
    w(t+1) = w(t) − ρ x_(t)   if x_(t) ∈ ω2 and w^T(t) x_(t) ≥ 0
    w(t+1) = w(t)             otherwise
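This sample-by-sample scheme is equally short in Python. The sketch below uses illustrative data; with labels y = ±1, both correction branches collapse into the single test y · w^T(t) x_(t) ≤ 0:

```python
def perceptron_online(X, labels, rho=0.7, epochs=100):
    # Reward-punishment perceptron: correct w only when the
    # currently presented sample x_(t) is misclassified.
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        changed = False
        for x, y in zip(X, labels):  # y = +1 (omega_1), y = -1 (omega_2)
            if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
                w = [wi + rho * y * xi for wi, xi in zip(w, x)]
                changed = True
        if not changed:              # a full pass with no corrections: done
            break
    return w

X = [[0.0, 1.0, 1.0], [1.0, 2.0, 1.0],   # omega_1 (augmented)
     [2.0, 0.0, 1.0], [3.0, 1.0, 1.0]]   # omega_2 (augmented)
labels = [1, 1, -1, -1]
w = perceptron_online(X, labels)
```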
The perceptron

[Figure: a neuron that forms w^T x + w0 from its inputs, with synapses or synaptic weights wi's and threshold w0]

It is a learning machine that learns from the training vectors via the perceptron algorithm.

The network is called perceptron or neuron.
Example: At some stage t the perceptron algorithm results in

    w1 = 1, w2 = 1, w0 = −0.5

The corresponding hyperplane is

    x1 + x2 − 0.5 = 0

With ρ = 0.7, updating with the misclassified vectors [−0.2, 0.75]^T (from ω2) and [0.4, 0.05]^T (from ω1):

    w(t+1) = [1, 1, −0.5]^T + 0.7(−1)[−0.2, 0.75, 1]^T + 0.7(+1)[0.4, 0.05, 1]^T
           = [1.42, 0.51, −0.5]^T
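The arithmetic of this update can be checked directly (a short sketch; the ω2 vector enters with a minus sign and the ω1 vector with a plus sign, as the algorithm prescribes):

```python
rho = 0.7
w = [1.0, 1.0, -0.5]        # current estimate [w1, w2, w0]
x_a = [-0.2, 0.75, 1.0]     # misclassified vector from omega_2 -> subtract rho*x
x_b = [0.4, 0.05, 1.0]      # misclassified vector from omega_1 -> add rho*x
w_next = [wi - rho * xa + rho * xb for wi, xa, xb in zip(w, x_a, x_b)]
print(w_next)               # approximately [1.42, 0.51, -0.5]
```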
Least Squares Methods

If classes are linearly separable, the perceptron output results in ±1.

If classes are NOT linearly separable, we shall compute the weights w0, w1, ..., wl so that the difference between
• the actual output of the classifier, w^T x, and
• the desired outputs, e.g.,

    y = +1 if x ∈ ω1
    y = −1 if x ∈ ω2

is SMALL.
SMALL, in the mean square error sense, means to choose w so that the cost function

    J(w) = E[(y − x^T w)^2]

is minimum:

    ŵ = argmin_w J(w)

where y is the corresponding desired response.
Minimizing J(w) with respect to w results in:

    ∂J(w)/∂w = ∂/∂w E[(y − x^T w)^2] = −2 E[x (y − x^T w)] = 0
    ⟹ E[x x^T] ŵ = E[x y]
    ⟹ ŵ = R_x^{−1} E[x y]

where R_x is the autocorrelation matrix

    R_x = E[x x^T] = [ E[x1 x1]  ...  E[x1 xl]
                       .................
                       E[xl x1]  ...  E[xl xl] ]

and E[x y] the crosscorrelation vector

    E[x y] = [ E[x1 y], ..., E[xl y] ]^T
SMALL in the sum of error squares sense means

    J(w) = Σ_{i=1}^{N} (y_i − x_i^T w)^2

where (y_i, x_i), i = 1, ..., N, are the training pairs, that is, the input x_i and its corresponding class label y_i (±1). Then

    ∂J(w)/∂w = 0  ⟹  (Σ_{i=1}^{N} x_i x_i^T) ŵ = Σ_{i=1}^{N} x_i y_i
Pseudoinverse Matrix

Define

    X = [ x_1^T
          x_2^T
          ...
          x_N^T ]               (an N×l matrix)

    X^T = [x_1, x_2, ..., x_N]  (an l×N matrix)

    y = [y_1, ..., y_N]^T       (the corresponding desired responses)

Then

    X^T X = Σ_{i=1}^{N} x_i x_i^T,    X^T y = Σ_{i=1}^{N} x_i y_i
Thus

    (X^T X) ŵ = X^T y  ⟹  ŵ = (X^T X)^{−1} X^T y = X^+ y

where

    X^+ ≡ (X^T X)^{−1} X^T    is the Pseudoinverse of X

Assume N = l. Then X is square and (assumed) invertible, and

    X^+ = (X^T X)^{−1} X^T = X^{−1} (X^T)^{−1} X^T = X^{−1}

so ŵ = X^{−1} y.
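The identity X^+ = X^{−1} for a square invertible X is easy to verify numerically. A minimal sketch with a hand-picked 2×2 matrix (helper names are illustrative):

```python
def matmul(A, B):
    # Plain triple-loop matrix product
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(M):
    return [list(row) for row in zip(*M)]

def inv2(M):
    # Closed-form inverse of a 2x2 matrix
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

X = [[2.0, 1.0], [1.0, 1.0]]              # square and invertible
Xt = transpose(X)
X_pinv = matmul(inv2(matmul(Xt, X)), Xt)  # X^+ = (X^T X)^{-1} X^T
print(X_pinv)                             # equals X^{-1} = [[1, -1], [-1, 2]]
```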
Assume N > l. Then, in general, there is no solution to satisfy all equations simultaneously:

    X w = y :   x_1^T w = y_1
                x_2^T w = y_2
                ....
                x_N^T w = y_N     (N equations, l unknowns)

The "solution" ŵ = X^+ y corresponds to the minimum sum of error squares solution.
Example:

    ω1: [0.4, 0.5]^T, [0.6, 0.5]^T, [0.1, 0.4]^T, [0.2, 0.7]^T, [0.3, 0.3]^T
    ω2: [0.4, 0.6]^T, [0.6, 0.2]^T, [0.7, 0.4]^T, [0.8, 0.6]^T, [0.7, 0.5]^T

Augmenting each vector with a third coordinate equal to 1, and taking the desired response +1 for ω1 and −1 for ω2:

    X = [ 0.4  0.5  1          y = [  1
          0.6  0.5  1                 1
          0.1  0.4  1                 1
          0.2  0.7  1                 1
          0.3  0.3  1                 1
          0.4  0.6  1                −1
          0.6  0.2  1                −1
          0.7  0.4  1                −1
          0.8  0.6  1                −1
          0.7  0.5  1 ],            −1 ]
    X^T X = [ 2.8   2.24  4.8          X^T y = [ −1.6
              2.24  2.41  4.7                     0.1
              4.8   4.7   10  ],                  0.0 ]

    ŵ = (X^T X)^{−1} X^T y ≈ [ −3.22
                                0.24
                                1.43 ]
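These numbers can be reproduced by solving the 3×3 normal equations directly. The sketch below applies Cramer's rule to the matrices shown above:

```python
def det3(M):
    # Determinant of a 3x3 matrix by cofactor expansion
    (a, b, c), (d, e, f), (g, h, i) = M
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def replace_col(M, col, v):
    return [[v[r] if c == col else M[r][c] for c in range(3)] for r in range(3)]

A = [[2.8, 2.24, 4.8],
     [2.24, 2.41, 4.7],
     [4.8, 4.7, 10.0]]     # X^T X
b = [-1.6, 0.1, 0.0]       # X^T y

d = det3(A)
w = [det3(replace_col(A, j, b)) / d for j in range(3)]
print([round(wi, 3) for wi in w])   # approximately [-3.218, 0.241, 1.431]
```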
SUPPORT VECTOR MACHINES

The goal: Given two linearly separable classes, design the classifier

    g(x) = w^T x + w0 = 0

that leaves the maximum margin from both classes.
Margin: Each hyperplane is characterized by
• its direction in space, i.e., w
• its position in space, i.e., w0
• For EACH direction, choose the hyperplane that leaves the SAME distance from the nearest points from each class. The margin is twice this distance.
The distance of a point x̂ from a hyperplane is given by

    z = |g(x̂)| / ||w||

Scale w, w0 so that at the nearest points from each class the discriminant value is ±1:

    g(x) = 1 for the nearest point(s) in ω1,  g(x) = −1 for the nearest point(s) in ω2

Thus the margin is given by

    1/||w|| + 1/||w|| = 2/||w||

Also, the following is valid:

    w^T x + w0 ≥ 1,   ∀ x ∈ ω1
    w^T x + w0 ≤ −1,  ∀ x ∈ ω2
SVM (linear) classifier:

    g(x) = w^T x + w0

Minimize

    J(w) = (1/2) ||w||^2

subject to

    y_i (w^T x_i + w0) ≥ 1,  i = 1, 2, ..., N

    y_i = +1 for x_i ∈ ω1,  y_i = −1 for x_i ∈ ω2

This is because, by minimizing ||w||, the margin 2/||w|| is maximised.
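A tiny worked instance makes the formulation concrete. For two illustrative points, x1 = [1, 1]^T ∈ ω1 and x2 = [−1, −1]^T ∈ ω2, the max-margin solution can be derived by hand as w = [0.5, 0.5]^T, w0 = 0; the sketch below checks the constraints and the resulting margin:

```python
import math

X = [[1.0, 1.0], [-1.0, -1.0]]
y = [1, -1]
w, w0 = [0.5, 0.5], 0.0      # max-margin solution, derived by hand

# Both constraints y_i (w^T x_i + w0) >= 1 hold with equality,
# so both points are support vectors.
for xi, yi in zip(X, y):
    gi = sum(a * b for a, b in zip(w, xi)) + w0
    assert abs(yi * gi - 1.0) < 1e-12

margin = 2.0 / math.sqrt(sum(wi * wi for wi in w))
print(margin)    # 2*sqrt(2): the full distance between the two points
```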
The above is a quadratic optimization task subject to a set of linear inequality constraints. The Karush-Kuhn-Tucker conditions state that the minimizer satisfies:

• (1) ∂/∂w L(w, w0, λ) = 0
• (2) ∂/∂w0 L(w, w0, λ) = 0
• (3) λ_i ≥ 0,  i = 1, 2, ..., N
• (4) λ_i [y_i (w^T x_i + w0) − 1] = 0,  i = 1, 2, ..., N
• where L(·,·,·) is the Lagrangian

    L(w, w0, λ) = (1/2) w^T w − Σ_{i=1}^{N} λ_i [y_i (w^T x_i + w0) − 1]
The solution: from the above, it turns out that

•   w = Σ_{i=1}^{N} λ_i y_i x_i

•   Σ_{i=1}^{N} λ_i y_i = 0
Remarks:
• The Lagrange multipliers can be either zero or positive. Thus,

    w = Σ_{i=1}^{Ns} λ_i y_i x_i

  where Ns ≤ N, corresponding to the positive Lagrange multipliers.
  – From constraint (4) above, i.e.,

        λ_i [y_i (w^T x_i + w0) − 1] = 0,  i = 1, 2, ..., N

    the vectors contributing to w satisfy

        w^T x_i + w0 = ±1
– These vectors are known as SUPPORT VECTORS and are the closest vectors, from each class, to the classifier.
– Once w is computed, w0 is determined from conditions (4).
– The optimal hyperplane classifier of a support vector machine is UNIQUE.
Dual Problem Formulation
• The SVM formulation is a convex programming problem, with
  – convex cost function
  – convex region of feasible solutions
• Thus, its solution can be achieved by its dual problem, i.e.,
  – maximize  L(w, w0, λ)
  – subject to

        w = Σ_{i=1}^{N} λ_i y_i x_i

        Σ_{i=1}^{N} λ_i y_i = 0

        λ ≥ 0
• Combine the above to obtain:
  – maximize

        Σ_{i=1}^{N} λ_i − (1/2) Σ_{i,j} λ_i λ_j y_i y_j x_i^T x_j

  – subject to

        Σ_{i=1}^{N} λ_i y_i = 0

        λ ≥ 0
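For a hand-worked two-point toy problem, x1 = [1, 1]^T ∈ ω1 and x2 = [−1, −1]^T ∈ ω2 (illustrative data, with multipliers λ = (0.25, 0.25) solved by hand), the sketch below checks the dual constraints, reconstructs w, and confirms that the dual objective equals the primal optimum (1/2)||w||^2:

```python
X = [[1.0, 1.0], [-1.0, -1.0]]
y = [1, -1]
lam = [0.25, 0.25]           # Lagrange multipliers, solved by hand

# w = sum_i lambda_i y_i x_i
w = [sum(l * yi * xi[j] for l, yi, xi in zip(lam, y, X)) for j in range(2)]

# Equality constraint: sum_i lambda_i y_i = 0
assert sum(l * yi for l, yi in zip(lam, y)) == 0.0

dot = lambda a, b: sum(p * q for p, q in zip(a, b))
# Dual objective: sum_i lambda_i - 0.5 * sum_ij lambda_i lambda_j y_i y_j x_i^T x_j
dual = sum(lam) - 0.5 * sum(lam[i] * lam[j] * y[i] * y[j] * dot(X[i], X[j])
                            for i in range(2) for j in range(2))
primal = 0.5 * dot(w, w)
print(w, dual, primal)       # [0.5, 0.5], 0.25, 0.25 -- strong duality holds
```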
Remarks:
• Support vectors enter via inner products.
• Although the solution w is unique, the Lagrange multipliers ARE NOT.

Non-Separable Classes
In this case there is no hyperplane such that

    y (w^T x + w0) > 1,  ∀ x

• Recall that the margin is defined as twice the distance between the following two hyperplanes:

    w^T x + w0 = 1   and   w^T x + w0 = −1
The training vectors belong to one of three possible categories:

• A. Vectors outside the band which are correctly classified, i.e.,

        y_i (w^T x + w0) > 1

• B. Vectors inside the band which are correctly classified, i.e.,

        0 ≤ y_i (w^T x + w0) < 1

• C. Vectors misclassified, i.e.,

        y_i (w^T x + w0) < 0
All three cases above can be represented as

    y_i (w^T x + w0) ≥ 1 − ξ_i

• A.  ξ_i = 0
• B.  0 < ξ_i ≤ 1
• C.  ξ_i > 1

The ξ_i are known as slack variables.
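The three categories correspond to slack values ξ_i = max(0, 1 − y_i g(x_i)), which a short sketch can make concrete (the separator and the sample points are illustrative):

```python
w, w0 = [1.0, 1.0], -0.5     # hypothetical separator

def category(x, y):
    # Slack value xi = max(0, 1 - y * g(x)) decides the category
    g = sum(a * b for a, b in zip(w, x)) + w0
    slack = max(0.0, 1.0 - y * g)
    if slack == 0.0:
        return "A"           # outside the band, correctly classified
    if slack <= 1.0:
        return "B"           # inside the band, still correctly classified
    return "C"               # misclassified

print(category([2.0, 2.0], 1))    # A: y*g(x) = 3.5 > 1
print(category([0.5, 0.5], 1))    # B: y*g(x) = 0.5, slack 0.5
print(category([0.0, 0.0], 1))    # C: y*g(x) = -0.5, slack 1.5
```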
The goal of the optimization is now two-fold:
• Maximize the margin
• Minimize the number of patterns with ξ_i > 0

That is,

    J(w, w0, ξ) = (1/2) ||w||^2 + C Σ_{i=1}^{N} I(ξ_i)

where C is a constant and

    I(ξ_i) = 1 if ξ_i > 0,   I(ξ_i) = 0 if ξ_i = 0

• I(·) is not differentiable. In practice, we use the approximation

    J(w, w0, ξ) = (1/2) ||w||^2 + C Σ_{i=1}^{N} ξ_i

• Following a similar procedure as before, we obtain the
KKT conditions:

    (1)  w = Σ_{i=1}^{N} λ_i y_i x_i
    (2)  Σ_{i=1}^{N} λ_i y_i = 0
    (3)  C − μ_i − λ_i = 0,  i = 1, 2, ..., N
    (4)  λ_i [y_i (w^T x_i + w0) − 1 + ξ_i] = 0,  i = 1, 2, ..., N
    (5)  μ_i ξ_i = 0,  i = 1, 2, ..., N
    (6)  μ_i, λ_i ≥ 0,  i = 1, 2, ..., N
The associated dual problem:

Maximize

    Σ_{i=1}^{N} λ_i − (1/2) Σ_{i,j} λ_i λ_j y_i y_j x_i^T x_j

subject to

    0 ≤ λ_i ≤ C,  i = 1, 2, ..., N

    Σ_{i=1}^{N} λ_i y_i = 0

Remarks: The only difference from the separable class case is the presence of C in the constraints.