Linear Discrimination Functions
Corso di Apprendimento Automatico, Laurea Magistrale in Informatica
Nicola Fanizzi
Dipartimento di Informatica, Università degli Studi di Bari
November 4, 2009
Corso di Apprendimento Automatico Linear Discrimination Functions
Outline
Linear models
Gradient descent
Perceptron
Minimum squared error approach
Linear and logistic regression
Linear Discriminant Functions I
A linear discriminant function can be written as

g(\vec{x}) = w_1 x_1 + \cdots + w_d x_d + w_0 = \vec{w}^t \vec{x} + w_0

where \vec{w} is the weight vector and w_0 is the bias (or threshold).

A 2-class linear classifier implements the decision rule:

Decide ω_1 if g(\vec{x}) > 0 and ω_2 if g(\vec{x}) < 0
Linear Discriminant Functions II
The equation g(\vec{x}) = 0 defines the decision surface that separates points assigned to ω_1 from points assigned to ω_2.

When g(\vec{x}) is linear, this decision surface is a hyperplane (H).
Linear Discriminant Functions III
H divides the feature space into 2 half-spaces: R_1 for ω_1 and R_2 for ω_2.

If \vec{x}_1 and \vec{x}_2 are both on the decision surface:

\vec{w}^t \vec{x}_1 + w_0 = \vec{w}^t \vec{x}_2 + w_0 ⇒ \vec{w}^t (\vec{x}_1 − \vec{x}_2) = 0

Hence \vec{w} is normal to any vector lying in the hyperplane.
Linear Discriminant Functions IV
If we express \vec{x} as

\vec{x} = \vec{x}_p + r \frac{\vec{w}}{\|\vec{w}\|}

where \vec{x}_p is the normal projection of \vec{x} onto H, and r is the algebraic distance from \vec{x} to the hyperplane.

Since g(\vec{x}_p) = 0, we have g(\vec{x}) = \vec{w}^t \vec{x} + w_0 = r \|\vec{w}\|, i.e. r = g(\vec{x}) / \|\vec{w}\|.

r is a signed distance: r > 0 if \vec{x} falls in R_1, r < 0 if \vec{x} falls in R_2.

The distance from the origin to the hyperplane is w_0 / \|\vec{w}\|.
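The signed distance r = g(\vec{x}) / \|\vec{w}\| can be checked numerically; a minimal sketch in plain Python (the weight vector and the test points are made-up examples, not from the slides):

```python
# Signed distance r = g(x)/||w|| of a point from the hyperplane g(x) = w.x + w0 = 0
import math

def g(w, w0, x):
    # linear discriminant function
    return sum(wi * xi for wi, xi in zip(w, x)) + w0

def signed_distance(w, w0, x):
    return g(w, w0, x) / math.sqrt(sum(wi * wi for wi in w))

w, w0 = [3.0, 4.0], -5.0                     # ||w|| = 5, hyperplane 3x1 + 4x2 - 5 = 0
print(signed_distance(w, w0, [3.0, 4.0]))    # 20/5 = 4.0  -> point in R1
print(signed_distance(w, w0, [0.0, 0.0]))    # -5/5 = -1.0 -> point in R2
```

Note that the origin comes out at distance w_0/\|\vec{w}\| = −1, matching the formula above.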
Multi-category Case I
2 approaches to extend the LDF approach to the multi-category case:

ω_i / not ω_i: reduce the problem to c − 1 two-class problems. Problem #i: find the function that separates points assigned to ω_i from those not assigned to ω_i.

ω_i / ω_j: find the c(c − 1)/2 linear discriminants, one for every pair of classes.

Both approaches can lead to regions in which the classification is undefined.
Pairwise Classification
Idea: build a model for each pair of classes, using only the training data from those two classes.
Problem: this means solving c(c − 1)/2 classification problems for c classes.
This turns out not to be a problem in many cases, because the training sets become small:

Assume the data is evenly distributed, i.e. 2n/c instances per learning problem for n instances in total.
Suppose the learning algorithm is linear in n.
Then the runtime of pairwise classification is proportional to

\frac{c(c − 1)}{2} × \frac{2n}{c} = (c − 1)n
Linear Machine I
Define c linear discriminant functions:

g_i(\vec{x}) = \vec{w}_i^t \vec{x} + w_{i0},  i = 1, . . . , c

Linear Machine classifier: \vec{x} ∈ ω_i if g_i(\vec{x}) > g_j(\vec{x}) for all j ≠ i.
In case of equal scores, the classification is undefined.

A LM divides the feature space into c decision regions, with g_i(\vec{x}) the largest discriminant if \vec{x} is in R_i.

If R_i and R_j are contiguous, the boundary between them is a portion of the hyperplane H_{ij} defined by:

g_i(\vec{x}) = g_j(\vec{x}), i.e. (\vec{w}_i − \vec{w}_j)^t \vec{x} + (w_{i0} − w_{j0}) = 0
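A linear machine is just an argmax over the c discriminants; a minimal sketch in Python (the weights and bias values are illustrative, not from the slides):

```python
# Linear machine: assign x to the class whose discriminant g_i(x) = w_i.x + w_i0 is largest
def linear_machine(weights, biases, x):
    scores = [sum(wi * xi for wi, xi in zip(w, x)) + b
              for w, b in zip(weights, biases)]
    return max(range(len(scores)), key=lambda i: scores[i])

# three classes in 2D (made-up weights)
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b = [0.0, 0.0, 0.5]
print(linear_machine(W, b, [2.0, 0.1]))   # g_0 = 2.0 is the largest -> class 0
print(linear_machine(W, b, [0.1, 2.0]))   # g_1 = 2.0 is the largest -> class 1
```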
Linear Machine II

It follows that \vec{w}_i − \vec{w}_j is normal to H_{ij}. The signed distance from \vec{x} to H_{ij} is:

(g_i(\vec{x}) − g_j(\vec{x})) / \|\vec{w}_i − \vec{w}_j\|

There are c(c − 1)/2 pairs of regions, and the decision regions are convex.
Not all regions are contiguous, and the total number of segments in the separating surfaces is often less than c(c − 1)/2.

(Figure: 3- and 5-class problems)
Generalized LDF I
The LDF is g(\vec{x}) = w_0 + \sum_{i=1}^d w_i x_i

Adding d(d + 1)/2 terms involving the products of pairs of components of \vec{x} yields the quadratic discriminant function:

g(\vec{x}) = w_0 + \sum_{i=1}^d w_i x_i + \sum_{i=1}^d \sum_{j=1}^d w_{ij} x_i x_j

The separating surface defined by g(\vec{x}) = 0 is a second-degree or hyperquadric surface.

Adding more terms, such as w_{ijk} x_i x_j x_k, we obtain polynomial discriminant functions.
Generalized LDF II
The generalized LDF is defined as

g(\vec{x}) = \sum_{i=1}^{\hat{d}} a_i y_i(\vec{x}) = \vec{a}^t \vec{y}

where \vec{a} is a \hat{d}-dimensional weight vector and the y_i(\vec{x}) are arbitrary functions of \vec{x}.

The resulting discriminant function is not linear in \vec{x}, but it is linear in \vec{y}.

The functions y_i(\vec{x}) map points in the d-dimensional \vec{x}-space to points in the \hat{d}-dimensional \vec{y}-space.
Generalized LDF III
Example: let the QDF be g(x) = a_1 + a_2 x + a_3 x^2

The 3-dimensional vector is then \vec{y} = (1, x, x^2)^t
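This mapping can be written directly: a discriminant that is quadratic in x becomes linear in \vec{y}. A minimal sketch (the coefficient values are illustrative):

```python
# Generalized LDF: map x to y(x) = (1, x, x^2), then apply a linear discriminant in y-space
def phi(x):
    # the feature map y(x)
    return [1.0, x, x * x]

def g(a, x):
    # linear in y, quadratic in x
    return sum(ai * yi for ai, yi in zip(a, phi(x)))

a = [-1.0, 0.0, 1.0]      # g(x) = -1 + x^2: positive exactly when |x| > 1
print(g(a, 2.0))          # 3.0
print(g(a, 0.5))          # -0.75
```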
2-class Linearly-Separable Case I
g(\vec{x}) = \sum_{i=0}^d w_i x_i = \vec{a}^t \vec{y}

where x_0 = 1 and
\vec{y}^t = [1 \vec{x}] = [1 x_1 · · · x_d] is an augmented feature vector, and
\vec{a}^t = [w_0 \vec{w}] = [w_0 w_1 · · · w_d] is an augmented weight vector.

The hyperplane decision surface H defined by \vec{a}^t \vec{y} = 0 passes through the origin in \vec{y}-space.

The distance from any point \vec{y} to H is given by \vec{a}^t \vec{y} / \|\vec{a}\| = g(\vec{x}) / \|\vec{a}\|.

Because \|\vec{a}\| = \sqrt{1 + \|\vec{w}\|^2}, this distance is less than the distance from \vec{x} to H.
2-class Linearly-Separable Case II
Problem: find \vec{a} = [w_0 \vec{w}]

Suppose we have a set of n examples {\vec{y}_1, . . . , \vec{y}_n}, each labeled ω_1 or ω_2.

Look for a weight vector \vec{a} that classifies all the examples correctly:

\vec{a}^t \vec{y}_i > 0 if \vec{y}_i is labeled ω_1, or
\vec{a}^t \vec{y}_i < 0 if \vec{y}_i is labeled ω_2

If such an \vec{a} exists, the examples are linearly separable.
2-class Linearly-Separable Case III
Solutions

Replacing all the examples labeled ω_2 by their negatives, one can look for a weight vector \vec{a} such that \vec{a}^t \vec{y}_i > 0 for all the examples. Such an \vec{a} is a.k.a. a separating vector or solution vector.

Each example \vec{y}_i places a constraint on the possible location of a solution vector: \vec{a}^t \vec{y}_i = 0 defines a hyperplane through the origin having \vec{y}_i as a normal vector.

The solution vector (if it exists) must be on the positive side of every hyperplane.

Solution region = intersection of the n half-spaces.
2-class Linearly-Separable Case IV
Any vector that lies in the solution region is a solution vector: the solution vector (if it exists) is not unique.
Additional requirements can be imposed to find a solution vector closer to the middle of the region (i.e. more likely to classify new examples correctly).
For example, seek a unit-length weight vector that maximizes the minimum distance from the examples to the hyperplane.
2-class Linearly-Separable Case V
Seek the minimum-length weight vector satisfying

\vec{a}^t \vec{y}_i ≥ b > 0

The solution region shrinks by the margins b / \|\vec{y}_i\|.
Gradient Descent I
Define a criterion function J(\vec{a}) that is minimized when \vec{a} is a solution vector: \vec{a}^t \vec{y}_i ≥ 0, ∀i = 1, . . . , n

Start with some arbitrary vector \vec{a}(1).
Compute the gradient vector ∇J(\vec{a}(1)).
The next value \vec{a}(2) is obtained by moving some distance from \vec{a}(1) in the direction of steepest descent, i.e. along the negative of the gradient.

In general, \vec{a}(k + 1) is obtained from \vec{a}(k) using

\vec{a}(k + 1) ← \vec{a}(k) − η(k) ∇J(\vec{a}(k))

where η(k) is the learning rate.
Gradient Descent & Delta Rule I
To understand the idea, consider a simpler linear machine (a.k.a. linear unit), where

o = w_0 + w_1 x_1 + · · · + w_n x_n

Let's learn the w_i's that minimize the squared error, i.e. J(\vec{w}) = E[\vec{w}]:

E[\vec{w}] ≡ \frac{1}{2} \sum_{d ∈ D} (t_d − o_d)^2

where D is the set of training examples ⟨\vec{x}, t⟩ and t is the target output value.
Gradient Descent & Delta Rule II
Gradient:

∇E[\vec{w}] ≡ [\frac{∂E}{∂w_0}, \frac{∂E}{∂w_1}, · · · , \frac{∂E}{∂w_n}]

Training rule:

Δ\vec{w} = −η ∇E[\vec{w}]

i.e.,

Δw_i = −η \frac{∂E}{∂w_i}

Note that η may be a constant.
Gradient Descent & Delta Rule III
\frac{∂E}{∂w_i} = \frac{∂}{∂w_i} \frac{1}{2} \sum_d (t_d − o_d)^2
              = \frac{1}{2} \sum_d \frac{∂}{∂w_i} (t_d − o_d)^2
              = \frac{1}{2} \sum_d 2 (t_d − o_d) \frac{∂}{∂w_i} (t_d − o_d)
              = \sum_d (t_d − o_d) \frac{∂}{∂w_i} (t_d − \vec{w} · \vec{x}_d)

\frac{∂E}{∂w_i} = \sum_d (t_d − o_d)(−x_{id})
Basic GRADIENT-DESCENT Algorithm
GRADIENT-DESCENT(D, η)
D: training set; η: learning rate (e.g. 0.5)

Initialize each w_i to some small random value
until the termination condition is met do
    Initialize each Δw_i to zero
    for each ⟨\vec{x}, t⟩ ∈ D do
        Input the instance \vec{x} to the unit and compute the output o
        for each w_i do
            Δw_i ← Δw_i + η(t − o) x_i
    for each weight w_i do
        w_i ← w_i + Δw_i
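The pseudocode above translates almost line by line into Python; a minimal sketch for a linear unit on a made-up noise-free 1D dataset (the learning rate and epoch count are illustrative choices, not from the slides):

```python
# Batch gradient descent (delta rule) for a linear unit o = w0*x0 + w1*x1, with x0 = 1
def gradient_descent(D, eta=0.05, epochs=500):
    w = [0.0, 0.0]                          # initial weights
    for _ in range(epochs):
        dw = [0.0, 0.0]
        for x, t in D:                      # accumulate eta*(t - o)*xi over all examples
            o = sum(wi * xi for wi, xi in zip(w, x))
            for i in range(len(w)):
                dw[i] += eta * (t - o) * x[i]
        for i in range(len(w)):             # one batch update per epoch
            w[i] += dw[i]
    return w

# targets generated by t = 1 + 2*x1 (noise-free), so GD should recover w close to [1, 2]
D = [([1.0, x], 1.0 + 2.0 * x) for x in [0.0, 1.0, 2.0, 3.0]]
print(gradient_descent(D))
```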
Incremental (Stochastic) GRADIENT DESCENT I
An approximation of the standard GRADIENT-DESCENT:

Batch GRADIENT-DESCENT:
Do until satisfied
    1. Compute the gradient ∇E_D[\vec{w}]
    2. \vec{w} ← \vec{w} − η ∇E_D[\vec{w}]

Incremental GRADIENT-DESCENT:
Do until satisfied
    For each training example d in D
        1. Compute the gradient ∇E_d[\vec{w}]
        2. \vec{w} ← \vec{w} − η ∇E_d[\vec{w}]
Incremental (Stochastic) GRADIENT DESCENT II
E_D[\vec{w}] ≡ \frac{1}{2} \sum_{d ∈ D} (t_d − o_d)^2

E_d[\vec{w}] ≡ \frac{1}{2} (t_d − o_d)^2

Training rule (delta rule):

Δw_i ← η(t − o) x_i

Similar to the perceptron training rule, yet unthresholded.
Convergence is only asymptotically guaranteed.
Linear separability is no longer needed!
Standard vs. Stochastic GRADIENT-DESCENT
Incremental GD can approximate batch GD arbitrarily closely if η is made small enough.

In standard GD the error is summed over all examples before updating the weights; in stochastic GD the weights are updated upon each example.
Standard GD is more costly per update step but can employ a larger η.
Stochastic GD may avoid falling into local minima because it uses E_d instead of E_D.
Newton’s Algorithm
J(\vec{a}) ≃ J(\vec{a}(k)) + ∇J^t (\vec{a} − \vec{a}(k)) + \frac{1}{2} (\vec{a} − \vec{a}(k))^t H (\vec{a} − \vec{a}(k))

where H = [\frac{∂^2 J}{∂a_i ∂a_j}] is the Hessian matrix.

Choose \vec{a}(k + 1) to minimize this function:

\vec{a}(k + 1) ← \vec{a}(k) − H^{−1} ∇J(\vec{a})

Greater improvement per step than GD, but not applicable when H is singular.
Time complexity O(d^3).
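For a quadratic criterion the Newton step lands on the minimum in a single iteration; a minimal sketch with a hand-coded Hessian inverse (the quadratic J below is an illustrative choice, not from the slides):

```python
# Newton's step a(k+1) = a(k) - H^{-1} grad J(a(k)) on an illustrative quadratic
# J(a) = (a1 - 1)^2 + 2*(a2 + 3)^2, minimized at (1, -3)
def grad(a):
    return [2.0 * (a[0] - 1.0), 4.0 * (a[1] + 3.0)]

H_inv = [[0.5, 0.0], [0.0, 0.25]]          # inverse of the (diagonal) Hessian diag(2, 4)

def newton_step(a):
    g = grad(a)
    return [a[i] - sum(H_inv[i][j] * g[j] for j in range(2)) for i in range(2)]

print(newton_step([0.0, 0.0]))             # one step reaches the minimum: [1.0, -3.0]
```

Since J is exactly quadratic here, the second-order expansion is exact and one step suffices; gradient descent would need many iterations.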
Perceptron I
Assumption: the data is linearly separable.

Hyperplane: \sum_{i=0}^d w_i x_i = 0, assuming there is a constant attribute x_0 = 1 (bias).

Algorithm for learning the separating hyperplane: the perceptron learning rule.

Classifier: if \sum_{i=0}^d w_i x_i > 0 then predict ω_1 (or +1), otherwise predict ω_2 (or −1).
Perceptron II
Thresholded output:

o(x_1, . . . , x_n) = +1 if w_0 + w_1 x_1 + · · · + w_n x_n > 0, −1 otherwise.

Simpler vector notation:

o(\vec{x}) = sgn(\vec{w}^t \vec{x}) = +1 if \vec{w}^t \vec{x} > 0, −1 otherwise.

Space of the hypotheses: {\vec{w} | \vec{w} ∈ R^{n+1}}
Decision Surface of a Perceptron
A perceptron can represent some useful functions.
What weights represent g(x_1, x_2) = AND(x_1, x_2)? (One choice: w_0 = −0.8, w_1 = w_2 = 0.5.)

But some functions are not representable, e.g. those that are not linearly separable (XOR).
Therefore, we'll want networks of these units...
Perceptron Training Rule I
Perceptron criterion function:

J(\vec{a}) = \sum_{\vec{y} ∈ Y(\vec{a})} (−\vec{a}^t \vec{y})

where Y(\vec{a}) is the set of examples misclassified by \vec{a}.
If no examples are misclassified, Y(\vec{a}) is empty and J(\vec{a}) = 0 (i.e. \vec{a} is a solution vector).
J(\vec{a}) ≥ 0, since \vec{a}^t \vec{y}_i ≤ 0 if \vec{y}_i is misclassified.

Geometrically, J(\vec{a}) is proportional to the sum of the distances from the misclassified examples to the decision boundary.

Since ∇J = \sum_{\vec{y} ∈ Y(\vec{a})} (−\vec{y}), the update rule becomes

\vec{a}(k + 1) ← \vec{a}(k) + η(k) \sum_{\vec{y} ∈ Y_k(\vec{a})} \vec{y}

where Y_k(\vec{a}) is the set of examples misclassified by \vec{a}(k).
Perceptron Training I
Set all coefficients a_i to zero
do
    for each instance \vec{y} in the training data
        if \vec{y} is classified incorrectly by the perceptron
            if \vec{y} belongs to ω_1, add it to \vec{a}
            else subtract it from \vec{a}
until all instances in the training data are classified correctly
return \vec{a}
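The loop above, with the ω_2 examples replaced by their negatives (so every example should satisfy \vec{a}^t \vec{y} > 0), can be sketched as follows; the toy dataset is made up and linearly separable:

```python
# Fixed-increment perceptron: add each misclassified (sign-normalized) example to a
def perceptron(Y, max_epochs=100):
    a = [0.0] * len(Y[0])                  # start from the zero vector
    for _ in range(max_epochs):
        errors = 0
        for y in Y:                        # y is already negated for omega_2 examples
            if sum(ai * yi for ai, yi in zip(a, y)) <= 0:
                a = [ai + yi for ai, yi in zip(a, y)]
                errors += 1
        if errors == 0:                    # all a.y > 0: solution vector found
            return a
    return a

# augmented examples [1, x1, x2]; the two omega_2 examples appear negated
Y = [[1, 2.0, 1.0], [1, 1.5, 2.0], [-1, 1.0, -2.0], [-1, 0.5, -1.5]]
a = perceptron(Y)
print(a, all(sum(ai * yi for ai, yi in zip(a, y)) > 0 for y in Y))
```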
Perceptron Training II
BATCH PERCEPTRON TRAINING

Initialize \vec{a}, η, θ, k ← 0
do
    k ← k + 1
    \vec{a} ← \vec{a} + η(k) \sum_{\vec{y} ∈ Y_k} \vec{y}
until |η(k) \sum_{\vec{y} ∈ Y_k} \vec{y}| < θ
return \vec{a}

Convergence can be proved, provided that the training data is linearly separable and η is sufficiently small.
Perceptron Training III
Why does this work?
Consider the situation where an instance pertaining to the first class has been added:

(a_0 + y_0) y_0 + (a_1 + y_1) y_1 + (a_2 + y_2) y_2 + . . . + (a_d + y_d) y_d

This means the output for \vec{a} has increased by:

y_0 y_0 + y_1 y_1 + y_2 y_2 + . . . + y_d y_d

which is always positive; thus the hyperplane has moved in the correct direction (and the output decreases for instances of the other class).
Perceptron Training IV
Example with η = 1 and \vec{a}(1) = \vec{0}.
Sequence of misclassified instances: \vec{y}_1 + \vec{y}_2 + \vec{y}_3, \vec{y}_2, \vec{y}_3, \vec{y}_1, \vec{y}_3, then stop.
Perceptron
Simplification:

FIXED-INCREMENT SINGLE-EXAMPLE PERCEPTRON

input: {\vec{y}^{(k)}}_{k=1}^n training examples
begin initialize \vec{a}, k ← 0
    do k ← (k + 1) mod n
        if \vec{y}^{(k)} is misclassified by the model based on \vec{a}
        then \vec{a} ← \vec{a} + \vec{y}^{(k)}
    until all examples properly classified
    return \vec{a}
end
Generalizations I
VARIABLE-INCREMENT PERCEPTRON WITH MARGIN

begin
    initialize \vec{a}, θ, margin b, η, k ← 0
    do
        k ← (k + 1) mod n
        if \vec{a}^t \vec{y}^{(k)} ≤ b then \vec{a} ← \vec{a} + η(k) \vec{y}^{(k)}
    until \vec{a}^t \vec{y}^{(k)} > b for all k
    return \vec{a}
end
Generalizations II
BATCH VARIABLE-INCREMENT PERCEPTRON

begin
    initialize \vec{a}, η, k ← 0
    do
        k ← (k + 1) mod n
        Y_k ← ∅
        j ← 0
        do
            j ← j + 1
            if \vec{y}_j is misclassified then Y_k ← Y_k ∪ {\vec{y}_j}
        until j = n
        \vec{a} ← \vec{a} + η(k) \sum_{\vec{y} ∈ Y_k} \vec{y}
    until Y_k = ∅
    return \vec{a}
end
Comments
The perceptron adjusts the parameters only when it encounters an error, i.e. a misclassified training example.
Correctly classified examples can be ignored.
The learning rate η can be chosen arbitrarily: it only impacts the norm of the final \vec{a} (and the corresponding magnitude of a_0).
The final weight vector \vec{a} is a linear combination of training points.
Nonseparable Case
The perceptron is an error-correcting procedure: it converges when the examples are linearly separable.
Even if a separating vector is found for the training examples, it does not follow that the resulting classifier will perform well on independent test data.
To ensure that the performance on training and test data will be similar, many training examples should be used.
Sufficiently large training sets are almost certainly not linearly separable.
No weight vector can correctly classify every example in a nonseparable set.
The corrections may never cease if the set is nonseparable.
Learning rate
If we choose η(k) → 0 as k → ∞, then performance can be acceptable on non-separable problems while preserving the ability to find a solution on separable problems.
η(k) can be considered a function of recent performance, decreasing as performance improves, e.g. η(k) ← η/k.
The rate at which η(k) approaches zero is important:

Too slow: the result will be sensitive to those examples that render the set non-separable.
Too fast: the procedure may converge prematurely with sub-optimal results.
Linear Models: WINNOW
Another mistake-driven algorithm for finding a separating hyperplane.
Assumes binary attributes (i.e. propositional variables).
Main difference: multiplicative instead of additive updates; weights are multiplied by a parameter α > 1 (or its inverse).
Another difference: a user-specified threshold parameter θ.
Predict the first class if w_0 + w_1 x_1 + w_2 x_2 + · · · + w_k x_k > θ.
The Algorithm I
WINNOW
initialize \vec{a}, α
while some instances are misclassified
    for each instance \vec{y} in the training data
        classify \vec{y} using the current model \vec{a}
        if the predicted class is incorrect
            if \vec{y} belongs to the target class
                for each attribute y_i = 1, multiply a_i by α
                (if y_i = 0, a_i is left unchanged)
            otherwise
                for each attribute y_i = 1, divide a_i by α
                (if y_i = 0, a_i is left unchanged)
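The loop above can be sketched as follows; the binary dataset (the target class is simply the value of the first attribute), α = 2 and θ = 1 are illustrative choices:

```python
# WINNOW: multiplicative weight updates over binary attributes
def winnow(data, alpha=2.0, theta=1.0, max_epochs=100):
    n = len(data[0][0])
    a = [1.0] * n                              # start with uniform weights
    for _ in range(max_epochs):
        mistakes = 0
        for y, target in data:                 # target is 1 (target class) or 0
            predicted = 1 if sum(ai * yi for ai, yi in zip(a, y)) > theta else 0
            if predicted != target:
                mistakes += 1
                for i in range(n):
                    if y[i] == 1:              # only active attributes are updated
                        a[i] = a[i] * alpha if target == 1 else a[i] / alpha
        if mistakes == 0:
            return a
    return a

# target class = value of the first attribute; the other two attributes are irrelevant
data = [([1, 0, 1], 1), ([1, 1, 0], 1), ([0, 1, 1], 0), ([0, 0, 1], 0)]
a = winnow(data)
print(a)
```

On this toy set the weight of the relevant first attribute stays high while the irrelevant ones get demoted, illustrating WINNOW's attribute efficiency.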
The Algorithm II
WINNOW is very effective at homing in on the relevant features (it is attribute-efficient).

It can also be used in an on-line setting in which new instances arrive continuously (like the perceptron algorithm).
Balanced WINNOW I
WINNOW doesn't allow negative weights, and this can be a drawback in some applications.

BALANCED WINNOW maintains two weight vectors, one for each class: \vec{a}^+ and \vec{a}^−.

An instance is classified as belonging to the first class (of the two) if:

(a_0^+ − a_0^−) + (a_1^+ − a_1^−) y_1 + (a_2^+ − a_2^−) y_2 + · · · + (a_k^+ − a_k^−) y_k > θ
Balanced WINNOW II
BALANCED WINNOW
while some instances are misclassified
    for each instance \vec{y} in the training data
        classify \vec{y} using the current weights
        if the predicted class is incorrect
            if \vec{y} belongs to the first class
                for each attribute y_i = 1,
                    multiply a_i^+ by α and divide a_i^− by α
                    (if y_i = 0, leave a_i^+ and a_i^− unchanged)
            otherwise
                for each attribute y_i = 1,
                    multiply a_i^− by α and divide a_i^+ by α
                    (if y_i = 0, leave a_i^+ and a_i^− unchanged)
Minimum Squared Error Approach I
Minimum Squared Error (MSE)

It trades the ability to obtain a separating vector for good performance on both separable and non-separable problems.
Previously, we sought a weight vector \vec{a} making all of the inner products \vec{a}^t \vec{y}_i ≥ 0.
In the MSE procedure, one tries to make \vec{a}^t \vec{y}_i = b_i, where the b_i are some arbitrarily specified positive constants.

Using matrix notation: Y\vec{a} = \vec{b}.
If Y were nonsingular, then \vec{a} = Y^{−1} \vec{b}.
Unfortunately Y is not a square matrix: it usually has more rows than columns.
When there are more equations than unknowns, \vec{a} is overdetermined, and ordinarily no exact solution exists.
Minimum Squared Error Approach II
We can seek a weight vector \vec{a} that minimizes some function of an error vector \vec{e} = Y\vec{a} − \vec{b}.
Minimizing the squared length of the error vector is equivalent to minimizing the sum-of-squared-error criterion function

J(\vec{a}) = \|Y\vec{a} − \vec{b}\|^2 = \sum_{i=1}^n (\vec{a}^t \vec{y}_i − b_i)^2

whose gradient is

∇J = 2 \sum_{i=1}^n (\vec{a}^t \vec{y}_i − b_i) \vec{y}_i = 2 Y^t (Y\vec{a} − \vec{b})

Setting the gradient equal to zero, the following necessary condition holds: Y^t Y \vec{a} = Y^t \vec{b}
Minimum Squared Error Approach III
Y^t Y is a square matrix which is often nonsingular. Therefore, solving for \vec{a}:

\vec{a} = (Y^t Y)^{−1} Y^t \vec{b} = Y^+ \vec{b}

where Y^+ = (Y^t Y)^{−1} Y^t is the pseudo-inverse of Y.

Y^+ can also be written as lim_{ε→0} (Y^t Y + εI)^{−1} Y^t, and it can be shown that this limit always exists; hence

\vec{a} = Y^+ \vec{b}

is the MSE solution to the problem Y\vec{a} = \vec{b}.
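The MSE solution can be computed by solving the normal equations Y^t Y \vec{a} = Y^t \vec{b}; a minimal sketch with a tiny Gaussian-elimination solver (the data matrix and margins below are made-up examples):

```python
# MSE / pseudo-inverse solution: solve the normal equations Y^t Y a = Y^t b
def solve(A, c):
    """Gaussian elimination with partial pivoting (A square and small)."""
    n = len(A)
    M = [row[:] + [ci] for row, ci in zip(A, c)]    # augmented matrix
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(M[r][k]))
        M[k], M[p] = M[p], M[k]
        for r in range(k + 1, n):
            f = M[r][k] / M[k][k]
            for j in range(k, n + 1):
                M[r][j] -= f * M[k][j]
    x = [0.0] * n
    for k in range(n - 1, -1, -1):                  # back substitution
        x[k] = (M[k][n] - sum(M[k][j] * x[j] for j in range(k + 1, n))) / M[k][k]
    return x

def mse_solution(Y, b):
    n = len(Y[0])
    YtY = [[sum(row[i] * row[j] for row in Y) for j in range(n)] for i in range(n)]
    Ytb = [sum(row[i] * bi for row, bi in zip(Y, b)) for i in range(n)]
    return solve(YtY, Ytb)

Y = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]   # augmented examples
b = [1.0, 3.0, 5.0, 7.0]                               # margins, here exactly b = 1 + 2x
print(mse_solution(Y, b))                              # close to [1.0, 2.0]
```

In practice one would use a library pseudo-inverse or least-squares routine; the explicit solver is shown only to make the normal equations concrete.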
WIDROW-HOFF procedure a.k.a. LMS I
The criterion function J(\vec{a}) = \|Y\vec{a} − \vec{b}\|^2 can also be minimized by a gradient descent procedure.

Advantages:
Avoids the problems that arise when Y^t Y is singular.
Avoids the need for working with large matrices.

Since ∇J = 2 Y^t (Y\vec{a} − \vec{b}), a simple update rule would be:

\vec{a}(1) arbitrary
\vec{a}(k + 1) = \vec{a}(k) + η(k) Y^t (\vec{b} − Y\vec{a}(k))

or, if we consider the examples sequentially:

\vec{a}(1) arbitrary
\vec{a}(k + 1) = \vec{a}(k) + η(k) [b_k − \vec{a}(k)^t \vec{y}^{(k)}] \vec{y}^{(k)}
WIDROW-HOFF procedure a.k.a. LMS II
LMS({\vec{y}_i}_{i=1}^n)

input {\vec{y}_i}_{i=1}^n: training examples
begin
    initialize \vec{a}, \vec{b}, θ, η(·), k ← 0
    do
        k ← (k + 1) mod n
        \vec{a} ← \vec{a} + η(k) (b_k − \vec{a}^t \vec{y}^{(k)}) \vec{y}^{(k)}
    until |η(k) (b_k − \vec{a}^t \vec{y}^{(k)}) \vec{y}^{(k)}| < θ
    return \vec{a}
end
Summary
The perceptron training rule is guaranteed to succeed if:
the training examples are linearly separable;
the learning rate η is sufficiently small.

The linear unit training rule uses gradient descent; it is guaranteed to converge to the hypothesis with minimum squared error:
given a sufficiently small learning rate η;
even when the training data contains noise;
even when the training data is not separable by H.
Linear Regression
Standard technique for numeric prediction. The outcome is a linear combination of the attributes:

x = w_0 + w_1 x_1 + w_2 x_2 + · · · + w_d x_d

The weights \vec{w} are calculated from the training data by standard matrix algorithms.

Predicted value for the first training instance \vec{x}^{(1)}:

w_0 + w_1 x_1^{(1)} + w_2 x_2^{(1)} + · · · + w_d x_d^{(1)} = \sum_{j=0}^d w_j x_j^{(1)}

assuming extended vectors with x_0 = 1.
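With extended vectors the prediction is a single dot product; a minimal sketch with made-up weights:

```python
# Linear regression prediction: sum_{j=0}^{d} w_j x_j with the extended attribute x0 = 1
def predict(w, x):
    extended = [1.0] + list(x)             # prepend x0 = 1
    return sum(wj * xj for wj, xj in zip(w, extended))

w = [0.5, 2.0, -1.0]                       # w0, w1, w2 (illustrative values)
print(predict(w, [3.0, 1.0]))              # 0.5 + 6.0 - 1.0 = 5.5
```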
Probabilistic Classification
Multiresponse Linear Regression (MLR)

Any regression technique can be used for classification.

Training: perform a regression for each class, setting the output to 1 for training instances that belong to the class and 0 for those that don't; this yields a linear model g_i per class.

Prediction: compute each linear expression g_i for the new instance and predict the class corresponding to the model with the largest output value (membership value).
Logistic Regression I
MLR drawbacks:

1. membership values are not proper probabilities: they can fall outside [0, 1];
2. least squares regression assumes that the errors are not only statistically independent, but also normally distributed with the same standard deviation.

The logit transformation does not suffer from these problems.
It builds a linear model for a transformed target variable.
Assume we have two classes.
Logistic Regression II
Logistic regression replaces the target

Pr(1 | \vec{x})

which cannot be approximated well using a linear function, with the target

log( Pr(1 | \vec{x}) / (1 − Pr(1 | \vec{x})) )

This transformation maps [0, 1] to (−∞, +∞).
Logistic Regression III
(Figure: the logit transformation function)
Example: Logistic Regression Model
Resulting model:

Pr(1 | \vec{y}) = 1 / (1 + e^{−(a_0 + a_1 y_1 + a_2 y_2 + · · · + a_d y_d)})

Example: a model with a_0 = 0.5 and a_1 = 1.

The parameters are induced from the data using maximum likelihood.
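The resulting model is a sigmoid applied to a linear score; a sketch using the one-attribute example model with a_0 = 0.5 and a_1 = 1 (the input values are made up):

```python
# Logistic model: Pr(1|y) = 1 / (1 + exp(-(a0 + a1*y1 + ... + ad*yd)))
import math

def prob_first_class(a, y):
    score = a[0] + sum(ai * yi for ai, yi in zip(a[1:], y))
    return 1.0 / (1.0 + math.exp(-score))

a = [0.5, 1.0]                       # a0 = 0.5, a1 = 1 as in the example model
print(prob_first_class(a, [-0.5]))   # score 0 -> probability 0.5
print(prob_first_class(a, [3.0]))    # score 3.5 -> close to 1
```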
Maximum Likelihood
Aim: maximize the probability of the training data with respect to the parameters.
One can use logarithms of probabilities and maximize the log-likelihood of the model (instead of minimizing the squared error):

\sum_{i=1}^n (1 − x^{(i)}) log(1 − Pr(1 | \vec{y}^{(i)})) + x^{(i)} log(Pr(1 | \vec{y}^{(i)}))

where the x^{(i)}'s are the responses (either 0 or 1).

The weights a_i need to be chosen to maximize the log-likelihood; a relatively simple method is iteratively re-weighted least squares.
Credits
R. Duda, P. Hart, D. Stork: Pattern Classification, Wiley
T. M. Mitchell: Machine Learning, McGraw Hill
I. Witten & E. Frank: Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann