
Linear Discrimination Functions

Corso di Apprendimento Automatico
Laurea Magistrale in Informatica

Nicola Fanizzi

Dipartimento di Informatica, Università degli Studi di Bari

November 4, 2009


Outline

Linear models
Gradient descent
Perceptron
Minimum squared error approach
Linear and logistic regression


Linear Discriminant Functions I

A linear discriminant function can be written as

g(x) = w1x1 + · · ·+ wdxd + w0 = ~w t~x + w0

where ~w is the weight vector and w0 is the bias (threshold)

A 2-class linear classifier implements the decision rule:

Decide ω1 if g(x) > 0 and ω2 if g(x) < 0


Linear Discriminant Functions II

The equation g(x) = 0 defines the decision surface that separates points assigned to ω1 from points assigned to ω2.

When g(x) is linear, this decision surface is a hyperplane (H).


Linear Discriminant Functions III

H divides the feature space into two half-spaces: R1 for ω1 and R2 for ω2

If x1 and x2 are both on the decision surface

~w t~x1 + w0 = ~w t~x2 + w0 ⇒ ~w t (~x1 − ~x2) = 0

w is normal to any vector lying in the hyperplane


Linear Discriminant Functions IV

If we express ~x as

~x = ~xp + r ~w/||~w||

where ~xp is the normal projection of ~x onto H, and r is the algebraic distance from ~x to the hyperplane

Since g(~xp) = 0, we have g(~x) = ~w t~x + w0 = r ||~w||, i.e. r = g(~x)/||~w||

r is a signed distance: r > 0 if ~x falls in R1, r < 0 if ~x falls in R2

The distance from the origin to the hyperplane is w0/||~w||
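As a concrete illustration of these formulas, the sketch below (Python/NumPy, not part of the original slides) evaluates g(~x) = ~w t~x + w0 for a 2-class problem, applies the decision rule, and computes the signed distance r = g(~x)/||~w||; the weight values are made up for the example.

```python
import numpy as np

# Hypothetical weight vector and bias for a 2-class problem in d = 2 dimensions
w = np.array([2.0, -1.0])   # weight vector ~w
w0 = 0.5                    # bias / threshold

def g(x):
    """Linear discriminant g(x) = w^t x + w0."""
    return w @ x + w0

def classify(x):
    """Decide omega_1 if g(x) > 0, omega_2 if g(x) < 0."""
    return "omega_1" if g(x) > 0 else "omega_2"

def signed_distance(x):
    """Signed distance r = g(x) / ||w|| from x to the hyperplane g(x) = 0."""
    return g(x) / np.linalg.norm(w)

x = np.array([1.0, 1.5])
print(classify(x), signed_distance(x))
```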


Linear Discriminant Functions V


Multi-category Case I

Two approaches to extend the LDF approach to the multi-category case:

ωi / not ωi: reduce the problem to c − 1 two-class problems. Problem #i: find the function that separates points assigned to ωi from those not assigned to ωi

ωi / ωj: find the c(c − 1)/2 linear discriminants, one for every pair of classes

Both approaches can lead to regions in which the classification is undefined


Multi-category Case II


Pairwise Classification

Idea: build a model for each pair of classes, using only the training data from those classes
Problem: solve c(c − 1)/2 classification problems for c classes
This turns out not to be a problem in many cases, because the training sets become small:

Assume the data is evenly distributed, i.e. 2n/c instances per learning problem for n instances in total
Suppose the learning algorithm is linear in n
Then the runtime of pairwise classification is proportional to c(c − 1)/2 × 2n/c = (c − 1)n


Linear Machine I

Define c linear discriminant functions:

gi(~x) = ~wi t~x + wi0,  i = 1, . . . , c

Linear machine classifier: assign ~x to ωi if gi(~x) > gj(~x) for all j ≠ i
In case of equal scores, the classification is undefined

A LM divides the feature space into c decision regions, with gi(~x) the largest discriminant if ~x is in Ri

If Ri and Rj are contiguous, the boundary between them is a portion of the hyperplane Hij defined by:

gi(~x) = gj(~x), i.e. (~wi − ~wj)t~x + (wi0 − wj0) = 0
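A linear machine is straightforward to express in code. The sketch below (Python/NumPy, not from the slides) stacks c weight vectors and biases with illustrative values and assigns each input to the class with the largest discriminant.

```python
import numpy as np

# Hypothetical parameters for a c = 3 class linear machine in d = 2 dimensions
W = np.array([[ 1.0,  0.0],    # ~w_1
              [-0.5,  1.0],    # ~w_2
              [ 0.0, -1.0]])   # ~w_3
w0 = np.array([0.1, -0.2, 0.0])  # biases w_i0

def linear_machine(x):
    """Assign x to the class with the largest discriminant g_i(x) = w_i^t x + w_i0."""
    scores = W @ x + w0
    return int(np.argmax(scores)) + 1   # class index in {1, ..., c}

print(linear_machine(np.array([0.5, 2.0])))
```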


Linear Machine II

It follows that ~wi − ~wj is normal to Hij
The signed distance from ~x to Hij is:

(gi(~x) − gj(~x)) / ||~wi − ~wj||

There are c(c − 1)/2 pairs of (convex) regions
Not all regions are contiguous, and the total number of segments in the separating surfaces is often less than c(c − 1)/2

3- and 5-class problems


Generalized LDF I

The LDF is g(~x) = w0 + ∑_{i=1}^{d} wi xi

Adding d(d + 1)/2 terms involving the products of pairs of components of ~x yields the quadratic discriminant function:

g(~x) = w0 + ∑_{i=1}^{d} wi xi + ∑_{i=1}^{d} ∑_{j=1}^{d} wij xi xj

The separating surface defined by g(~x) = 0 is a second-degree or hyperquadric surface

Adding more terms, such as wijk xi xj xk, we obtain polynomial discriminant functions


Generalized LDF II

The generalized LDF is defined as

g(~x) = ∑_{i=1}^{d̂} ai yi(~x) = ~a t~y

where ~a is a d̂-dimensional weight vector and the yi(~x) are arbitrary functions of ~x

The resulting discriminant function is not linear in ~x, but it is linear in ~y

The functions yi(~x) map points in d-dimensional ~x-space to points in the d̂-dimensional ~y-space


Generalized LDF III

Example: let the QDF be g(~x) = a1 + a2x + a3x²

The 3-dimensional vector is then ~y = [1  x  x²]t
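In code, the mapping from x-space to y-space is just a feature transformation followed by a linear function. The sketch below (Python/NumPy, illustrative weight values only) implements y(x) = [1, x, x²] and evaluates the quadratic discriminant as a linear function of ~y.

```python
import numpy as np

def phi(x):
    """Map a scalar x to the feature vector y = [1, x, x^2]."""
    return np.array([1.0, x, x**2])

# Hypothetical weights a = [a1, a2, a3] of the quadratic discriminant
a = np.array([1.0, -2.0, 0.5])

def g(x):
    """Quadratic in x, but linear in y = phi(x): g(x) = a^t y."""
    return a @ phi(x)

print(g(3.0))   # a1 + a2*3 + a3*9
```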


2-class Linearly-Separable Case I

g(~x) = ∑_{i=0}^{d} wi xi = ~a t~y

where x0 = 1,
~y t = [1 ~x] = [1 x1 · · · xd] is the augmented feature vector, and
~a t = [w0 ~w] = [w0 w1 · · · wd] is the augmented weight vector

The hyperplane decision surface H, defined by ~a t~y = 0, passes through the origin in ~y-space

The distance from any point ~y to H is given by ~a t~y/||~a|| = g(~x)/||~a||

Because ||~a|| = √(1 + ||~w||²), this distance is less than the distance from ~x to H


2-class Linearly-Separable Case II

Problem: find [w0 ~w ] = ~a

Suppose that we have a set of n examples {~y1, . . . , ~yn} labeled ω1 or ω2

Look for a weight vector ~a that classifies all the examples correctly:

~a t~yi > 0 for every ~yi labeled ω1, and ~a t~yi < 0 for every ~yi labeled ω2

If ~a exists, the examples are linearly separable


2-class Linearly-Separable Case III

Solutions
Replacing all the examples labeled ω2 by their negatives, one can look for a weight vector ~a such that ~a t~yi > 0 for all the examples
~a is a.k.a. the separating vector or solution vector

Each example ~yi places a constraint on the possible location of a solution vector:
~a t~yi = 0 defines a hyperplane through the origin having ~yi as a normal vector

The solution vector (if it exists) must be on the positive sideof every hyperplane

Solution Region = intersection of the n half-spaces


2-class Linearly-Separable Case IV

Any vector that lies in the solution region is a solution vector: the solution vector (if it exists) is not unique
Additional requirements can be imposed to find a solution vector closer to the middle of the region (i.e. more likely to classify new examples correctly)
E.g., seek a unit-length weight vector that maximizes the minimum distance from the examples to the hyperplane


2-class Linearly-Separable Case V

Seek the minimum-length weight vector satisfying

~at~yi ≥ b ≥ 0

The solution region shrinks by the margin b/||~yi||


Gradient Descent I

Define a criterion function J(~a) that is minimized when ~a is a solution vector: ~a t~yi ≥ 0, ∀i = 1, . . . , n

Start with some arbitrary vector ~a(1)

Compute the gradient vector ∇J(~a(1))

The next value ~a(2) is obtained by moving a distance from ~a(1) in the direction of steepest descent, i.e. along the negative of the gradient

In general, ~a(k + 1) is obtained from ~a(k) using

~a(k + 1)← ~a(k)− η(k)∇J(~a(k))

where η(k) is the learning rate


Gradient Descent II


Gradient Descent & Delta Rule I

To understand, consider a simpler linear machine (a.k.a. linear unit), where

o = w0 + w1x1 + · · · + wnxn

Let's learn the wi's that minimize the squared error, i.e. J(~w) = E[~w]

E[~w] ≡ (1/2) ∑_{d∈D} (td − od)²

where D is the set of training examples 〈~x, t〉 and t is the target output value


Gradient Descent & Delta Rule II

Gradient

∇E[~w] ≡ [ ∂E/∂w0, ∂E/∂w1, · · · , ∂E/∂wn ]

Training rule:

∆~w = −η∇E[~w]

i.e.,

∆wi = −η ∂E/∂wi

Note that η may be a constant


Gradient Descent & Delta Rule III

∂E/∂wi = ∂/∂wi [ (1/2) ∑_d (td − od)² ]
       = (1/2) ∑_d ∂/∂wi (td − od)²
       = (1/2) ∑_d 2(td − od) ∂/∂wi (td − od)
       = ∑_d (td − od) ∂/∂wi (td − ~w · ~xd)

∂E/∂wi = ∑_d (td − od)(−xid)


Basic GRADIENT-DESCENT Algorithm

GRADIENT-DESCENT(D, η)
D: training set, η: learning rate (e.g. 0.5)

Initialize each wi to some small random value
until the termination condition is met do
    Initialize each ∆wi to zero
    for each 〈~x, t〉 ∈ D do
        Input the instance ~x to the unit and compute the output o
        for each wi do
            ∆wi ← ∆wi + η(t − o)xi
    for each weight wi do
        wi ← wi + ∆wi
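A runnable version of this procedure is sketched below in Python/NumPy (the data and learning rate are made-up illustrative values); it accumulates the weight updates over the whole training set before applying them:

```python
import numpy as np

def gradient_descent(X, t, eta=0.1, epochs=100):
    """Batch delta-rule training of a linear unit o = w . x (X includes a bias column x0 = 1)."""
    w = np.random.uniform(-0.05, 0.05, X.shape[1])   # small random initial weights
    for _ in range(epochs):                           # termination: fixed number of epochs
        delta_w = np.zeros_like(w)
        for x, target in zip(X, t):
            o = w @ x                                 # unthresholded output of the unit
            delta_w += eta * (target - o) * x         # accumulate updates over D
        w += delta_w                                  # apply the summed update
    return w

# Illustrative data: 3 examples with a bias feature x0 = 1
X = np.array([[1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [1.0, 1.0, 1.0]])
t = np.array([1.0, -1.0, 1.0])
print(gradient_descent(X, t, eta=0.1))
```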


Incremental (Stochastic) GRADIENT DESCENT I

Approximation of the standard GRADIENT-DESCENT

Batch GRADIENT-DESCENT:
Do until satisfied
    1. Compute the gradient ∇ED[~w]
    2. ~w ← ~w − η∇ED[~w]

Incremental GRADIENT-DESCENT:
Do until satisfied
    For each training example d in D
        1. Compute the gradient ∇Ed[~w]
        2. ~w ← ~w − η∇Ed[~w]


Incremental (Stochastic) GRADIENT DESCENT II

ED[~w] ≡ (1/2) ∑_{d∈D} (td − od)²

Ed[~w] ≡ (1/2)(td − od)²

Training rule (delta rule):

∆wi ← η(t − o)xi

similar to the perceptron training rule, yet unthresholded
convergence is only asymptotically guaranteed
linear separability is no longer needed!


Standard vs. Stochastic GRADIENT-DESCENT

Incremental GD can approximate batch GD arbitrarily closely if η is made small enough

In standard GD the error is summed over all examples before updating the weights; in stochastic GD the weights are updated upon each example
Standard GD is more costly per update step and can employ a larger η
Stochastic GD may avoid falling into local minima because it uses Ed instead of ED


Newton’s Algorithm

J(~a) ≈ J(~a(k)) + ∇J t(~a − ~a(k)) + (1/2)(~a − ~a(k))t H (~a − ~a(k))

where H = [∂²J/∂ai∂aj] is the Hessian matrix

Choose ~a(k + 1) to minimize this function: ~a(k + 1) ← ~a(k) − H⁻¹∇J(~a)

Greater improvement per step than GD, but not applicable when H is singular
Time complexity: O(d³)
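As a sketch of how a Newton step compares with a plain gradient step (Python/NumPy, using a simple quadratic criterion as a stand-in for J; all values are illustrative):

```python
import numpy as np

# Stand-in quadratic criterion J(a) = 1/2 a^t Q a - b^t a, so grad J = Q a - b and H = Q
Q = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, -1.0])

def grad_J(a):
    return Q @ a - b

a = np.zeros(2)
# One Newton step: a <- a - H^{-1} grad J(a); solve a linear system instead of forming H^{-1}
a_newton = a - np.linalg.solve(Q, grad_J(a))
# One gradient-descent step with learning rate eta
a_gd = a - 0.1 * grad_J(a)
print(a_newton, a_gd)   # for a quadratic criterion, the Newton step reaches the minimizer at once
```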


Perceptron I

Assumption: data is linearly separable

Hyperplane: ∑_{i=0}^{d} wi xi = 0, assuming that there is a constant attribute x0 = 1 (bias)

Algorithm for learning the separating hyperplane: the perceptron learning rule

Classifier: if ∑_{i=0}^{d} wi xi > 0 then predict ω1 (or +1), otherwise predict ω2 (or −1)


Perceptron II

Thresholded output

o(x1, . . . , xd) = +1 if w0 + w1x1 + · · · + wdxd > 0, −1 otherwise

Simpler vector notation: o(~x) = sgn(~w · ~x), i.e. +1 if ~w · ~x > 0 and −1 otherwise

Space of the hypotheses: {~w | ~w ∈ Rn}


Decision Surface of a Perceptron

Can represent some useful functions
What weights represent g(x1, x2) = AND(x1, x2)?

But some functions are not representable, e.g. those that are not linearly separable (XOR)
Therefore, we'll want networks of these...


Perceptron Training Rule I

Perceptron criterion function: J(~a) = ∑_{~y∈Y(~a)} (−~a t~y)

where Y(~a) is the set of examples misclassified by ~a
If no examples are misclassified, Y(~a) is empty and J(~a) = 0 (i.e. ~a is a solution vector)
J(~a) ≥ 0, since ~a t~yi ≤ 0 if ~yi is misclassified

Geometrically, J(~a) is proportional to the sum of the distances from the misclassified examples to the decision boundary

Since ∇J = ∑_{~y∈Y(~a)} (−~y), the update rule becomes

~a(k + 1) ← ~a(k) + η(k) ∑_{~y∈Yk(~a)} ~y

where Yk(~a) is the set of examples misclassified by ~a(k)


Perceptron Training I

Set all coefficients ai to zero
do
    for each instance ~y in the training data
        if ~y is classified incorrectly by the perceptron
            if ~y belongs to ω1, add it to ~a
            else subtract it from ~a
until all instances in the training data are classified correctly
return ~a
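A runnable sketch of this rule (Python/NumPy; the toy data below is made up, assumes augmented vectors with a leading 1, and uses labels +1 for ω1 and −1 for ω2, which is equivalent to adding ω1 instances and subtracting ω2 instances):

```python
import numpy as np

def perceptron_train(Y, labels, max_epochs=1000):
    """Fixed-increment perceptron: add misclassified omega_1 instances to a, subtract omega_2 ones."""
    a = np.zeros(Y.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for y, lab in zip(Y, labels):
            if lab * (a @ y) <= 0:        # misclassified (or on the boundary)
                a += lab * y              # +y for omega_1, -y for omega_2
                errors += 1
        if errors == 0:                   # all instances classified correctly
            break
    return a

# Toy linearly separable data, augmented with x0 = 1
Y = np.array([[1.0, 2.0, 1.0],
              [1.0, 1.5, 2.0],
              [1.0, -1.0, -0.5],
              [1.0, -2.0, -1.0]])
labels = np.array([1, 1, -1, -1])
print(perceptron_train(Y, labels))
```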


Perceptron Training II

BATCH PERCEPTRON TRAINING

Initialize ~a, η(·), θ, k ← 0
do
    k ← k + 1
    ~a ← ~a + η(k) ∑_{~y∈Yk} ~y
until |η(k) ∑_{~y∈Yk} ~y| < θ
return ~a

Can prove it will converge

If training data is linearly separable

and η sufficiently small


Perceptron Training III

Why does this work? Consider the situation where an instance pertaining to the first class has been added:

(a0 + y0)y0 + (a1 + y1)y1 + (a2 + y2)y2 + · · · + (ad + yd)yd

This means the output for ~y has increased by:

y0y0 + y1y1 + y2y2 + · · · + ydyd

which is always positive; thus the hyperplane has moved in the correct direction (and the output decreases for instances of the other class)


Perceptron Training IV

η = 1 and ~a(1) = ~0. Sequence of misclassified instances: ~y1 + ~y2 + ~y3, ~y2, ~y3, ~y1, ~y3; stop


Perceptron

Simplification: FIXED-INCREMENT SINGLE-EXAMPLE PERCEPTRON

input: training examples {~y(k)}, k = 1, . . . , n
begin initialize ~a, k ← 0
    do
        k ← (k + 1) mod n
        if ~y(k) is misclassified by the model based on ~a
        then ~a ← ~a + ~y(k)
    until all examples properly classified
    return ~a

end


Generalizations I

VARIABLE-INCREMENT PERCEPTRON WITH MARGIN

begin
    initialize ~a, θ, margin b, η, k ← 0
    do
        k ← (k + 1) mod n
        if ~a t~y(k) ≤ b then ~a ← ~a + η(k)~y(k)
    until ~a t~y(k) > b for all k
    return ~a

end


Generalizations II

BATCH VARIABLE-INCREMENT PERCEPTRON

begin
    initialize ~a, η, k ← 0
    do
        k ← (k + 1) mod n
        Yk ← ∅
        j ← 0
        do
            j ← j + 1
            if ~yj is misclassified then Yk ← Yk ∪ {~yj}
        until j = n
        ~a ← ~a + η(k) ∑_{~y∈Yk} ~y
    until Yk = ∅
    return ~a

end


Comments

The perceptron adjusts the parameters only when it encounters an error, i.e. a misclassified training example
Correctly classified examples can be ignored
The learning rate η can be chosen arbitrarily: it only impacts the norm of the final ~a (and the corresponding magnitude of a0)
The final weight vector ~a is a linear combination of training points


Nonseparable Case

The perceptron is an error-correcting procedure that converges when the examples are linearly separable
Even if a separating vector is found for the training examples, it does not follow that the resulting classifier will perform well on independent test data
To ensure that the performance on training and test data will be similar, many training examples should be used
Sufficiently large training sets are almost certainly not linearly separable
No weight vector can correctly classify every example in a nonseparable set

The corrections may never cease if the set is nonseparable


Learning rate

If we choose η(k) → 0 as k → ∞, then performance can be acceptable on non-separable problems while preserving the ability to find a solution on separable problems
η(k) can be considered a function of recent performance, decreasing it as performance improves, e.g. η(k) ← η/k
The rate at which η(k) approaches zero is important:

Too slow: the result will be sensitive to those examples that render the set non-separable

Too fast: may converge prematurely with sub-optimal results


Linear Models: WINNOW

Another mistake-driven algorithm for finding a separating hyperplane
Assumes binary attributes (i.e. propositional variables)
Main difference: multiplicative instead of additive updates

Weights are multiplied by a parameter α > 1 (or its inverse)

Another difference: a user-specified threshold parameter θ
Predict the first class if w0 + w1x1 + w2x2 + · · · + wkxk > θ


The Algorithm I

WINNOW

initialize ~a, α
while some instances are misclassified
    for each instance ~y in the training data
        classify ~y using the current model ~a
        if the predicted class is incorrect
            if ~y belongs to the target class
                for each attribute yi = 1, multiply ai by α
                (if yi = 0, ai is left unchanged)
            otherwise
                for each attribute yi = 1, divide ai by α
                (if yi = 0, ai is left unchanged)
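A runnable sketch of WINNOW (Python/NumPy; binary attributes, with made-up values for α and the threshold θ, and weights initialized to 1, which is an assumption not stated on the slide):

```python
import numpy as np

def winnow_train(Y, labels, alpha=2.0, theta=None, max_epochs=100):
    """WINNOW: multiplicative updates on binary attributes; labels are True for the target class."""
    n_attr = Y.shape[1]
    a = np.ones(n_attr)                    # assumed initialization: all weights set to 1
    if theta is None:
        theta = n_attr / 2.0               # assumed user-specified threshold
    for _ in range(max_epochs):
        mistakes = 0
        for y, is_target in zip(Y, labels):
            predicted_target = (a @ y) > theta
            if predicted_target != is_target:          # misclassified
                mistakes += 1
                if is_target:
                    a[y == 1] *= alpha                  # promote active attributes
                else:
                    a[y == 1] /= alpha                  # demote active attributes
        if mistakes == 0:
            break
    return a

# Toy data: the target class is "y1 AND y2" over 4 binary attributes
Y = np.array([[1, 1, 0, 1], [1, 1, 1, 0], [0, 1, 1, 1], [1, 0, 0, 1]])
labels = np.array([True, True, False, False])
print(winnow_train(Y, labels))
```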


The Algorithm II

WINNOW is very effective in homing in on relevant features (it is attribute efficient)

Can also be used in an on-line setting in which new instances arrive continuously (like the perceptron algorithm)


Balanced WINNOW I

WINNOW doesn’t allow negative weights and this can be adrawback in some applications

BALANCED WINNOW maintains two weight vectors, one foreach class: a+ and a−

Instance is classified as belonging to the first class (of twoclasses) if:(a+

0 −a−0 )+(a+1 −a−1 )y1 +(a+

2 −a−2 )y2 +· · ·+(a+k −a−k )yk > θ


Balanced WINNOW II

BALANCED WINNOW

while some instances are misclassified
    for each instance ~y in the training data
        classify ~y using the current weights
        if the predicted class is incorrect
            if ~y belongs to the first class
                for each attribute yi = 1,
                    multiply a+i by α and divide a−i by α
                    (if yi = 0, leave a+i and a−i unchanged)
            otherwise
                for each attribute yi = 1,
                    multiply a−i by α and divide a+i by α
                    (if yi = 0, leave a+i and a−i unchanged)


Minimum Squared Error Approach I

Minimum Squared Error (MSE)
It trades the ability to obtain a separating vector for good performance on both separable and non-separable problems
Previously, we sought a weight vector ~a making all of the inner products ~a t~yi ≥ 0
In the MSE procedure, one tries to make ~a t~yi = bi, where the bi are some arbitrarily specified positive constants

Using matrix notation: Y~a = ~b
If Y were nonsingular, then ~a = Y⁻¹~b
Unfortunately Y is not a square matrix; it usually has more rows than columns
When there are more equations than unknowns, ~a is overdetermined, and ordinarily no exact solution exists


Minimum Squared Error Approach II

We can seek a weight vector ~a that minimizes some function of an error vector ~e = Y~a − ~b
Minimizing the squared length of the error vector is equivalent to minimizing the sum-of-squared-error criterion function

J(~a) = ||Y~a − ~b||² = ∑_{i=1}^{n} (~a t~yi − bi)²

whose gradient is

∇J = 2 ∑_{i=1}^{n} (~a t~yi − bi)~yi = 2Y t(Y~a − ~b)

Setting the gradient equal to zero, the following necessary condition holds: Y tY~a = Y t~b


Minimum Squared Error Approach III

Y tY is a square matrix which is often nonsingular. Therefore, solving for ~a:

~a = (Y tY)⁻¹Y t~b = Y⁺~b

where Y⁺ = (Y tY)⁻¹Y t is the pseudo-inverse of Y

Y⁺ can also be written as lim_{ε→0} (Y tY + εI)⁻¹Y t

and it can be shown that this limit always exists, hence

~a = Y⁺~b

is the MSE solution to the problem Y~a = ~b
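In practice the pseudo-inverse solution is one call to a linear-algebra routine. A sketch in Python/NumPy (the margin vector ~b is set to all ones here, an arbitrary choice of positive constants, and the data is illustrative):

```python
import numpy as np

# Toy augmented, "normalized" examples (class omega_2 rows already negated)
Y = np.array([[ 1.0,  2.0,  1.0],
              [ 1.0,  1.5,  2.0],
              [-1.0,  1.0,  0.5],
              [-1.0,  2.0,  1.0]])
b = np.ones(Y.shape[0])          # arbitrary positive constants b_i

# MSE solution a = Y^+ b via the pseudo-inverse
a = np.linalg.pinv(Y) @ b
print(a, Y @ a)                  # Y a approximates b in the least-squares sense
```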


WIDROW-HOFF procedure a.k.a. LMS I

The criterion function J(~a) = ||Y~a − ~b||² could be minimized by a gradient descent procedure
Advantages:
    avoids the problems that arise when Y tY is singular
    avoids the need for working with large matrices

Since ∇J = 2Y t(Y~a − ~b), a simple update rule would be

~a(1) arbitrary
~a(k + 1) = ~a(k) − η(k)Y t(Y~a(k) − ~b)

or, if we consider the examples sequentially:

~a(1) arbitrary
~a(k + 1) = ~a(k) + η(k)[bk − ~a(k)t~y(k)]~y(k)


WIDROW-HOFF procedure a.k.a. LMS II

LMS({~yi}, i = 1, . . . , n)

input: {~yi}, i = 1, . . . , n: training examples
begin
    initialize ~a, ~b, θ, η(·), k ← 0
    do
        k ← (k + 1) mod n
        ~a ← ~a + η(k)(bk − ~a(k)t~y(k))~y(k)
    until |η(k)(bk − ~a(k)t~y(k))~y(k)| < θ
    return ~a
end
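A runnable sketch of the sequential Widrow-Hoff/LMS rule (Python/NumPy; η(k) = 1/k, b = 1 and the toy data are illustrative choices, and the stopping test mirrors the pseudocode above):

```python
import numpy as np

def lms(Y, b, theta=1e-4, max_iter=10000):
    """Sequential LMS: a <- a + eta(k) * (b_k - a^t y_k) * y_k, with eta(k) = 1/k."""
    a = np.zeros(Y.shape[1])
    n = len(Y)
    for k in range(1, max_iter + 1):
        i = k % n                               # cycle through the examples
        eta = 1.0 / k                           # decreasing learning rate
        correction = eta * (b[i] - a @ Y[i]) * Y[i]
        a = a + correction
        if np.linalg.norm(correction) < theta:  # stop when the update becomes negligible
            return a
    return a

# Toy "normalized" data (omega_2 rows negated), with margin vector b = 1
Y = np.array([[ 1.0,  2.0,  1.0],
              [ 1.0,  1.5,  2.0],
              [-1.0,  1.0,  0.5],
              [-1.0,  2.0,  1.0]])
print(lms(Y, np.ones(len(Y))))
```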


Summary

Perceptron training rule guaranteed to succeed if:
    the training examples are linearly separable
    the learning rate η is sufficiently small

Linear unit training rule uses gradient descent:
    guaranteed to converge to the minimum-squared-error hypothesis
    given a sufficiently small learning rate η
    even when the training data contains noise
    even when the training data is not separable by H


Linear Regression

Standard technique for numeric prediction
Outcome is a linear combination of the attributes:

x = w0 + w1x1 + w2x2 + · · · + wdxd

Weights ~w are calculated from the training data using standard math algorithms

Predicted value for the first training instance ~x(1):

w0 + w1x1(1) + w2x2(1) + · · · + wdxd(1) = ∑_{j=0}^{d} wj xj(1)

assuming extended vectors with x0 = 1


Probabilistic Classification

Multiresponse Linear Regression (MLR)
Any regression technique can be used for classification

Training: perform a regression for each class (→ a linear gi), setting the output to 1 for training instances that belong to the class and 0 for those that don't
Prediction: compute each linear expression and predict the class corresponding to the model with the largest output value (membership value)
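A small sketch of multiresponse linear regression for classification (Python/NumPy; it reuses least-squares fitting with 0/1 targets, one regression per class, and the data is illustrative):

```python
import numpy as np

def mlr_train(X, classes, n_classes):
    """Fit one least-squares linear model per class against 0/1 membership targets."""
    X1 = np.hstack([np.ones((len(X), 1)), X])            # add the constant attribute x0 = 1
    W = np.zeros((n_classes, X1.shape[1]))
    for c in range(n_classes):
        targets = (classes == c).astype(float)            # 1 for members of class c, 0 otherwise
        W[c] = np.linalg.lstsq(X1, targets, rcond=None)[0]
    return W

def mlr_predict(W, x):
    """Predict the class whose linear model gives the largest membership value."""
    return int(np.argmax(W @ np.concatenate(([1.0], x))))

X = np.array([[0.0, 0.2], [0.1, 0.4], [1.0, 0.9], [0.9, 1.1]])
classes = np.array([0, 0, 1, 1])
W = mlr_train(X, classes, 2)
print(mlr_predict(W, np.array([0.95, 1.0])))
```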


Logistic Regression I

MLR drawbacks:
1. membership values are not proper probabilities

they can fall outside [0,1]

2. least squares regression assumes that the errors are not only statistically independent, but also normally distributed with the same standard deviation

Logit transformation does not suffer from these problems

Builds a linear model for a transformed target variable

Assume we have two classes


Logistic Regression II

Logistic regression replaces the target

Pr(1 | ~x)

that cannot be approximated well using a linear function with this target:

log( Pr(1 | ~x) / (1 − Pr(1 | ~x)) )

Transformation maps [0,1] to (−∞,+∞)


Logistic Regression III

logit transformation function


Example: Logistic Regression Model

Resulting model: Pr(1 | ~y) = 1/(1 + e^−(a0 + a1y1 + a2y2 + · · · + adyd))

Example: model with a0 = 0.5 and a1 = 1:

Parameters are induced from the data using maximum likelihood
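A sketch of this model in code (Python/NumPy), evaluating Pr(1 | y) for the slide's example coefficients a0 = 0.5 and a1 = 1:

```python
import numpy as np

def logistic_prob(a, y):
    """Pr(1 | y) = 1 / (1 + exp(-(a0 + a1*y1 + ... + ad*yd)))."""
    return 1.0 / (1.0 + np.exp(-(a[0] + a[1:] @ y)))

a = np.array([0.5, 1.0])                 # a0 = 0.5, a1 = 1 as in the slide's example
for y1 in (-3.0, 0.0, 3.0):
    print(y1, logistic_prob(a, np.array([y1])))
```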


Maximum Likelihood

Aim: maximize the probability of the training data with respect to the parameters
Can work with logarithms of probabilities and maximize the log-likelihood of the model:

∑_{i=1}^{n} (1 − x(i)) log(1 − Pr(1 | ~y(i))) + x(i) log(Pr(1 | ~y(i)))

where the x(i) are the responses (either 0 or 1)

The weights ai need to be chosen to maximize the log-likelihood
A relatively simple method: iteratively re-weighted least squares
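For completeness, a sketch (Python/NumPy) of evaluating this log-likelihood for a candidate weight vector; the fitting step that maximizes it (e.g. iteratively re-weighted least squares) is not shown, and the data below is illustrative:

```python
import numpy as np

def log_likelihood(a, Y, x):
    """Sum over i of (1 - x_i) log(1 - Pr(1|y_i)) + x_i log(Pr(1|y_i))."""
    p = 1.0 / (1.0 + np.exp(-(Y @ a)))        # Pr(1 | y_i) for each row of Y (with leading 1s)
    return np.sum((1 - x) * np.log(1 - p) + x * np.log(p))

# Illustrative data: rows of Y are augmented instances [1, y1], x are 0/1 responses
Y = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.5]])
x = np.array([0, 0, 1, 1])
print(log_likelihood(np.array([0.5, 1.0]), Y, x))
```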


Credits

R. Duda, P. Hart, D. Stork: Pattern Classification, Wiley
T. M. Mitchell: Machine Learning, McGraw Hill
I. Witten & E. Frank: Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann
