Linear Discrimination Functions
Corso di Apprendimento Automatico, Laurea Magistrale in Informatica
Nicola Fanizzi
Dipartimento di Informatica, Università degli Studi di Bari
November 4, 2009
Corso di Apprendimento Automatico Linear Discrimination Functions
Outline
Linear models
Gradient descent
Perceptron
Minimum squared error approach
Linear and logistic regression
Linear Discriminant Functions I
A linear discriminant function can be written as

g(\vec{x}) = w_1 x_1 + \cdots + w_d x_d + w_0 = \vec{w}^t \vec{x} + w_0

where \vec{w} is the weight vector and w_0 is the bias (or threshold).

A 2-class linear classifier implements the decision rule:

Decide ω_1 if g(\vec{x}) > 0 and ω_2 if g(\vec{x}) < 0
Linear Discriminant Functions II
The equation g(\vec{x}) = 0 defines the decision surface that separates points assigned to ω_1 from points assigned to ω_2.

When g(\vec{x}) is linear, this decision surface is a hyperplane (H).
Linear Discriminant Functions III
H divides the feature space into 2 half-spaces: R_1 for ω_1 and R_2 for ω_2.

If \vec{x}_1 and \vec{x}_2 are both on the decision surface:

\vec{w}^t \vec{x}_1 + w_0 = \vec{w}^t \vec{x}_2 + w_0 ⇒ \vec{w}^t (\vec{x}_1 − \vec{x}_2) = 0

Hence \vec{w} is normal to any vector lying in the hyperplane.
Linear Discriminant Functions IV
If we express \vec{x} as

\vec{x} = \vec{x}_p + r \frac{\vec{w}}{\|\vec{w}\|}

where \vec{x}_p is the normal projection of \vec{x} onto H, and r is the algebraic distance from \vec{x} to the hyperplane.

Since g(\vec{x}_p) = 0, we have g(\vec{x}) = \vec{w}^t \vec{x} + w_0 = r \|\vec{w}\|, i.e. r = g(\vec{x}) / \|\vec{w}\|.

r is a signed distance: r > 0 if \vec{x} falls in R_1, r < 0 if \vec{x} falls in R_2.

The distance from the origin to the hyperplane is w_0 / \|\vec{w}\|.
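The signed distance r = g(\vec{x}) / \|\vec{w}\| can be checked numerically; a minimal sketch in plain Python (the weight vector and the test points are made-up examples, not from the slides):

```python
# Signed distance r = g(x)/||w|| of a point from the hyperplane g(x) = w.x + w0 = 0
import math

def g(w, w0, x):
    # linear discriminant function
    return sum(wi * xi for wi, xi in zip(w, x)) + w0

def signed_distance(w, w0, x):
    return g(w, w0, x) / math.sqrt(sum(wi * wi for wi in w))

w, w0 = [3.0, 4.0], -5.0                     # ||w|| = 5, hyperplane 3x1 + 4x2 - 5 = 0
print(signed_distance(w, w0, [3.0, 4.0]))    # 20/5 = 4.0  -> point in R1
print(signed_distance(w, w0, [0.0, 0.0]))    # -5/5 = -1.0 -> point in R2
```

Note that the origin comes out at distance w_0/\|\vec{w}\| = −1, matching the formula above.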
Multi-category Case I
2 approaches to extend the LDF approach to the multi-category case:

ω_i / not ω_i: reduce the problem to c − 1 two-class problems. Problem #i: find the function that separates points assigned to ω_i from those not assigned to ω_i.

ω_i / ω_j: find the c(c − 1)/2 linear discriminants, one for every pair of classes.

Both approaches can lead to regions in which the classification is undefined.
Pairwise Classification
Idea: build a model for each pair of classes, using only the training data from those two classes.
Problem: this means solving c(c − 1)/2 classification problems for c classes.
This turns out not to be a problem in many cases, because the training sets become small:

Assume the data is evenly distributed, i.e. 2n/c instances per learning problem for n instances in total.
Suppose the learning algorithm is linear in n.
Then the runtime of pairwise classification is proportional to

\frac{c(c − 1)}{2} × \frac{2n}{c} = (c − 1)n
Linear Machine I
Define c linear discriminant functions:

g_i(\vec{x}) = \vec{w}_i^t \vec{x} + w_{i0},  i = 1, . . . , c

Linear Machine classifier: \vec{x} ∈ ω_i if g_i(\vec{x}) > g_j(\vec{x}) for all j ≠ i.
In case of equal scores, the classification is undefined.

A LM divides the feature space into c decision regions, with g_i(\vec{x}) the largest discriminant if \vec{x} is in R_i.

If R_i and R_j are contiguous, the boundary between them is a portion of the hyperplane H_{ij} defined by:

g_i(\vec{x}) = g_j(\vec{x}), i.e. (\vec{w}_i − \vec{w}_j)^t \vec{x} + (w_{i0} − w_{j0}) = 0
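A linear machine is just an argmax over the c discriminants; a minimal sketch in Python (the weights and bias values are illustrative, not from the slides):

```python
# Linear machine: assign x to the class whose discriminant g_i(x) = w_i.x + w_i0 is largest
def linear_machine(weights, biases, x):
    scores = [sum(wi * xi for wi, xi in zip(w, x)) + b
              for w, b in zip(weights, biases)]
    return max(range(len(scores)), key=lambda i: scores[i])

# three classes in 2D (made-up weights)
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b = [0.0, 0.0, 0.5]
print(linear_machine(W, b, [2.0, 0.1]))   # g_0 = 2.0 is the largest -> class 0
print(linear_machine(W, b, [0.1, 2.0]))   # g_1 = 2.0 is the largest -> class 1
```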
Linear Machine II

It follows that \vec{w}_i − \vec{w}_j is normal to H_{ij}. The signed distance from \vec{x} to H_{ij} is:

(g_i(\vec{x}) − g_j(\vec{x})) / \|\vec{w}_i − \vec{w}_j\|

There are c(c − 1)/2 pairs of regions, and the decision regions are convex.
Not all regions are contiguous, and the total number of segments in the separating surfaces is often less than c(c − 1)/2.

(Figure: 3- and 5-class problems)
Generalized LDF I
The LDF is g(\vec{x}) = w_0 + \sum_{i=1}^d w_i x_i

Adding d(d + 1)/2 terms involving the products of pairs of components of \vec{x} yields the quadratic discriminant function:

g(\vec{x}) = w_0 + \sum_{i=1}^d w_i x_i + \sum_{i=1}^d \sum_{j=1}^d w_{ij} x_i x_j

The separating surface defined by g(\vec{x}) = 0 is a second-degree or hyperquadric surface.

Adding more terms, such as w_{ijk} x_i x_j x_k, we obtain polynomial discriminant functions.
Generalized LDF II
The generalized LDF is defined as

g(\vec{x}) = \sum_{i=1}^{\hat{d}} a_i y_i(\vec{x}) = \vec{a}^t \vec{y}

where \vec{a} is a \hat{d}-dimensional weight vector and the y_i(\vec{x}) are arbitrary functions of \vec{x}.

The resulting discriminant function is not linear in \vec{x}, but it is linear in \vec{y}.

The functions y_i(\vec{x}) map points in the d-dimensional \vec{x}-space to points in the \hat{d}-dimensional \vec{y}-space.
Generalized LDF III
Example: let the QDF be g(x) = a_1 + a_2 x + a_3 x^2

The 3-dimensional vector is then \vec{y} = (1, x, x^2)^t
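This mapping can be written directly: a discriminant that is quadratic in x becomes linear in \vec{y}. A minimal sketch (the coefficient values are illustrative):

```python
# Generalized LDF: map x to y(x) = (1, x, x^2), then apply a linear discriminant in y-space
def phi(x):
    # the feature map y(x)
    return [1.0, x, x * x]

def g(a, x):
    # linear in y, quadratic in x
    return sum(ai * yi for ai, yi in zip(a, phi(x)))

a = [-1.0, 0.0, 1.0]      # g(x) = -1 + x^2: positive exactly when |x| > 1
print(g(a, 2.0))          # 3.0
print(g(a, 0.5))          # -0.75
```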
2-class Linearly-Separable Case I
g(\vec{x}) = \sum_{i=0}^d w_i x_i = \vec{a}^t \vec{y}

where x_0 = 1 and
\vec{y}^t = [1 \vec{x}] = [1 x_1 · · · x_d] is an augmented feature vector, and
\vec{a}^t = [w_0 \vec{w}] = [w_0 w_1 · · · w_d] is an augmented weight vector.

The hyperplane decision surface H defined by \vec{a}^t \vec{y} = 0 passes through the origin in \vec{y}-space.

The distance from any point \vec{y} to H is given by \vec{a}^t \vec{y} / \|\vec{a}\| = g(\vec{x}) / \|\vec{a}\|.

Because \|\vec{a}\| = \sqrt{1 + \|\vec{w}\|^2}, this distance is less than the distance from \vec{x} to H.
2-class Linearly-Separable Case II
Problem: find \vec{a} = [w_0 \vec{w}]

Suppose we have a set of n examples {\vec{y}_1, . . . , \vec{y}_n}, each labeled ω_1 or ω_2.

Look for a weight vector \vec{a} that classifies all the examples correctly:

\vec{a}^t \vec{y}_i > 0 if \vec{y}_i is labeled ω_1, or
\vec{a}^t \vec{y}_i < 0 if \vec{y}_i is labeled ω_2

If such an \vec{a} exists, the examples are linearly separable.
2-class Linearly-Separable Case III
Solutions

Replacing all the examples labeled ω_2 by their negatives, one can look for a weight vector \vec{a} such that \vec{a}^t \vec{y}_i > 0 for all the examples. Such an \vec{a} is a.k.a. a separating vector or solution vector.

Each example \vec{y}_i places a constraint on the possible location of a solution vector: \vec{a}^t \vec{y}_i = 0 defines a hyperplane through the origin having \vec{y}_i as a normal vector.

The solution vector (if it exists) must be on the positive side of every hyperplane.

Solution region = intersection of the n half-spaces.
2-class Linearly-Separable Case IV
Any vector that lies in the solution region is a solution vector: the solution vector (if it exists) is not unique.
Additional requirements can be imposed to find a solution vector closer to the middle of the region (i.e. more likely to classify new examples correctly).
For example, seek a unit-length weight vector that maximizes the minimum distance from the examples to the hyperplane.
2-class Linearly-Separable Case V
Seek the minimum-length weight vector satisfying

\vec{a}^t \vec{y}_i ≥ b > 0

The solution region shrinks by the margins b / \|\vec{y}_i\|.
Gradient Descent I
Define a criterion function J(\vec{a}) that is minimized when \vec{a} is a solution vector: \vec{a}^t \vec{y}_i ≥ 0, ∀i = 1, . . . , n

Start with some arbitrary vector \vec{a}(1).
Compute the gradient vector ∇J(\vec{a}(1)).
The next value \vec{a}(2) is obtained by moving some distance from \vec{a}(1) in the direction of steepest descent, i.e. along the negative of the gradient.

In general, \vec{a}(k + 1) is obtained from \vec{a}(k) using

\vec{a}(k + 1) ← \vec{a}(k) − η(k) ∇J(\vec{a}(k))

where η(k) is the learning rate.
Gradient Descent & Delta Rule I
To understand the idea, consider a simpler linear machine (a.k.a. linear unit), where

o = w_0 + w_1 x_1 + · · · + w_n x_n

Let's learn the w_i's that minimize the squared error, i.e. J(\vec{w}) = E[\vec{w}]:

E[\vec{w}] ≡ \frac{1}{2} \sum_{d ∈ D} (t_d − o_d)^2

where D is the set of training examples ⟨\vec{x}, t⟩ and t is the target output value.
Gradient Descent & Delta Rule II
Gradient:

∇E[\vec{w}] ≡ [\frac{∂E}{∂w_0}, \frac{∂E}{∂w_1}, · · · , \frac{∂E}{∂w_n}]

Training rule:

Δ\vec{w} = −η ∇E[\vec{w}]

i.e.,

Δw_i = −η \frac{∂E}{∂w_i}

Note that η may be a constant.
Gradient Descent & Delta Rule III
\frac{∂E}{∂w_i} = \frac{∂}{∂w_i} \frac{1}{2} \sum_d (t_d − o_d)^2
              = \frac{1}{2} \sum_d \frac{∂}{∂w_i} (t_d − o_d)^2
              = \frac{1}{2} \sum_d 2 (t_d − o_d) \frac{∂}{∂w_i} (t_d − o_d)
              = \sum_d (t_d − o_d) \frac{∂}{∂w_i} (t_d − \vec{w} · \vec{x}_d)

\frac{∂E}{∂w_i} = \sum_d (t_d − o_d)(−x_{id})
Basic GRADIENT-DESCENT Algorithm
GRADIENT-DESCENT(D, η)
D: training set; η: learning rate (e.g. 0.5)

Initialize each w_i to some small random value
until the termination condition is met do
    Initialize each Δw_i to zero
    for each ⟨\vec{x}, t⟩ ∈ D do
        Input the instance \vec{x} to the unit and compute the output o
        for each w_i do
            Δw_i ← Δw_i + η(t − o) x_i
    for each weight w_i do
        w_i ← w_i + Δw_i
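The pseudocode above translates almost line by line into Python; a minimal sketch for a linear unit on a made-up noise-free 1D dataset (the learning rate and epoch count are illustrative choices, not from the slides):

```python
# Batch gradient descent (delta rule) for a linear unit o = w0*x0 + w1*x1, with x0 = 1
def gradient_descent(D, eta=0.05, epochs=500):
    w = [0.0, 0.0]                          # initial weights
    for _ in range(epochs):
        dw = [0.0, 0.0]
        for x, t in D:                      # accumulate eta*(t - o)*xi over all examples
            o = sum(wi * xi for wi, xi in zip(w, x))
            for i in range(len(w)):
                dw[i] += eta * (t - o) * x[i]
        for i in range(len(w)):             # one batch update per epoch
            w[i] += dw[i]
    return w

# targets generated by t = 1 + 2*x1 (noise-free), so GD should recover w close to [1, 2]
D = [([1.0, x], 1.0 + 2.0 * x) for x in [0.0, 1.0, 2.0, 3.0]]
print(gradient_descent(D))
```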
Incremental (Stochastic) GRADIENT DESCENT I
An approximation of the standard GRADIENT-DESCENT:

Batch GRADIENT-DESCENT:
Do until satisfied
    1. Compute the gradient ∇E_D[\vec{w}]
    2. \vec{w} ← \vec{w} − η ∇E_D[\vec{w}]

Incremental GRADIENT-DESCENT:
Do until satisfied
    For each training example d in D
        1. Compute the gradient ∇E_d[\vec{w}]
        2. \vec{w} ← \vec{w} − η ∇E_d[\vec{w}]
Incremental (Stochastic) GRADIENT DESCENT II
E_D[\vec{w}] ≡ \frac{1}{2} \sum_{d ∈ D} (t_d − o_d)^2

E_d[\vec{w}] ≡ \frac{1}{2} (t_d − o_d)^2

Training rule (delta rule):

Δw_i ← η(t − o) x_i

Similar to the perceptron training rule, yet unthresholded.
Convergence is only asymptotically guaranteed.
Linear separability is no longer needed!
Standard vs. Stochastic GRADIENT-DESCENT
Incremental GD can approximate batch GD arbitrarily closely if η is made small enough.

In standard GD the error is summed over all examples before updating the weights; in stochastic GD the weights are updated upon each example.
Standard GD is more costly per update step but can employ a larger η.
Stochastic GD may avoid falling into local minima because it uses E_d instead of E_D.
Newton’s Algorithm
J(\vec{a}) ≃ J(\vec{a}(k)) + ∇J^t (\vec{a} − \vec{a}(k)) + \frac{1}{2} (\vec{a} − \vec{a}(k))^t H (\vec{a} − \vec{a}(k))

where H = [\frac{∂^2 J}{∂a_i ∂a_j}] is the Hessian matrix.

Choose \vec{a}(k + 1) to minimize this function:

\vec{a}(k + 1) ← \vec{a}(k) − H^{−1} ∇J(\vec{a})

Greater improvement per step than GD, but not applicable when H is singular.
Time complexity O(d^3).
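For a quadratic criterion the Newton step lands on the minimum in a single iteration; a minimal sketch with a hand-coded Hessian inverse (the quadratic J below is an illustrative choice, not from the slides):

```python
# Newton's step a(k+1) = a(k) - H^{-1} grad J(a(k)) on an illustrative quadratic
# J(a) = (a1 - 1)^2 + 2*(a2 + 3)^2, minimized at (1, -3)
def grad(a):
    return [2.0 * (a[0] - 1.0), 4.0 * (a[1] + 3.0)]

H_inv = [[0.5, 0.0], [0.0, 0.25]]          # inverse of the (diagonal) Hessian diag(2, 4)

def newton_step(a):
    g = grad(a)
    return [a[i] - sum(H_inv[i][j] * g[j] for j in range(2)) for i in range(2)]

print(newton_step([0.0, 0.0]))             # one step reaches the minimum: [1.0, -3.0]
```

Since J is exactly quadratic here, the second-order expansion is exact and one step suffices; gradient descent would need many iterations.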
Perceptron I
Assumption: the data is linearly separable.

Hyperplane: \sum_{i=0}^d w_i x_i = 0, assuming there is a constant attribute x_0 = 1 (bias).

Algorithm for learning the separating hyperplane: the perceptron learning rule.

Classifier: if \sum_{i=0}^d w_i x_i > 0 then predict ω_1 (or +1), otherwise predict ω_2 (or −1).
Perceptron II
Thresholded output:

o(x_1, . . . , x_n) = +1 if w_0 + w_1 x_1 + · · · + w_n x_n > 0, −1 otherwise.

Simpler vector notation:

o(\vec{x}) = sgn(\vec{w}^t \vec{x}) = +1 if \vec{w}^t \vec{x} > 0, −1 otherwise.

Space of the hypotheses: {\vec{w} | \vec{w} ∈ R^{n+1}}
Decision Surface of a Perceptron
A perceptron can represent some useful functions.
What weights represent g(x_1, x_2) = AND(x_1, x_2)? (One choice: w_0 = −0.8, w_1 = w_2 = 0.5.)

But some functions are not representable, e.g. those that are not linearly separable (XOR).
Therefore, we'll want networks of these units...
Perceptron Training Rule I
Perceptron criterion function:

J(\vec{a}) = \sum_{\vec{y} ∈ Y(\vec{a})} (−\vec{a}^t \vec{y})

where Y(\vec{a}) is the set of examples misclassified by \vec{a}.
If no examples are misclassified, Y(\vec{a}) is empty and J(\vec{a}) = 0 (i.e. \vec{a} is a solution vector).
J(\vec{a}) ≥ 0, since \vec{a}^t \vec{y}_i ≤ 0 if \vec{y}_i is misclassified.

Geometrically, J(\vec{a}) is proportional to the sum of the distances from the misclassified examples to the decision boundary.

Since ∇J = \sum_{\vec{y} ∈ Y(\vec{a})} (−\vec{y}), the update rule becomes

\vec{a}(k + 1) ← \vec{a}(k) + η(k) \sum_{\vec{y} ∈ Y_k(\vec{a})} \vec{y}

where Y_k(\vec{a}) is the set of examples misclassified by \vec{a}(k).
Perceptron Training I
Set all coefficients a_i to zero
do
    for each instance \vec{y} in the training data
        if \vec{y} is classified incorrectly by the perceptron
            if \vec{y} belongs to ω_1, add it to \vec{a}
            else subtract it from \vec{a}
until all instances in the training data are classified correctly
return \vec{a}
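The loop above, with the ω_2 examples replaced by their negatives (so every example should satisfy \vec{a}^t \vec{y} > 0), can be sketched as follows; the toy dataset is made up and linearly separable:

```python
# Fixed-increment perceptron: add each misclassified (sign-normalized) example to a
def perceptron(Y, max_epochs=100):
    a = [0.0] * len(Y[0])                  # start from the zero vector
    for _ in range(max_epochs):
        errors = 0
        for y in Y:                        # y is already negated for omega_2 examples
            if sum(ai * yi for ai, yi in zip(a, y)) <= 0:
                a = [ai + yi for ai, yi in zip(a, y)]
                errors += 1
        if errors == 0:                    # all a.y > 0: solution vector found
            return a
    return a

# augmented examples [1, x1, x2]; the two omega_2 examples appear negated
Y = [[1, 2.0, 1.0], [1, 1.5, 2.0], [-1, 1.0, -2.0], [-1, 0.5, -1.5]]
a = perceptron(Y)
print(a, all(sum(ai * yi for ai, yi in zip(a, y)) > 0 for y in Y))
```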
Perceptron Training II
BATCH PERCEPTRON TRAINING

Initialize \vec{a}, η, θ, k ← 0
do
    k ← k + 1
    \vec{a} ← \vec{a} + η(k) \sum_{\vec{y} ∈ Y_k} \vec{y}
until |η(k) \sum_{\vec{y} ∈ Y_k} \vec{y}| < θ
return \vec{a}

Convergence can be proved, provided that the training data is linearly separable and η is sufficiently small.
Perceptron Training III
Why does this work?
Consider the situation where an instance pertaining to the first class has been added:

(a_0 + y_0) y_0 + (a_1 + y_1) y_1 + (a_2 + y_2) y_2 + . . . + (a_d + y_d) y_d

This means the output for \vec{a} has increased by:

y_0 y_0 + y_1 y_1 + y_2 y_2 + . . . + y_d y_d

which is always positive; thus the hyperplane has moved in the correct direction (and the output decreases for instances of the other class).
Perceptron Training IV
Example with η = 1 and \vec{a}(1) = \vec{0}.
Sequence of misclassified instances: \vec{y}_1 + \vec{y}_2 + \vec{y}_3, \vec{y}_2, \vec{y}_3, \vec{y}_1, \vec{y}_3, then stop.
Perceptron
Simplification:

FIXED-INCREMENT SINGLE-EXAMPLE PERCEPTRON

input: {\vec{y}^{(k)}}_{k=1}^n training examples
begin initialize \vec{a}, k ← 0
    do k ← (k + 1) mod n
        if \vec{y}^{(k)} is misclassified by the model based on \vec{a}
        then \vec{a} ← \vec{a} + \vec{y}^{(k)}
    until all examples properly classified
    return \vec{a}
end
Generalizations I
VARIABLE-INCREMENT PERCEPTRON WITH MARGIN

begin
    initialize \vec{a}, θ, margin b, η, k ← 0
    do
        k ← (k + 1) mod n
        if \vec{a}^t \vec{y}^{(k)} ≤ b then \vec{a} ← \vec{a} + η(k) \vec{y}^{(k)}
    until \vec{a}^t \vec{y}^{(k)} > b for all k
    return \vec{a}
end
Generalizations II
BATCH VARIABLE-INCREMENT PERCEPTRON

begin
    initialize \vec{a}, η, k ← 0
    do
        k ← (k + 1) mod n
        Y_k ← ∅
        j ← 0
        do
            j ← j + 1
            if \vec{y}_j is misclassified then Y_k ← Y_k ∪ {\vec{y}_j}
        until j = n
        \vec{a} ← \vec{a} + η(k) \sum_{\vec{y} ∈ Y_k} \vec{y}
    until Y_k = ∅
    return \vec{a}
end
Comments
The perceptron adjusts the parameters only when it encounters an error, i.e. a misclassified training example.
Correctly classified examples can be ignored.
The learning rate η can be chosen arbitrarily: it only impacts the norm of the final \vec{a} (and the corresponding magnitude of a_0).
The final weight vector \vec{a} is a linear combination of training points.
Nonseparable Case
The perceptron is an error-correcting procedure: it converges when the examples are linearly separable.
Even if a separating vector is found for the training examples, it does not follow that the resulting classifier will perform well on independent test data.
To ensure that the performance on training and test data will be similar, many training examples should be used.
Sufficiently large training sets are almost certainly not linearly separable.
No weight vector can correctly classify every example in a nonseparable set.
The corrections may never cease if the set is nonseparable.
Learning rate
If we choose η(k) → 0 as k → ∞, then performance can be acceptable on non-separable problems while preserving the ability to find a solution on separable problems.
η(k) can be considered a function of recent performance, decreasing as performance improves, e.g. η(k) ← η/k.
The rate at which η(k) approaches zero is important:

Too slow: the result will be sensitive to those examples that render the set non-separable.
Too fast: the procedure may converge prematurely with sub-optimal results.
Linear Models: WINNOW
Another mistake-driven algorithm for finding a separating hyperplane.
Assumes binary attributes (i.e. propositional variables).
Main difference: multiplicative instead of additive updates; weights are multiplied by a parameter α > 1 (or its inverse).
Another difference: a user-specified threshold parameter θ.
Predict the first class if w_0 + w_1 x_1 + w_2 x_2 + · · · + w_k x_k > θ.
The Algorithm I
WINNOW
initialize \vec{a}, α
while some instances are misclassified
    for each instance \vec{y} in the training data
        classify \vec{y} using the current model \vec{a}
        if the predicted class is incorrect
            if \vec{y} belongs to the target class
                for each attribute y_i = 1, multiply a_i by α
                (if y_i = 0, a_i is left unchanged)
            otherwise
                for each attribute y_i = 1, divide a_i by α
                (if y_i = 0, a_i is left unchanged)
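The loop above can be sketched as follows; the binary dataset (the target class is simply the value of the first attribute), α = 2 and θ = 1 are illustrative choices:

```python
# WINNOW: multiplicative weight updates over binary attributes
def winnow(data, alpha=2.0, theta=1.0, max_epochs=100):
    n = len(data[0][0])
    a = [1.0] * n                              # start with uniform weights
    for _ in range(max_epochs):
        mistakes = 0
        for y, target in data:                 # target is 1 (target class) or 0
            predicted = 1 if sum(ai * yi for ai, yi in zip(a, y)) > theta else 0
            if predicted != target:
                mistakes += 1
                for i in range(n):
                    if y[i] == 1:              # only active attributes are updated
                        a[i] = a[i] * alpha if target == 1 else a[i] / alpha
        if mistakes == 0:
            return a
    return a

# target class = value of the first attribute; the other two attributes are irrelevant
data = [([1, 0, 1], 1), ([1, 1, 0], 1), ([0, 1, 1], 0), ([0, 0, 1], 0)]
a = winnow(data)
print(a)
```

On this toy set the weight of the relevant first attribute stays high while the irrelevant ones get demoted, illustrating WINNOW's attribute efficiency.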
The Algorithm II
WINNOW is very effective at homing in on the relevant features (it is attribute-efficient).

It can also be used in an on-line setting in which new instances arrive continuously (like the perceptron algorithm).
Balanced WINNOW I
WINNOW doesn't allow negative weights, and this can be a drawback in some applications.

BALANCED WINNOW maintains two weight vectors, one for each class: \vec{a}^+ and \vec{a}^−.

An instance is classified as belonging to the first class (of the two) if:

(a_0^+ − a_0^−) + (a_1^+ − a_1^−) y_1 + (a_2^+ − a_2^−) y_2 + · · · + (a_k^+ − a_k^−) y_k > θ
Balanced WINNOW II
BALANCED WINNOW
while some instances are misclassified
    for each instance \vec{y} in the training data
        classify \vec{y} using the current weights
        if the predicted class is incorrect
            if \vec{y} belongs to the first class
                for each attribute y_i = 1,
                    multiply a_i^+ by α and divide a_i^− by α
                    (if y_i = 0, leave a_i^+ and a_i^− unchanged)
            otherwise
                for each attribute y_i = 1,
                    multiply a_i^− by α and divide a_i^+ by α
                    (if y_i = 0, leave a_i^+ and a_i^− unchanged)
Minimum Squared Error Approach I
Minimum Squared Error (MSE)

It trades the ability to obtain a separating vector for good performance on both separable and non-separable problems.
Previously, we sought a weight vector \vec{a} making all of the inner products \vec{a}^t \vec{y}_i ≥ 0.
In the MSE procedure, one tries to make \vec{a}^t \vec{y}_i = b_i, where the b_i are some arbitrarily specified positive constants.

Using matrix notation: Y\vec{a} = \vec{b}.
If Y were nonsingular, then \vec{a} = Y^{−1} \vec{b}.
Unfortunately Y is not a square matrix: it usually has more rows than columns.
When there are more equations than unknowns, \vec{a} is overdetermined, and ordinarily no exact solution exists.
Minimum Squared Error Approach II
We can seek a weight vector \vec{a} that minimizes some function of an error vector \vec{e} = Y\vec{a} − \vec{b}.
Minimizing the squared length of the error vector is equivalent to minimizing the sum-of-squared-error criterion function

J(\vec{a}) = \|Y\vec{a} − \vec{b}\|^2 = \sum_{i=1}^n (\vec{a}^t \vec{y}_i − b_i)^2

whose gradient is

∇J = 2 \sum_{i=1}^n (\vec{a}^t \vec{y}_i − b_i) \vec{y}_i = 2 Y^t (Y\vec{a} − \vec{b})

Setting the gradient equal to zero, the following necessary condition holds: Y^t Y \vec{a} = Y^t \vec{b}
Minimum Squared Error Approach III
Y^t Y is a square matrix which is often nonsingular. Therefore, solving for \vec{a}:

\vec{a} = (Y^t Y)^{−1} Y^t \vec{b} = Y^+ \vec{b}

where Y^+ = (Y^t Y)^{−1} Y^t is the pseudo-inverse of Y.

Y^+ can also be written as lim_{ε→0} (Y^t Y + εI)^{−1} Y^t, and it can be shown that this limit always exists; hence

\vec{a} = Y^+ \vec{b}

is the MSE solution to the problem Y\vec{a} = \vec{b}.
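The MSE solution can be computed by solving the normal equations Y^t Y \vec{a} = Y^t \vec{b}; a minimal sketch with a tiny Gaussian-elimination solver (the data matrix and margins below are made-up examples):

```python
# MSE / pseudo-inverse solution: solve the normal equations Y^t Y a = Y^t b
def solve(A, c):
    """Gaussian elimination with partial pivoting (A square and small)."""
    n = len(A)
    M = [row[:] + [ci] for row, ci in zip(A, c)]    # augmented matrix
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(M[r][k]))
        M[k], M[p] = M[p], M[k]
        for r in range(k + 1, n):
            f = M[r][k] / M[k][k]
            for j in range(k, n + 1):
                M[r][j] -= f * M[k][j]
    x = [0.0] * n
    for k in range(n - 1, -1, -1):                  # back substitution
        x[k] = (M[k][n] - sum(M[k][j] * x[j] for j in range(k + 1, n))) / M[k][k]
    return x

def mse_solution(Y, b):
    n = len(Y[0])
    YtY = [[sum(row[i] * row[j] for row in Y) for j in range(n)] for i in range(n)]
    Ytb = [sum(row[i] * bi for row, bi in zip(Y, b)) for i in range(n)]
    return solve(YtY, Ytb)

Y = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]   # augmented examples
b = [1.0, 3.0, 5.0, 7.0]                               # margins, here exactly b = 1 + 2x
print(mse_solution(Y, b))                              # close to [1.0, 2.0]
```

In practice one would use a library pseudo-inverse or least-squares routine; the explicit solver is shown only to make the normal equations concrete.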
WIDROW-HOFF procedure a.k.a. LMS I
The criterion function J(\vec{a}) = \|Y\vec{a} − \vec{b}\|^2 can also be minimized by a gradient descent procedure.

Advantages:
Avoids the problems that arise when Y^t Y is singular.
Avoids the need for working with large matrices.

Since ∇J = 2 Y^t (Y\vec{a} − \vec{b}), a simple update rule would be:

\vec{a}(1) arbitrary
\vec{a}(k + 1) = \vec{a}(k) + η(k) Y^t (\vec{b} − Y\vec{a}(k))

or, if we consider the examples sequentially:

\vec{a}(1) arbitrary
\vec{a}(k + 1) = \vec{a}(k) + η(k) [b_k − \vec{a}(k)^t \vec{y}^{(k)}] \vec{y}^{(k)}
WIDROW-HOFF procedure a.k.a. LMS II
LMS({\vec{y}_i}_{i=1}^n)

input {\vec{y}_i}_{i=1}^n: training examples
begin
    initialize \vec{a}, \vec{b}, θ, η(·), k ← 0
    do
        k ← (k + 1) mod n
        \vec{a} ← \vec{a} + η(k) (b_k − \vec{a}^t \vec{y}^{(k)}) \vec{y}^{(k)}
    until |η(k) (b_k − \vec{a}^t \vec{y}^{(k)}) \vec{y}^{(k)}| < θ
    return \vec{a}
end
Summary
The perceptron training rule is guaranteed to succeed if:
the training examples are linearly separable;
the learning rate η is sufficiently small.

The linear unit training rule uses gradient descent; it is guaranteed to converge to the hypothesis with minimum squared error:
given a sufficiently small learning rate η;
even when the training data contains noise;
even when the training data is not separable by H.
Linear Regression
Standard technique for numeric prediction. The outcome is a linear combination of the attributes:

x = w_0 + w_1 x_1 + w_2 x_2 + · · · + w_d x_d

The weights \vec{w} are calculated from the training data by standard matrix algorithms.

Predicted value for the first training instance \vec{x}^{(1)}:

w_0 + w_1 x_1^{(1)} + w_2 x_2^{(1)} + · · · + w_d x_d^{(1)} = \sum_{j=0}^d w_j x_j^{(1)}

assuming extended vectors with x_0 = 1.
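With extended vectors the prediction is a single dot product; a minimal sketch with made-up weights:

```python
# Linear regression prediction: sum_{j=0}^{d} w_j x_j with the extended attribute x0 = 1
def predict(w, x):
    extended = [1.0] + list(x)             # prepend x0 = 1
    return sum(wj * xj for wj, xj in zip(w, extended))

w = [0.5, 2.0, -1.0]                       # w0, w1, w2 (illustrative values)
print(predict(w, [3.0, 1.0]))              # 0.5 + 6.0 - 1.0 = 5.5
```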
Probabilistic Classification
Multiresponse Linear Regression (MLR)

Any regression technique can be used for classification.

Training: perform a regression for each class, setting the output to 1 for training instances that belong to the class and 0 for those that don't; this yields a linear model g_i per class.

Prediction: compute each linear expression g_i for the new instance and predict the class corresponding to the model with the largest output value (membership value).
Logistic Regression I
MLR drawbacks:

1. membership values are not proper probabilities: they can fall outside [0, 1];
2. least squares regression assumes that the errors are not only statistically independent, but also normally distributed with the same standard deviation.

The logit transformation does not suffer from these problems.
It builds a linear model for a transformed target variable.
Assume we have two classes.
Logistic Regression II
Logistic regression replaces the target

Pr(1 | \vec{x})

which cannot be approximated well using a linear function, with the target

log( Pr(1 | \vec{x}) / (1 − Pr(1 | \vec{x})) )

This transformation maps [0, 1] to (−∞, +∞).
Logistic Regression III
(Figure: the logit transformation function)
Example: Logistic Regression Model
Resulting model:

Pr(1 | \vec{y}) = 1 / (1 + e^{−(a_0 + a_1 y_1 + a_2 y_2 + · · · + a_d y_d)})

Example: a model with a_0 = 0.5 and a_1 = 1.

The parameters are induced from the data using maximum likelihood.
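The resulting model is a sigmoid applied to a linear score; a sketch using the one-attribute example model with a_0 = 0.5 and a_1 = 1 (the input values are made up):

```python
# Logistic model: Pr(1|y) = 1 / (1 + exp(-(a0 + a1*y1 + ... + ad*yd)))
import math

def prob_first_class(a, y):
    score = a[0] + sum(ai * yi for ai, yi in zip(a[1:], y))
    return 1.0 / (1.0 + math.exp(-score))

a = [0.5, 1.0]                       # a0 = 0.5, a1 = 1 as in the example model
print(prob_first_class(a, [-0.5]))   # score 0 -> probability 0.5
print(prob_first_class(a, [3.0]))    # score 3.5 -> close to 1
```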
Maximum Likelihood
Aim: maximize the probability of the training data with respect to the parameters.
One can use logarithms of probabilities and maximize the log-likelihood of the model (instead of minimizing the squared error):

\sum_{i=1}^n (1 − x^{(i)}) log(1 − Pr(1 | \vec{y}^{(i)})) + x^{(i)} log(Pr(1 | \vec{y}^{(i)}))

where the x^{(i)}'s are the responses (either 0 or 1).

The weights a_i need to be chosen to maximize the log-likelihood; a relatively simple method is iteratively re-weighted least squares.
Credits
R. Duda, P. Hart, D. Stork: Pattern Classification, Wiley
T. M. Mitchell: Machine Learning, McGraw Hill
I. Witten & E. Frank: Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann