Discriminant analysis and classification
ISyE 6416, Spring 2015, Yao Xie
Classification
Classification is a predictive task in which the response takes values across discrete categories (i.e., not continuous), and in the most fundamental case, two categories.
Examples:

- Predicting whether a patient will develop breast cancer or remain healthy, given genetic information
- Predicting whether or not a user will like a new product, based on user covariates and a history of his/her previous ratings
- Predicting the region of Italy in which a brand of olive oil was made, based on its chemical composition
- Predicting the next elected president, based on various social, political, and historical measurements
Constructed from training data (x_i, y_i), i = 1, ..., n, we denote our classification rule by f̂(x); given any x ∈ ℝ^p, this returns a class label f̂(x) ∈ {1, ..., K}.

As before, we will see that there are two different ways of assessing the quality of f̂: its predictive ability and interpretative ability.

E.g., train on (x_i, y_i), i = 1, ..., n, the data of elected presidents and related feature measurements x_i ∈ ℝ^p for the past n elections, and predict, given the current feature measurements x_0 ∈ ℝ^p, the winner of the current election.

In what situations would we care more about prediction error? And in what situations more about interpretation?
Linear discriminant analysis
Using the Bayes classifier is not realistic, as it requires knowing the class conditional densities P(X = x | C = j) and the prior probabilities π_j. But if we estimate these quantities, then we can follow the same idea.

Linear discriminant analysis (LDA) does this by assuming that the data within each class are normally distributed:

h_j(x) = P(X = x | C = j) = N(μ_j, Σ) density

We allow each class to have its own mean μ_j ∈ ℝ^p, but we assume a common covariance matrix Σ ∈ ℝ^{p×p}. Hence
h_j(x) = 1 / ((2π)^{p/2} det(Σ)^{1/2}) · exp{ −(1/2)(x − μ_j)^T Σ^{−1} (x − μ_j) }
So, by Bayes' rule, we want to find j so that P(C = j | X = x) ∝ h_j(x) · π_j is largest.
Since log(·) is a monotone function, we can consider maximizing log(h_j(x) π_j) over j = 1, ..., K. We can define the rule:

f^LDA(x) = argmax_{j=1,...,K} log(h_j(x) π_j)
         = argmax_{j=1,...,K} [ −(1/2)(x − μ_j)^T Σ^{−1} (x − μ_j) + log π_j ]
         = argmax_{j=1,...,K} [ x^T Σ^{−1} μ_j − (1/2) μ_j^T Σ^{−1} μ_j + log π_j ]
         = argmax_{j=1,...,K} δ_j(x)

(at each step we drop additive terms that do not depend on j). We call δ_j(x), j = 1, ..., K the discriminant functions. Note

δ_j(x) = x^T Σ^{−1} μ_j − (1/2) μ_j^T Σ^{−1} μ_j + log π_j

is just an affine function of x.
In practice, given an input x ∈ ℝ^p, can we just use the rule f^LDA on the previous slide? Not quite! What's missing: we don't know π_j, μ_j, and Σ. Therefore we estimate them based on the training data x_i ∈ ℝ^p and y_i ∈ {1, ..., K}, i = 1, ..., n, by:

- π̂_j = n_j / n, the proportion of observations in class j
- μ̂_j = (1/n_j) ∑_{y_i = j} x_i, the centroid of class j
- Σ̂ = 1/(n − K) ∑_{j=1}^{K} ∑_{y_i = j} (x_i − μ̂_j)(x_i − μ̂_j)^T, the pooled sample covariance matrix

(Here n_j is the number of points in class j.)

This gives the estimated discriminant functions:

δ̂_j(x) = x^T Σ̂^{−1} μ̂_j − (1/2) μ̂_j^T Σ̂^{−1} μ̂_j + log π̂_j

and finally the linear discriminant analysis rule,

f̂^LDA(x) = argmax_{j=1,...,K} δ̂_j(x)
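The estimates above fit in a few lines of numpy. This is a minimal sketch, not a reference implementation; the names `lda_fit` and `lda_predict` are illustrative, and classes are coded 0, ..., K−1.

```python
import numpy as np

def lda_fit(X, y, K):
    """Estimate the LDA parameters pi_hat, mu_hat, Sigma_hat.

    X: (n, p) feature matrix; y: (n,) integer labels in {0, ..., K-1}.
    """
    n, p = X.shape
    priors = np.array([np.mean(y == j) for j in range(K)])      # pi_hat_j = n_j / n
    mus = np.array([X[y == j].mean(axis=0) for j in range(K)])  # class centroids
    # Pooled covariance: within-class scatter summed over classes, divided by n - K
    Sigma = sum((X[y == j] - mus[j]).T @ (X[y == j] - mus[j])
                for j in range(K)) / (n - K)
    return priors, mus, Sigma

def lda_predict(x, priors, mus, Sigma):
    """Return argmax_j of the estimated discriminant delta_hat_j(x)."""
    Sinv = np.linalg.inv(Sigma)
    deltas = [x @ Sinv @ m - 0.5 * m @ Sinv @ m + np.log(pi)
              for pi, m in zip(priors, mus)]
    return int(np.argmax(deltas))
```

On two well-separated Gaussian clouds, each class centroid is classified into its own class.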
LDA decision boundaries

The estimated discriminant functions

δ̂_j(x) = x^T Σ̂^{−1} μ̂_j − (1/2) μ̂_j^T Σ̂^{−1} μ̂_j + log π̂_j = a_j + b_j^T x

are just affine functions of x. The decision boundary between classes j, k is the set of all x ∈ ℝ^p such that δ̂_j(x) = δ̂_k(x), i.e.,

a_j + b_j^T x = a_k + b_k^T x

This defines an affine subspace in x:

a_j − a_k + (b_j − b_k)^T x = 0
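The boundary coefficients follow directly from the estimated parameters. A sketch (the helper name `lda_boundary` is hypothetical) that reads off a_j = −(1/2) μ_j^T Σ^{−1} μ_j + log π_j and b_j = Σ^{−1} μ_j:

```python
import numpy as np

def lda_boundary(mu_j, mu_k, Sigma, pi_j, pi_k):
    """Return (c, w) for the boundary {x : c + w^T x = 0},
    where c = a_j - a_k and w = b_j - b_k from the affine discriminants."""
    Sinv = np.linalg.inv(Sigma)
    a_j = -0.5 * mu_j @ Sinv @ mu_j + np.log(pi_j)
    a_k = -0.5 * mu_k @ Sinv @ mu_k + np.log(pi_k)
    return a_j - a_k, Sinv @ (mu_j - mu_k)
```

With identity covariance and equal priors, centroids at the origin and (2, 0) give the vertical bisector x_1 = 1; unequal priors shift the boundary toward the less likely class.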
Example: LDA decision boundaries
Example of decision boundaries from LDA (from ESL page 109):

[Figure omitted: decision boundaries of the rule f^LDA(x) and the estimated rule f̂^LDA(x)]

Are the decision boundaries the same as the perpendicular bisectors (Voronoi boundaries) between the class centroids? (Why not?)
The Iris flower data set or Fisher's Iris data set is a multivariate data set introduced by Sir Ronald Fisher (1936) as an example of discriminant analysis.[1]

The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimetres. Based on the combination of these four features, Fisher developed a linear discriminant model to distinguish the species from each other.
LDA computations, usages, extensions

The decision boundaries for LDA are useful for graphical purposes, but to classify a new point x_0 ∈ ℝ^p we don't use them; we simply compute δ̂_j(x_0) for each j = 1, ..., K.

LDA performs quite well on a wide variety of data sets, even when pitted against fancy alternative classification schemes. Though it assumes normality, its simplicity often works in its favor. (Why? Think of the bias-variance tradeoff.)

Still, there are some useful extensions of LDA. E.g.,

- Quadratic discriminant analysis: using the same normal model, we now allow each class j to have its own covariance matrix Σ_j. This leads to quadratic decision boundaries
- Reduced-rank linear discriminant analysis: we essentially project the data to a lower dimensional subspace before performing LDA. We will study this next time
LDA computations and sphering

Note that LDA equivalently minimizes over j = 1, ..., K,

(1/2)(x − μ̂_j)^T Σ̂^{−1} (x − μ̂_j) − log π̂_j

It helps to factorize Σ̂ (i.e., compute its eigendecomposition):

Σ̂ = U D U^T

where U ∈ ℝ^{p×p} has orthonormal columns (and rows), and D = diag(d_1, ..., d_p) with d_j ≥ 0 for each j. Then we have Σ̂^{−1} = U D^{−1} U^T, and

(x − μ̂_j)^T Σ̂^{−1} (x − μ̂_j) = ‖ D^{−1/2} U^T x − D^{−1/2} U^T μ̂_j ‖₂² = ‖ x̃ − μ̃_j ‖₂²

where x̃ = D^{−1/2} U^T x and μ̃_j = D^{−1/2} U^T μ̂_j. This is just the squared distance between x̃ and μ̃_j.
Hence the LDA procedure can be described as:

1. Compute the sample estimates π̂_j, μ̂_j, Σ̂
2. Factor Σ̂, as in Σ̂ = U D U^T
3. Transform the class centroids: μ̃_j = D^{−1/2} U^T μ̂_j
4. Given any point x ∈ ℝ^p, transform to x̃ = D^{−1/2} U^T x ∈ ℝ^p, and then classify according to the nearest centroid in the transformed space, adjusting for class proportions: this is the class j for which (1/2)‖x̃ − μ̃_j‖₂² − log π̂_j is smallest

What is this transformation doing? Think about applying it to the observations:

x̃_i = D^{−1/2} U^T x_i, i = 1, ..., n

This is basically sphering the data points, because if we think of x ∈ ℝ^p as a random variable with covariance matrix Σ̂, then

Cov(D^{−1/2} U^T x) = D^{−1/2} U^T Σ̂ U D^{−1/2} = I
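The sphering identity can be checked numerically. A minimal sketch, assuming a synthetic positive-definite matrix standing in for Σ̂:

```python
import numpy as np

# Sphering: factor Sigma_hat = U D U^T and map x -> D^{-1/2} U^T x.
# After this transform the covariance is the identity, so LDA reduces
# to nearest-centroid classification adjusted by log-priors.
rng = np.random.default_rng(1)
A = rng.normal(size=(3, 3))
Sigma = A @ A.T + 3 * np.eye(3)   # a positive-definite stand-in for Sigma_hat

d, U = np.linalg.eigh(Sigma)      # eigendecomposition: Sigma = U diag(d) U^T
W = np.diag(d ** -0.5) @ U.T      # sphering matrix D^{-1/2} U^T

I = W @ Sigma @ W.T               # D^{-1/2} U^T Sigma U D^{-1/2} = identity
```

Here `np.linalg.eigh` is used because Σ̂ is symmetric, so its eigendecomposition is real with orthonormal U.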
Generally speaking, logistic regression is more flexible because it doesn't assume anything about the distribution of X. LDA assumes that X is normally distributed within each class, so that its marginal distribution is a mixture of normal distributions:

X ~ ∑_{j=1}^{K} π_j N(μ_j, Σ)

This means that logistic regression is more robust to situations in which the class conditional densities are not normal (and to outliers).

On the other side, if the true class conditional densities are normal, or close to it, LDA will be more efficient, meaning that for logistic regression to perform comparably it will need more data.

In practice they tend to perform similarly in a variety of situations (as claimed by the ESL book on page 128).
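The mixture marginal above can be simulated by first drawing a class from the priors, then drawing from that class's normal. A small numpy sketch with made-up parameters (two classes, priors 0.3 and 0.7):

```python
import numpy as np

# Under the LDA model the marginal of X is sum_j pi_j N(mu_j, Sigma):
rng = np.random.default_rng(0)
pis = np.array([0.3, 0.7])
mus = np.array([[0.0, 0.0], [5.0, 5.0]])
Sigma = np.eye(2)

labels = rng.choice(len(pis), size=2000, p=pis)              # class draws
X = np.array([rng.multivariate_normal(mus[c], Sigma) for c in labels])

# The mixture mean is sum_j pi_j mu_j = 0.7 * [5, 5] = [3.5, 3.5]
```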
- LDA is not necessarily bad when assumptions about the density functions are violated
- In certain cases, LDA has poor results

LDA applied to simulated data sets. Left: The true within-class densities are Gaussian with identical covariance matrices across classes. Right: The true within-class densities are mixtures of two Gaussians.
Jia Li, http://www.stat.psu.edu/~jiali
Quadratic discriminant analysis
Quadratic Discriminant Analysis (QDA)

- Estimate the covariance matrix Σ_k separately for each class k, k = 1, 2, ..., K.
- Quadratic discriminant function:

  δ_k(x) = −(1/2) log|Σ_k| − (1/2)(x − μ_k)^T Σ_k^{−1} (x − μ_k) + log π_k.

- Classification rule:

  G(x) = argmax_k δ_k(x).

- Decision boundaries are quadratic equations in x.
- QDA fits the data better than LDA, but has more parameters to estimate.
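A sketch of the quadratic discriminant function above (the function name `qda_delta` is illustrative; `slogdet` computes log|Σ_k| stably):

```python
import numpy as np

def qda_delta(x, mu_k, Sigma_k, pi_k):
    """Quadratic discriminant delta_k(x) with a class-specific covariance."""
    _, logdet = np.linalg.slogdet(Sigma_k)        # log|Sigma_k|
    diff = x - mu_k
    return (-0.5 * logdet
            - 0.5 * diff @ np.linalg.inv(Sigma_k) @ diff
            + np.log(pi_k))
```

When every Σ_k is the same and priors are equal, comparing these values reduces to nearest-centroid classification, recovering the LDA behavior.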
Violation of assumption

Left: Decision boundaries by LDA. Right: Decision boundaries obtained by modeling each class by a mixture of two Gaussians.
Mixture discriminant analysis
Mixture Discriminant Analysis

- A method for classification (supervised) based on mixture models.
- Extension of linear discriminant analysis.
- The mixture of normals is used to obtain a density estimate for each class.
Mixture Discriminant Analysis

- A single Gaussian to model a class, as in LDA, is too restricted.
- Extend to a mixture of Gaussians. For class k, the within-class density is:

  f_k(x) = ∑_{r=1}^{R_k} π_{kr} φ(x | μ_{kr}, Σ)

- A common covariance matrix is still assumed.
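The within-class mixture density f_k can be evaluated directly from its definition. A sketch with an illustrative name, where φ is the multivariate normal density with shared covariance Σ:

```python
import numpy as np

def mda_density(x, pis_k, mus_k, Sigma):
    """f_k(x) = sum_r pi_kr * phi(x | mu_kr, Sigma):
    a mixture of R_k Gaussians sharing one covariance matrix."""
    p = len(x)
    Sinv = np.linalg.inv(Sigma)
    norm = (2 * np.pi) ** (p / 2) * np.linalg.det(Sigma) ** 0.5
    total = 0.0
    for pi_kr, mu_kr in zip(pis_k, mus_k):
        diff = x - mu_kr
        total += pi_kr * np.exp(-0.5 * diff @ Sinv @ diff) / norm
    return total
```

With a single component (R_k = 1) this reduces to the plain Gaussian density used by LDA; e.g., at x = μ with Σ = I in two dimensions it equals 1/(2π).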
Mixture of Gaussian models

A 2-class example. Class 1 is a mixture of 3 normals and class 2 a mixture of 2 normals. The variances for all the normals are 3.0.