Discriminant analysis and classification
ISyE 6416, Spring 2015, Yao Xie


Classification

Classification is a predictive task in which the response takes values across discrete categories (i.e., not continuous), and in the most fundamental case, two categories

Examples:

• Predicting whether a patient will develop breast cancer or remain healthy, given genetic information

• Predicting whether or not a user will like a new product, based on user covariates and a history of his/her previous ratings

• Predicting the region of Italy in which a brand of olive oil was made, based on its chemical composition

• Predicting the next elected president, based on various social, political, and historical measurements



Constructed from training data $(x_i, y_i)$, $i = 1, \ldots, n$, we denote our classification rule by $\hat f(x)$; given any $x \in \mathbb{R}^p$, this returns a class label $\hat f(x) \in \{1, \ldots, K\}$

As before, we will see that there are two different ways of assessing the quality of $\hat f$: its predictive ability and interpretative ability

E.g., train on $(x_i, y_i)$, $i = 1, \ldots, n$, the data of elected presidents and related feature measurements $x_i \in \mathbb{R}^p$ for the past $n$ elections, and predict, given the current feature measurements $x_0 \in \mathbb{R}^p$, the winner of the current election

In what situations would we care more about prediction error? And in what situations more about interpretation?



Linear discriminant analysis

Using the Bayes classifier is not realistic as it requires knowing the class conditional densities $P(X = x \mid C = j)$ and prior probabilities $\pi_j$. But if we estimate these quantities, then we can follow the same idea

Linear discriminant analysis (LDA) does this by assuming that the data within each class are normally distributed:

$h_j(x) = P(X = x \mid C = j) = N(\mu_j, \Sigma)$ density

We allow each class to have its own mean $\mu_j \in \mathbb{R}^p$, but we assume a common covariance matrix $\Sigma \in \mathbb{R}^{p \times p}$. Hence

$$h_j(x) = \frac{1}{(2\pi)^{p/2}\det(\Sigma)^{1/2}} \exp\Big\{ -\frac{1}{2}(x - \mu_j)^T \Sigma^{-1} (x - \mu_j) \Big\}$$

So we want to find $j$ so that $P(C = j \mid X = x) \propto h_j(x) \cdot \pi_j$ is the largest
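As a concrete illustration (not from the slides), here is a minimal NumPy/SciPy sketch that evaluates $h_j(x)\,\pi_j$ for each class, using made-up means, a shared covariance, and priors, and picks the class with the largest value:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical class parameters (illustrative values, not from the slides)
mus = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]   # class means mu_j
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])           # shared covariance Sigma
priors = np.array([0.6, 0.4])                        # prior probabilities pi_j

x = np.array([1.0, 0.5])                             # point to classify

# h_j(x) * pi_j for each class; the rule picks the largest
scores = [multivariate_normal(mean=mu, cov=Sigma).pdf(x) * pi
          for mu, pi in zip(mus, priors)]
print(np.argmax(scores) + 1)                          # predicted class label in {1, ..., K}
```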


Since $\log(\cdot)$ is a monotone function, we can consider maximizing $\log(h_j(x)\pi_j)$ over $j = 1, \ldots, K$. We can define the rule:

$$f^{\mathrm{LDA}}(x) = \underset{j=1,\ldots,K}{\arg\max}\; \log\big(h_j(x)\,\pi_j\big)
= \underset{j=1,\ldots,K}{\arg\max}\; \Big[-\tfrac{1}{2}(x-\mu_j)^T\Sigma^{-1}(x-\mu_j) + \log\pi_j\Big]
= \underset{j=1,\ldots,K}{\arg\max}\; \Big[x^T\Sigma^{-1}\mu_j - \tfrac{1}{2}\mu_j^T\Sigma^{-1}\mu_j + \log\pi_j\Big]
= \underset{j=1,\ldots,K}{\arg\max}\; \delta_j(x)$$

We call $\delta_j(x)$, $j = 1, \ldots, K$ the discriminant functions. Note

$$\delta_j(x) = x^T \Sigma^{-1} \mu_j - \frac{1}{2}\mu_j^T \Sigma^{-1} \mu_j + \log\pi_j$$

is just an affine function of $x$
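A small sketch of evaluating these discriminant functions numerically; the class means, covariance, and priors below are illustrative placeholders, not values from the lecture:

```python
import numpy as np

def lda_discriminants(x, mus, Sigma, priors):
    """Evaluate delta_j(x) = x^T Sigma^{-1} mu_j - 0.5 mu_j^T Sigma^{-1} mu_j + log pi_j."""
    Sigma_inv = np.linalg.inv(Sigma)
    return np.array([x @ Sigma_inv @ mu - 0.5 * mu @ Sigma_inv @ mu + np.log(pi)
                     for mu, pi in zip(mus, priors)])

# Toy parameters (illustrative only)
mus = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
Sigma = np.eye(2)
priors = [0.5, 0.5]
x = np.array([1.0, 0.5])
print(np.argmax(lda_discriminants(x, mus, Sigma, priors)) + 1)  # f_LDA(x)
```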


In practice, given an input $x \in \mathbb{R}^p$, can we just use the rule $f^{\mathrm{LDA}}$ on the previous slide? Not quite! What's missing: we don't know $\pi_j$, $\mu_j$, and $\Sigma$. Therefore we estimate them based on the training data $x_i \in \mathbb{R}^p$ and $y_i \in \{1, \ldots, K\}$, $i = 1, \ldots, n$, by:

• $\hat\pi_j = n_j/n$, the proportion of observations in class $j$

• $\hat\mu_j = \frac{1}{n_j}\sum_{y_i = j} x_i$, the centroid of class $j$

• $\hat\Sigma = \frac{1}{n-K}\sum_{j=1}^{K}\sum_{y_i = j}(x_i - \hat\mu_j)(x_i - \hat\mu_j)^T$, the pooled sample covariance matrix

(Here $n_j$ is the number of points in class $j$)

This gives the estimated discriminant functions:

$$\hat\delta_j(x) = x^T \hat\Sigma^{-1} \hat\mu_j - \frac{1}{2}\hat\mu_j^T \hat\Sigma^{-1} \hat\mu_j + \log\hat\pi_j$$

and finally the linear discriminant analysis rule,

$$\hat f^{\mathrm{LDA}}(x) = \underset{j=1,\ldots,K}{\arg\max}\; \hat\delta_j(x)$$
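A minimal NumPy sketch of the full recipe, assuming only that the training data arrive as an array `X` of shape $(n, p)$ and a label vector `y`; this is an illustration of the formulas above, not a reference implementation:

```python
import numpy as np

def fit_lda(X, y):
    """Estimate pi_j, mu_j and the pooled covariance Sigma from training data."""
    classes = np.unique(y)
    n, p = X.shape
    priors = np.array([np.mean(y == c) for c in classes])       # pi_hat_j = n_j / n
    mus = np.array([X[y == c].mean(axis=0) for c in classes])   # class centroids mu_hat_j
    Sigma = np.zeros((p, p))
    for c, mu in zip(classes, mus):                              # pooled sample covariance
        R = X[y == c] - mu
        Sigma += R.T @ R
    Sigma /= (n - len(classes))
    return classes, priors, mus, Sigma

def predict_lda(X, classes, priors, mus, Sigma):
    Sigma_inv = np.linalg.inv(Sigma)
    # delta_hat_j(x) for every row of X and every class j, stacked as an (n, K) array
    deltas = X @ Sigma_inv @ mus.T - 0.5 * np.sum(mus @ Sigma_inv * mus, axis=1) + np.log(priors)
    return classes[np.argmax(deltas, axis=1)]
```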


LDA decision boundaries

The estimated discriminant functions

$$\hat\delta_j(x) = x^T \hat\Sigma^{-1} \hat\mu_j - \frac{1}{2}\hat\mu_j^T \hat\Sigma^{-1} \hat\mu_j + \log\hat\pi_j = a_j + b_j^T x$$

are just affine functions of $x$. The decision boundary between classes $j, k$ is the set of all $x \in \mathbb{R}^p$ such that $\hat\delta_j(x) = \hat\delta_k(x)$, i.e.,

$$a_j + b_j^T x = a_k + b_k^T x$$

This defines an affine subspace in $x$:

$$a_j - a_k + (b_j - b_k)^T x = 0$$
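To see the affine boundary concretely, here is a short sketch (with made-up fitted quantities) that reads off $a_j$ and $b_j$ and checks which side of the class-1-versus-class-2 boundary a point falls on:

```python
import numpy as np

# Toy fitted quantities (illustrative values, not from the slides)
Sigma_inv = np.linalg.inv(np.array([[1.0, 0.2], [0.2, 1.0]]))
mus = {1: np.array([0.0, 0.0]), 2: np.array([2.0, 1.0])}
priors = {1: 0.5, 2: 0.5}

# Affine coefficients: delta_j(x) = a_j + b_j^T x
b = {j: Sigma_inv @ mu for j, mu in mus.items()}
a = {j: -0.5 * mus[j] @ Sigma_inv @ mus[j] + np.log(priors[j]) for j in mus}

# The boundary between classes 1 and 2 is where this expression equals zero;
# its sign tells us which side (which class) a point x falls on.
x = np.array([1.0, 0.8])
side = (a[1] - a[2]) + (b[1] - b[2]) @ x
print("class 1" if side > 0 else "class 2")
```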


Example: LDA decision boundaries

Example of decision boundaries from LDA (from ESL page 109):

[Figure: panels labeled $f^{\mathrm{LDA}}(x)$ and $\hat f^{\mathrm{LDA}}(x)$]

Are the decision boundaries the same as the perpendicular bisectors (Voronoi boundaries) between the class centroids? (Why not?)


The Iris flower data set or Fisher's Iris data set is a multivariate data set introduced by Sir Ronald Fisher (1936) as an example of discriminant analysis.[1]

The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimetres. Based on the combination of these four features, Fisher developed a linear discriminant model to distinguish the species from each other.
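For reference, a minimal example (not part of the original slides) that fits LDA to Fisher's Iris data using scikit-learn's `LinearDiscriminantAnalysis`:

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Fisher's Iris data: 150 samples, 4 features, 3 species
X, y = load_iris(return_X_y=True)

# Fit LDA and report training accuracy (a rough sanity check, not a proper evaluation)
lda = LinearDiscriminantAnalysis()
lda.fit(X, y)
print(lda.score(X, y))
```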


LDA computations, usages, extensions

The decision boundaries for LDA are useful for graphical purposes, but to classify a new point $x_0 \in \mathbb{R}^p$ we don't use them; we simply compute $\hat\delta_j(x_0)$ for each $j = 1, \ldots, K$

LDA performs quite well on a wide variety of data sets, even when pitted against fancy alternative classification schemes. Though it assumes normality, its simplicity often works in its favor. (Why? Think of the bias-variance tradeoff)

Still, there are some useful extensions of LDA. E.g.,

• Quadratic discriminant analysis: using the same normal model, we now allow each class $j$ to have its own covariance matrix $\Sigma_j$. This leads to quadratic decision boundaries

• Reduced-rank linear discriminant analysis: we essentially project the data to a lower dimensional subspace before performing LDA. We will study this next time


LDA computations and sphering

Note that LDA equivalently minimizes over $j = 1, \ldots, K$,

$$\frac{1}{2}(x - \hat\mu_j)^T \hat\Sigma^{-1}(x - \hat\mu_j) - \log\hat\pi_j$$

It helps to factorize $\hat\Sigma$ (i.e., compute its eigendecomposition):

$$\hat\Sigma = U D U^T$$

where $U \in \mathbb{R}^{p \times p}$ has orthonormal columns (and rows), and $D = \mathrm{diag}(d_1, \ldots, d_p)$ with $d_j \geq 0$ for each $j$. Then we have $\hat\Sigma^{-1} = U D^{-1} U^T$, and

$$(x - \hat\mu_j)^T \hat\Sigma^{-1}(x - \hat\mu_j) = \big\| \underbrace{D^{-1/2}U^T x}_{\tilde x} - \underbrace{D^{-1/2}U^T \hat\mu_j}_{\tilde\mu_j} \big\|_2^2$$

This is just the squared distance between $\tilde x$ and $\tilde\mu_j$
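A quick numerical check of this identity, using a randomly generated positive definite $\hat\Sigma$ (purely illustrative): the Mahalanobis distance equals the squared Euclidean distance after applying $D^{-1/2}U^T$.

```python
import numpy as np

rng = np.random.default_rng(0)

# A synthetic covariance estimate and two points (illustrative only)
A = rng.normal(size=(3, 3))
Sigma_hat = A @ A.T + np.eye(3)          # symmetric positive definite
x, mu = rng.normal(size=3), rng.normal(size=3)

# Eigendecomposition Sigma_hat = U D U^T
d, U = np.linalg.eigh(Sigma_hat)
W = np.diag(d ** -0.5) @ U.T             # the sphering map D^{-1/2} U^T

maha = (x - mu) @ np.linalg.inv(Sigma_hat) @ (x - mu)
eucl = np.sum((W @ x - W @ mu) ** 2)
print(np.allclose(maha, eucl))           # True: the two distances agree
```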


Hence the LDA procedure can be described as:

1. Compute the sample estimates $\hat\pi_j$, $\hat\mu_j$, $\hat\Sigma$

2. Factor $\hat\Sigma$, as in $\hat\Sigma = U D U^T$

3. Transform the class centroids $\tilde\mu_j = D^{-1/2} U^T \hat\mu_j$

4. Given any point $x \in \mathbb{R}^p$, transform to $\tilde x = D^{-1/2} U^T x \in \mathbb{R}^p$, and then classify according to the nearest centroid in the transformed space, adjusting for class proportions; this is the class $j$ for which $\frac{1}{2}\|\tilde x - \tilde\mu_j\|_2^2 - \log\hat\pi_j$ is smallest

What is this transformation doing? Think about applying it to the observations:

$$\tilde x_i = D^{-1/2} U^T x_i, \quad i = 1, \ldots, n$$

This is basically sphering the data points, because if we think of $x \in \mathbb{R}^p$ as a random variable with covariance matrix $\hat\Sigma$, then

$$\mathrm{Cov}(D^{-1/2} U^T x) = D^{-1/2} U^T \hat\Sigma\, U D^{-1/2} = I$$
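A compact sketch of the four steps as code; `lda_by_sphering` is a hypothetical helper written for illustration, assuming training data `X`, labels `y`, and a query point `x0`:

```python
import numpy as np

def lda_by_sphering(X, y, x0):
    """Classify x0 via the 4-step sphering procedure (a sketch of the recipe above)."""
    classes = np.unique(y)
    n, p = X.shape
    # Step 1: sample estimates pi_hat_j, mu_hat_j, Sigma_hat (pooled)
    priors = np.array([np.mean(y == c) for c in classes])
    mus = np.array([X[y == c].mean(axis=0) for c in classes])
    Sigma = sum((X[y == c] - m).T @ (X[y == c] - m) for c, m in zip(classes, mus)) / (n - len(classes))
    # Step 2: factor Sigma_hat = U D U^T
    d, U = np.linalg.eigh(Sigma)
    W = np.diag(d ** -0.5) @ U.T
    # Step 3: transform the centroids; Step 4: transform x0 and pick the nearest (prior-adjusted) one
    mus_t = mus @ W.T
    x0_t = W @ x0
    scores = 0.5 * np.sum((x0_t - mus_t) ** 2, axis=1) - np.log(priors)
    return classes[np.argmin(scores)]
```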


Generally speaking, logistic regression is more flexible because it doesn't assume anything about the distribution of $X$. LDA assumes that $X$ is normally distributed within each class, so that its marginal distribution is a mixture of normal distributions:

$$X \sim \sum_{j=1}^{K} \pi_j N(\mu_j, \Sigma)$$

This means that logistic regression is more robust to situations in which the class conditional densities are not normal (and to outliers)

On the other hand, if the true class conditional densities are normal, or close to it, LDA will be more efficient, meaning that for logistic regression to perform comparably it will need more data

In practice they tend to perform similarly in a variety of situations (as claimed by the ESL book on page 128)


• LDA is not necessarily bad when assumptions about the density functions are violated

• In certain cases, LDA has poor results

[Figure] LDA applied to simulated data sets. Left: the true within-class densities are Gaussian with identical covariance matrices across classes. Right: the true within-class densities are mixtures of two Gaussians. (Source: Jia Li, http://www.stat.psu.edu/~jiali)


Quadratic discriminant analysis


• Estimate the covariance matrix $\Sigma_k$ separately for each class $k$, $k = 1, 2, \ldots, K$.

• Quadratic discriminant function:

$$\delta_k(x) = -\frac{1}{2}\log|\Sigma_k| - \frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1}(x - \mu_k) + \log\pi_k.$$

• Classification rule:

$$G(x) = \underset{k}{\arg\max}\; \delta_k(x).$$

• Decision boundaries are quadratic equations in $x$.

• QDA fits the data better than LDA, but has more parameters to estimate.
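A short sketch of the QDA discriminant function and rule, with made-up class parameters for illustration:

```python
import numpy as np

def qda_discriminant(x, mu_k, Sigma_k, pi_k):
    """delta_k(x) = -0.5 log|Sigma_k| - 0.5 (x-mu_k)^T Sigma_k^{-1} (x-mu_k) + log pi_k."""
    sign, logdet = np.linalg.slogdet(Sigma_k)
    diff = x - mu_k
    return -0.5 * logdet - 0.5 * diff @ np.linalg.inv(Sigma_k) @ diff + np.log(pi_k)

# Toy two-class example with different covariances (illustrative values only)
params = [
    (np.array([0.0, 0.0]), np.array([[1.0, 0.0], [0.0, 1.0]]), 0.5),
    (np.array([2.0, 1.0]), np.array([[2.0, 0.5], [0.5, 0.5]]), 0.5),
]
x = np.array([1.0, 0.6])
scores = [qda_discriminant(x, mu, S, pi) for mu, S, pi in params]
print(np.argmax(scores) + 1)  # G(x): class with the largest discriminant
```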



Violation of assumption

Left: decision boundaries by LDA. Right: decision boundaries obtained by modeling each class by a mixture of two Gaussians.


Mixture discriminant analysis

• A method for classification (supervised) based on mixture models.

• Extension of linear discriminant analysis.

• The mixture of normals is used to obtain a density estimate for each class.

• A single Gaussian to model a class, as in LDA, is too restricted.

• Extend to a mixture of Gaussians. For class $k$, the within-class density is:

$$f_k(x) = \sum_{r=1}^{R_k} \pi_{kr}\,\phi(x \mid \mu_{kr}, \Sigma)$$

• A common covariance matrix is still assumed.
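A small sketch (illustrative parameters only) of evaluating this within-class mixture density with SciPy:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical class-k mixture: R_k = 2 components, shared covariance Sigma
Sigma = np.array([[1.0, 0.2], [0.2, 1.0]])
mus_k = [np.array([0.0, 0.0]), np.array([3.0, 1.0])]   # component means mu_kr
weights_k = [0.7, 0.3]                                 # mixing proportions pi_kr

def f_k(x):
    """Within-class density f_k(x) = sum_r pi_kr * phi(x | mu_kr, Sigma)."""
    return sum(w * multivariate_normal(mean=m, cov=Sigma).pdf(x)
               for w, m in zip(weights_k, mus_k))

print(f_k(np.array([1.0, 0.5])))
```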


Mixture of Gaussian models

A two-class example. Class 1 is a mixture of 3 normals and class 2 a mixture of 2 normals. The variances for all the normals are 3.0.