Discriminant analysis and classification
ISyE 6416, Spring 2015, Yao Xie
Classification
Classification is a predictive task in which the response takes values across discrete categories (i.e., not continuous), and in the most fundamental case, two categories.
Examples:

- Predicting whether a patient will develop breast cancer or remain healthy, given genetic information
- Predicting whether or not a user will like a new product, based on user covariates and a history of his/her previous ratings
- Predicting the region of Italy in which a brand of olive oil was made, based on its chemical composition
- Predicting the next elected president, based on various social, political, and historical measurements
Constructed from training data (x_i, y_i), i = 1, ..., n, we denote our classification rule by f̂(x); given any x ∈ ℝ^p, this returns a class label f̂(x) ∈ {1, ..., K}.

As before, we will see that there are two different ways of assessing the quality of f̂: its predictive ability and interpretative ability.

E.g., train on (x_i, y_i), i = 1, ..., n, the data of elected presidents and related feature measurements x_i ∈ ℝ^p for the past n elections, and predict, given the current feature measurements x_0 ∈ ℝ^p, the winner of the current election.

In what situations would we care more about prediction error? And in what situations more about interpretation?
Linear discriminant analysis
Using the Bayes classifier is not realistic, as it requires knowing the class conditional densities P(X = x | C = j) and the prior probabilities π_j. But if we estimate these quantities, then we can follow the same idea.

Linear discriminant analysis (LDA) does this by assuming that the data within each class are normally distributed:

h_j(x) = P(X = x | C = j) = N(μ_j, Σ) density

We allow each class to have its own mean μ_j ∈ ℝ^p, but we assume a common covariance matrix Σ ∈ ℝ^{p×p}. Hence
h_j(x) = 1 / ((2π)^{p/2} det(Σ)^{1/2}) · exp{ −(1/2)(x − μ_j)^T Σ^{−1} (x − μ_j) }
So, by Bayes' rule, we want to find j so that P(C = j | X = x) ∝ h_j(x) · π_j is largest.
Since log(·) is a monotone function, we can consider maximizing log(h_j(x) π_j) over j = 1, ..., K. We can define the rule:

f^LDA(x) = argmax_{j=1,...,K} log(h_j(x) π_j)
         = argmax_{j=1,...,K} [ −(1/2)(x − μ_j)^T Σ^{−1} (x − μ_j) + log π_j ]
         = argmax_{j=1,...,K} [ x^T Σ^{−1} μ_j − (1/2) μ_j^T Σ^{−1} μ_j + log π_j ]
         = argmax_{j=1,...,K} δ_j(x)

(at each step we drop additive terms that do not depend on j). We call δ_j(x), j = 1, ..., K the discriminant functions. Note

δ_j(x) = x^T Σ^{−1} μ_j − (1/2) μ_j^T Σ^{−1} μ_j + log π_j

is just an affine function of x.
In practice, given an input x ∈ ℝ^p, can we just use the rule f^LDA on the previous slide? Not quite! What's missing: we don't know π_j, μ_j, and Σ. Therefore we estimate them based on the training data x_i ∈ ℝ^p and y_i ∈ {1, ..., K}, i = 1, ..., n, by:

- π̂_j = n_j / n, the proportion of observations in class j
- μ̂_j = (1/n_j) ∑_{y_i = j} x_i, the centroid of class j
- Σ̂ = 1/(n − K) ∑_{j=1}^{K} ∑_{y_i = j} (x_i − μ̂_j)(x_i − μ̂_j)^T, the pooled sample covariance matrix

(Here n_j is the number of points in class j.)

This gives the estimated discriminant functions:

δ̂_j(x) = x^T Σ̂^{−1} μ̂_j − (1/2) μ̂_j^T Σ̂^{−1} μ̂_j + log π̂_j

and finally the linear discriminant analysis rule,

f̂^LDA(x) = argmax_{j=1,...,K} δ̂_j(x)
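The estimates above fit in a few lines of numpy. This is a minimal sketch, not a reference implementation; the names `lda_fit` and `lda_predict` are illustrative, and classes are coded 0, ..., K−1.

```python
import numpy as np

def lda_fit(X, y, K):
    """Estimate the LDA parameters pi_hat, mu_hat, Sigma_hat.

    X: (n, p) feature matrix; y: (n,) integer labels in {0, ..., K-1}.
    """
    n, p = X.shape
    priors = np.array([np.mean(y == j) for j in range(K)])      # pi_hat_j = n_j / n
    mus = np.array([X[y == j].mean(axis=0) for j in range(K)])  # class centroids
    # Pooled covariance: within-class scatter summed over classes, divided by n - K
    Sigma = sum((X[y == j] - mus[j]).T @ (X[y == j] - mus[j])
                for j in range(K)) / (n - K)
    return priors, mus, Sigma

def lda_predict(x, priors, mus, Sigma):
    """Return argmax_j of the estimated discriminant delta_hat_j(x)."""
    Sinv = np.linalg.inv(Sigma)
    deltas = [x @ Sinv @ m - 0.5 * m @ Sinv @ m + np.log(pi)
              for pi, m in zip(priors, mus)]
    return int(np.argmax(deltas))
```

On two well-separated Gaussian clouds, each class centroid is classified into its own class.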
LDA decision boundaries

The estimated discriminant functions

δ̂_j(x) = x^T Σ̂^{−1} μ̂_j − (1/2) μ̂_j^T Σ̂^{−1} μ̂_j + log π̂_j = a_j + b_j^T x

are just affine functions of x. The decision boundary between classes j, k is the set of all x ∈ ℝ^p such that δ̂_j(x) = δ̂_k(x), i.e.,

a_j + b_j^T x = a_k + b_k^T x

This defines an affine subspace in x:

a_j − a_k + (b_j − b_k)^T x = 0
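The boundary coefficients follow directly from the estimated parameters. A sketch (the helper name `lda_boundary` is hypothetical) that reads off a_j = −(1/2) μ_j^T Σ^{−1} μ_j + log π_j and b_j = Σ^{−1} μ_j:

```python
import numpy as np

def lda_boundary(mu_j, mu_k, Sigma, pi_j, pi_k):
    """Return (c, w) for the boundary {x : c + w^T x = 0},
    where c = a_j - a_k and w = b_j - b_k from the affine discriminants."""
    Sinv = np.linalg.inv(Sigma)
    a_j = -0.5 * mu_j @ Sinv @ mu_j + np.log(pi_j)
    a_k = -0.5 * mu_k @ Sinv @ mu_k + np.log(pi_k)
    return a_j - a_k, Sinv @ (mu_j - mu_k)
```

With identity covariance and equal priors, centroids at the origin and (2, 0) give the vertical bisector x_1 = 1; unequal priors shift the boundary toward the less likely class.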
Example: LDA decision boundaries
Example of decision boundaries from LDA (from ESL page 109):

[Figure omitted: decision boundaries of the rule f^LDA(x) and the estimated rule f̂^LDA(x)]

Are the decision boundaries the same as the perpendicular bisectors (Voronoi boundaries) between the class centroids? (Why not?)
The Iris flower data set or Fisher's Iris data set is a multivariate data set introduced by Sir Ronald Fisher (1936) as an example of discriminant analysis.[1]

The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimetres. Based on the combination of these four features, Fisher developed a linear discriminant model to distinguish the species from each other.
LDA computations, usages, extensions

The decision boundaries for LDA are useful for graphical purposes, but to classify a new point x_0 ∈ ℝ^p we don't use them; we simply compute δ̂_j(x_0) for each j = 1, ..., K.

LDA performs quite well on a wide variety of data sets, even when pitted against fancy alternative classification schemes. Though it assumes normality, its simplicity often works in its favor. (Why? Think of the bias-variance tradeoff.)

Still, there are some useful extensions of LDA. E.g.,

- Quadratic discriminant analysis: using the same normal model, we now allow each class j to have its own covariance matrix Σ_j. This leads to quadratic decision boundaries
- Reduced-rank linear discriminant analysis: we essentially project the data to a lower dimensional subspace before performing LDA. We will study this next time
LDA computations and sphering

Note that LDA equivalently minimizes over j = 1, ..., K,

(1/2)(x − μ̂_j)^T Σ̂^{−1} (x − μ̂_j) − log π̂_j

It helps to factorize Σ̂ (i.e., compute its eigendecomposition):

Σ̂ = U D U^T

where U ∈ ℝ^{p×p} has orthonormal columns (and rows), and D = diag(d_1, ..., d_p) with d_j ≥ 0 for each j. Then we have Σ̂^{−1} = U D^{−1} U^T, and

(x − μ̂_j)^T Σ̂^{−1} (x − μ̂_j) = ‖ D^{−1/2} U^T x − D^{−1/2} U^T μ̂_j ‖₂² = ‖ x̃ − μ̃_j ‖₂²

where x̃ = D^{−1/2} U^T x and μ̃_j = D^{−1/2} U^T μ̂_j. This is just the squared distance between x̃ and μ̃_j.
Hence the LDA procedure can be described as:

1. Compute the sample estimates π̂_j, μ̂_j, Σ̂
2. Factor Σ̂, as in Σ̂ = U D U^T
3. Transform the class centroids: μ̃_j = D^{−1/2} U^T μ̂_j
4. Given any point x ∈ ℝ^p, transform to x̃ = D^{−1/2} U^T x ∈ ℝ^p, and then classify according to the nearest centroid in the transformed space, adjusting for class proportions: this is the class j for which (1/2)‖x̃ − μ̃_j‖₂² − log π̂_j is smallest

What is this transformation doing? Think about applying it to the observations:

x̃_i = D^{−1/2} U^T x_i, i = 1, ..., n

This is basically sphering the data points, because if we think of x ∈ ℝ^p as a random variable with covariance matrix Σ̂, then

Cov(D^{−1/2} U^T x) = D^{−1/2} U^T Σ̂ U D^{−1/2} = I
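The sphering identity can be checked numerically. A minimal sketch, assuming a synthetic positive-definite matrix standing in for Σ̂:

```python
import numpy as np

# Sphering: factor Sigma_hat = U D U^T and map x -> D^{-1/2} U^T x.
# After this transform the covariance is the identity, so LDA reduces
# to nearest-centroid classification adjusted by log-priors.
rng = np.random.default_rng(1)
A = rng.normal(size=(3, 3))
Sigma = A @ A.T + 3 * np.eye(3)   # a positive-definite stand-in for Sigma_hat

d, U = np.linalg.eigh(Sigma)      # eigendecomposition: Sigma = U diag(d) U^T
W = np.diag(d ** -0.5) @ U.T      # sphering matrix D^{-1/2} U^T

I = W @ Sigma @ W.T               # D^{-1/2} U^T Sigma U D^{-1/2} = identity
```

Here `np.linalg.eigh` is used because Σ̂ is symmetric, so its eigendecomposition is real with orthonormal U.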
Generally speaking, logistic regression is more flexible because it doesn't assume anything about the distribution of X. LDA assumes that X is normally distributed within each class, so that its marginal distribution is a mixture of normal distributions:

X ~ ∑_{j=1}^{K} π_j N(μ_j, Σ)

This means that logistic regression is more robust to situations in which the class conditional densities are not normal (and to outliers).

On the other side, if the true class conditional densities are normal, or close to it, LDA will be more efficient, meaning that for logistic regression to perform comparably it will need more data.

In practice they tend to perform similarly in a variety of situations (as claimed by the ESL book on page 128).
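The mixture marginal above can be simulated by first drawing a class from the priors, then drawing from that class's normal. A small numpy sketch with made-up parameters (two classes, priors 0.3 and 0.7):

```python
import numpy as np

# Under the LDA model the marginal of X is sum_j pi_j N(mu_j, Sigma):
rng = np.random.default_rng(0)
pis = np.array([0.3, 0.7])
mus = np.array([[0.0, 0.0], [5.0, 5.0]])
Sigma = np.eye(2)

labels = rng.choice(len(pis), size=2000, p=pis)              # class draws
X = np.array([rng.multivariate_normal(mus[c], Sigma) for c in labels])

# The mixture mean is sum_j pi_j mu_j = 0.7 * [5, 5] = [3.5, 3.5]
```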
- LDA is not necessarily bad when assumptions about the density functions are violated
- In certain cases, LDA has poor results

LDA applied to simulated data sets. Left: The true within-class densities are Gaussian with identical covariance matrices across classes. Right: The true within-class densities are mixtures of two Gaussians.
Jia Li, http://www.stat.psu.edu/~jiali
Quadratic discriminant analysis
Quadratic Discriminant Analysis (QDA)

- Estimate the covariance matrix Σ_k separately for each class k, k = 1, 2, ..., K.
- Quadratic discriminant function:

  δ_k(x) = −(1/2) log|Σ_k| − (1/2)(x − μ_k)^T Σ_k^{−1} (x − μ_k) + log π_k.

- Classification rule:

  G(x) = argmax_k δ_k(x).

- Decision boundaries are quadratic equations in x.
- QDA fits the data better than LDA, but has more parameters to estimate.
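A sketch of the quadratic discriminant function above (the function name `qda_delta` is illustrative; `slogdet` computes log|Σ_k| stably):

```python
import numpy as np

def qda_delta(x, mu_k, Sigma_k, pi_k):
    """Quadratic discriminant delta_k(x) with a class-specific covariance."""
    _, logdet = np.linalg.slogdet(Sigma_k)        # log|Sigma_k|
    diff = x - mu_k
    return (-0.5 * logdet
            - 0.5 * diff @ np.linalg.inv(Sigma_k) @ diff
            + np.log(pi_k))
```

When every Σ_k is the same and priors are equal, comparing these values reduces to nearest-centroid classification, recovering the LDA behavior.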
Violation of assumption

Left: Decision boundaries by LDA. Right: Decision boundaries obtained by modeling each class by a mixture of two Gaussians.
Mixture discriminant analysis
Mixture Discriminant Analysis

- A method for classification (supervised) based on mixture models.
- Extension of linear discriminant analysis.
- The mixture of normals is used to obtain a density estimate for each class.
Mixture Discriminant Analysis

- A single Gaussian to model a class, as in LDA, is too restricted.
- Extend to a mixture of Gaussians. For class k, the within-class density is:

  f_k(x) = ∑_{r=1}^{R_k} π_{kr} φ(x | μ_{kr}, Σ)

- A common covariance matrix is still assumed.
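The within-class mixture density f_k can be evaluated directly from its definition. A sketch with an illustrative name, where φ is the multivariate normal density with shared covariance Σ:

```python
import numpy as np

def mda_density(x, pis_k, mus_k, Sigma):
    """f_k(x) = sum_r pi_kr * phi(x | mu_kr, Sigma):
    a mixture of R_k Gaussians sharing one covariance matrix."""
    p = len(x)
    Sinv = np.linalg.inv(Sigma)
    norm = (2 * np.pi) ** (p / 2) * np.linalg.det(Sigma) ** 0.5
    total = 0.0
    for pi_kr, mu_kr in zip(pis_k, mus_k):
        diff = x - mu_kr
        total += pi_kr * np.exp(-0.5 * diff @ Sinv @ diff) / norm
    return total
```

With a single component (R_k = 1) this reduces to the plain Gaussian density used by LDA; e.g., at x = μ with Σ = I in two dimensions it equals 1/(2π).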
Mixture of Gaussian models

A 2-class example. Class 1 is a mixture of 3 normals and class 2 a mixture of 2 normals. The variances for all the normals are 3.0.