
Linear Discriminant Analysis and Its Generalization
Chapters 4 and 12 of The Elements of Statistical Learning

Presented by Ilsang Ohn

Department of Statistics, Seoul National University

September 3, 2014


Contents

1 Linear Discriminant Analysis

2 Flexible Discriminant Analysis

3 Penalized Discriminant Analysis

4 Mixture Discriminant Analysis


Review of Linear Discriminant Analysis


LDA: Overview

• Linear discriminant analysis (LDA) does classification by assuming that the data within each class are normally distributed:

$$f_k(x) = P(X = x \mid G = k) = N(\mu_k, \Sigma).$$

• We allow each class to have its own mean $\mu_k \in \mathbb{R}^p$, but we assume a common variance matrix $\Sigma \in \mathbb{R}^{p \times p}$. Thus

$$f_k(x) = \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}} \exp\left\{-\frac{1}{2}(x - \mu_k)^T \Sigma^{-1} (x - \mu_k)\right\}.$$

• We want to find $k$ so that $P(G = k \mid X = x) \propto f_k(x)\pi_k$ is the largest.


LDA: Overview

• The linear discriminant functions are derived from the relation

$$\begin{aligned}
\log(f_k(x)\pi_k) &= -\frac{1}{2}(x - \mu_k)^T \Sigma^{-1} (x - \mu_k) + \log(\pi_k) + C \\
&= x^T \Sigma^{-1} \mu_k - \frac{1}{2}\mu_k^T \Sigma^{-1} \mu_k + \log(\pi_k) + C',
\end{aligned}$$

and we denote

$$\delta_k(x) = x^T \Sigma^{-1} \mu_k - \frac{1}{2}\mu_k^T \Sigma^{-1} \mu_k + \log(\pi_k).$$

• The decision rule is $G(x) = \arg\max_k \delta_k(x)$.

• The Bayes classifier is a linear classifier.
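
A minimal numerical sketch of evaluating these discriminant functions, assuming the class means, the common covariance and the priors are already known; the variable names (means, cov, priors) are illustrative and not from the slides.

```python
import numpy as np

def lda_discriminants(x, means, cov, priors):
    """delta_k(x) = x^T Sigma^{-1} mu_k - (1/2) mu_k^T Sigma^{-1} mu_k + log pi_k."""
    cov_inv = np.linalg.inv(cov)
    return np.array([x @ cov_inv @ mu - 0.5 * mu @ cov_inv @ mu + np.log(pi)
                     for mu, pi in zip(means, priors)])

# Toy example with two classes in R^2: classify x to argmax_k delta_k(x).
means = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
cov = np.array([[1.0, 0.3], [0.3, 1.0]])
priors = [0.6, 0.4]
x = np.array([1.5, 0.5])
print(np.argmax(lda_discriminants(x, means, cov, priors)))
```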


LDA: Overview

• We need to estimate the parameters based on the training data $x_i \in \mathbb{R}^p$ and $y_i \in \{1, \dots, K\}$ by

  - $\hat\pi_k = N_k / N$
  - $\hat\mu_k = N_k^{-1} \sum_{y_i = k} x_i$, the centroid of class $k$
  - $\hat\Sigma = \frac{1}{N - K} \sum_{k=1}^{K} \sum_{y_i = k} (x_i - \hat\mu_k)(x_i - \hat\mu_k)^T$, the pooled sample variance matrix

• The decision boundary between each pair of classes $k$ and $l$ is given by

$$\{x : \delta_k(x) = \delta_l(x)\},$$

which is equivalent to

$$(\hat\mu_k - \hat\mu_l)^T \hat\Sigma^{-1} x = \frac{1}{2}(\hat\mu_k + \hat\mu_l)^T \hat\Sigma^{-1} (\hat\mu_k - \hat\mu_l) - \log(\hat\pi_k / \hat\pi_l).$$
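
A small sketch of these plug-in estimates, assuming X is an N x p numpy array and y an integer label array with values 0, ..., K-1; the function and variable names are illustrative.

```python
import numpy as np

def lda_fit(X, y):
    """Class proportions, centroids and the pooled within-class covariance."""
    y = np.asarray(y)
    N, p = X.shape
    classes = np.unique(y)
    K = len(classes)
    priors = np.array([np.mean(y == k) for k in classes])        # pi_hat_k = N_k / N
    means = np.array([X[y == k].mean(axis=0) for k in classes])  # centroids mu_hat_k
    Sigma = np.zeros((p, p))
    for k, mu in zip(classes, means):
        R = X[y == k] - mu
        Sigma += R.T @ R
    Sigma /= (N - K)                                             # pooled sample variance matrix
    return priors, means, Sigma
```

With these estimates the discriminant functions of the previous slide can be evaluated directly, and the boundary between classes $k$ and $l$ is the hyperplane with normal $\hat\Sigma^{-1}(\hat\mu_k - \hat\mu_l)$.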


Fisher’s discriminant analysis

• Fisher's idea is to find a direction $v$ such that

$$\max_v \; \frac{v^T B v}{v^T W v},$$

where

  - $B = \sum_{k=1}^{K} (\bar x_k - \bar x)(\bar x_k - \bar x)^T$: between-class covariance matrix
  - $W = \sum_{k=1}^{K} \sum_{y_i = k} (x_i - \bar x_k)(x_i - \bar x_k)^T$: within-class covariance matrix, previously denoted by $(N - K)\hat\Sigma$

• This ratio is maximized by $v_1 = e_1$, the eigenvector of $W^{-1}B$ with the largest eigenvalue. The linear combination $v_1^T X$ is called the first discriminant. Similarly, one can find the next direction $v_2$ orthogonal in $W$ to $v_1$.

• Fisher's canonical discriminant analysis finds $L \le K - 1$ canonical coordinates (or a rank-$L$ subspace) that best separate the categories.
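
A sketch of computing Fisher's directions as the leading generalized eigenvectors of $(B, W)$, i.e. the top eigenvectors of $W^{-1}B$. The between-class matrix below follows the slide's unweighted form, W is assumed positive definite, and the names (X, y, n_components) are illustrative.

```python
import numpy as np
from scipy.linalg import eigh

def fisher_directions(X, y, n_components):
    y = np.asarray(y)
    classes = np.unique(y)
    p = X.shape[1]
    grand_mean = X.mean(axis=0)
    B = np.zeros((p, p))
    W = np.zeros((p, p))
    for k in classes:
        Xk = X[y == k]
        dk = Xk.mean(axis=0) - grand_mean
        B += np.outer(dk, dk)          # between-class covariance (unweighted, as on the slide)
        R = Xk - Xk.mean(axis=0)
        W += R.T @ R                   # within-class covariance, (N - K) * Sigma_hat
    evals, evecs = eigh(B, W)          # generalized symmetric eigenproblem B v = lambda W v
    order = np.argsort(evals)[::-1]
    return evecs[:, order[:n_components]]   # columns v_1, ..., v_L
```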


Fisher’s discriminant analysis

• Consequently, we have $v_1, \dots, v_L$, $L \le K - 1$, the eigenvectors of $W^{-1}B$ with non-zero eigenvalues.

• Fisher's discriminant rule assigns $x$ to the class closest in Mahalanobis distance, so the rule is given by

$$\begin{aligned}
G'(x) &= \arg\min_k \sum_{l=1}^{L} [v_l^T (x - \bar x_k)]^2 \\
&= \arg\min_k (x - \bar x_k)^T \hat\Sigma^{-1} (x - \bar x_k) \\
&= \arg\min_k \left(-2\delta_k(x) + x^T \hat\Sigma^{-1} x + 2\log \hat\pi_k\right) \\
&= \arg\max_k \left(\delta_k(x) - \log \hat\pi_k\right).
\end{aligned}$$

• Thus Fisher's rule is equivalent to the Gaussian classification rule with equal prior probabilities.


LDA by optimal scoring

• The standard way of carrying out a (Fisher's) canonical discriminant analysis is by way of a suitable SVD.

• There is a somewhat different approach: optimal scoring.

• This method performs LDA using linear regression on derived responses.


LDA by optimal scoring

• Recall $\mathcal{G} = \{1, \dots, K\}$.

• $\theta : \mathcal{G} \to \mathbb{R}$ is a function that assigns scores to the classes such that the transformed class labels are optimally predicted by linear regression on $X$.

• We find $L \le K - 1$ sets of independent scorings for the class labels, $\theta_1, \dots, \theta_L$, and $L$ corresponding linear maps $\eta_l(X) = X^T \beta_l$ chosen to be optimal for multiple regression in $\mathbb{R}^p$.

• $\theta_l$ and $\beta_l$ are chosen to minimize

$$\mathrm{ASR} = \frac{1}{N} \sum_{l=1}^{L} \left[ \sum_{i=1}^{N} \left(\theta_l(g_i) - x_i^T \beta_l\right)^2 \right].$$


LDA by optimal scoring

Notation

• $Y$: $N \times K$ indicator matrix

• $P_X = X(X^T X)^{-1} X^T$: projection matrix onto the column space of the predictors

• $\Theta$: $K \times L$ matrix of $L$ score vectors for the $K$ classes

• $\Theta^* = Y\Theta$: $N \times L$ matrix with $\Theta^*_{ij} = \theta_j(g_i)$


LDA by optimal scoring

Problem

• Minimize ASR by regressing $\Theta^*$ on $X$; that is, find $\Theta$ that minimizes

$$\mathrm{ASR}(\Theta) = \mathrm{tr}\left(\Theta^{*T} (I - P_X) \Theta^*\right)/N = \mathrm{tr}\left(\Theta^T Y^T (I - P_X) Y \Theta\right)/N.$$

• $\mathrm{ASR}(\Theta)$ is minimized by taking $\Theta$ to be the $L$ largest eigenvectors of $Y^T P_X Y$, with normalization $\Theta^T D_p \Theta = I_L$.

• Here $D_p = Y^T Y / N$ is a diagonal matrix of the sample class proportions $N_j / N$.


LDA by optimal scoring

Way to the solution

1. Initialize: Form the $N \times K$ indicator matrix $Y$.

2. Multivariate regression: Set $\hat Y = P_X Y$ and denote the $p \times K$ coefficient matrix by $B$: $\hat Y = XB$.

3. Optimal scores: Obtain the eigenvector matrix $\Theta$ of $Y^T \hat Y = Y^T P_X Y$ with normalization $\Theta^T D_p \Theta = I$.

4. Update: Update the coefficient matrix in step 2 to reflect the optimal scores: $B \leftarrow B\Theta$. The final optimally scaled regression fit is the $(K-1)$-vector function $\eta(x) = B^T x$.
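
A compact sketch of these four steps, assuming X is an N x p numpy array with full column rank, y an integer label array, and no intercept handling; the trivial constant score is not removed, and all names are illustrative rather than a reference implementation.

```python
import numpy as np
from scipy.linalg import eigh

def lda_optimal_scoring(X, y, L=None):
    y = np.asarray(y)
    N, p = X.shape
    classes = np.unique(y)
    K = len(classes)
    L = K - 1 if L is None else L
    # 1. indicator response matrix Y (N x K)
    Y = (y[:, None] == classes[None, :]).astype(float)
    # 2. multivariate regression: B = (X^T X)^{-1} X^T Y, Yhat = X B = P_X Y
    B = np.linalg.solve(X.T @ X, X.T @ Y)
    Yhat = X @ B
    # 3. optimal scores: generalized eigenvectors of Y^T P_X Y w.r.t. D_p,
    #    normalized so that Theta^T D_p Theta = I
    Dp = np.diag(Y.mean(axis=0))
    M = (Y.T @ Yhat) / N
    evals, Theta = eigh(M, Dp)
    order = np.argsort(evals)[::-1]
    Theta = Theta[:, order[:L]]
    # 4. update the coefficients: eta(x) = (B Theta)^T x
    B = B @ Theta
    return B, Theta
```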


LDA by optimal scoring

• The sequence of discriminant vectors $\nu_l$ in LDA is identical to the sequence $\beta_l$ up to a constant.

• That is, the coefficient matrix $B$ is, up to a diagonal scale matrix, the same as the discriminant analysis coefficient matrix,

$$V^T x = D B^T x = D\eta(x),$$

where $D_{ll} = 1/[\alpha_l^2 (1 - \alpha_l^2)]$ and $x$ is a test point. Here $\alpha_l^2$ is the $l$th largest eigenvalue in the computation of $\Theta$.

• Then the Mahalanobis distance is given by

$$\delta_J(x, \hat\mu_k) = \sum_{l=1}^{K-1} w_l \left(\eta_l(x) - \bar\eta_l^k\right)^2 + D(x),$$

where $\bar\eta_l^k = N_k^{-1} \sum_{g_i = k} \eta_l(x_i)$ and $w_l = 1/[\alpha_l^2 (1 - \alpha_l^2)]$.


Generalization of LDA

• FDA: Allow non-linear decision boundaries

• PDA: Expand the predictors into a large basis set, and then penalize its coefficients to be smooth

• MDA: Model each class by a mixture of two or more Gaussians with different centroids but the same covariance, rather than a single Gaussian distribution as in LDA


Flexible Discriminant Analysis (Hastie et al., 1994)


FDA: Overview

• The optimal scoring method provides a starting point for generalizing LDA to a nonparametric version.

• We replace the linear projection operator $P_X$ by a nonparametric regression procedure, which we denote by the linear operator $S$.

• One simple and effective approach toward this end is to expand $X$ into a larger set of basis variables $h(X)$ and then simply use $S = P_{h(X)}$ in place of $P_X$.


FDA: Overview

• These regression problems are defined via the criterion

$$\mathrm{ASR}\left(\{\theta_l, \eta_l\}_{l=1}^{L}\right) = \frac{1}{N} \sum_{l=1}^{L} \left[ \sum_{i=1}^{N} \left(\theta_l(g_i) - \eta_l(x_i)\right)^2 + \lambda J(\eta_l) \right],$$

where $J$ is a regularizer appropriate for some forms of nonparametric regression (e.g., smoothing splines, additive splines and lower-order ANOVA models).


FDA by optimal scoring

Way to the solution

1. Initialize: Form the $N \times K$ indicator matrix $Y$.

2. Multivariate nonparametric regression: Fit a multi-response adaptive nonparametric regression of $Y$ on $X$, giving fitted values $\hat Y$. Let $S_\lambda$ be the linear operator that fits the final chosen model and let $\eta^*(x)$ be the vector of fitted regression functions.

3. Optimal scores: Compute the eigen-decomposition of $Y^T \hat Y = Y^T S_\lambda Y$, where the eigenvectors $\Theta$ are normalized: $\Theta^T D_p \Theta = I_K$.

4. Update: Update the final model from step 2 using the optimal scores: $\eta(x) \leftarrow \Theta^T \eta^*(x)$.
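
Under the basis-expansion view of the previous slides, one minimal way to sketch this is to expand X into h(X) and reuse the linear optimal-scoring recipe, so that P_X becomes P_{h(X)}. The quadratic expansion below is just one illustrative choice of h, and lda_optimal_scoring refers to the sketch given after the LDA-by-optimal-scoring slide.

```python
import numpy as np

def quadratic_expansion(X):
    """h(X): intercept, original coordinates, squares and pairwise products."""
    N, p = X.shape
    cols = [np.ones((N, 1)), X, X ** 2]
    cols += [(X[:, i] * X[:, j])[:, None] for i in range(p) for j in range(i + 1, p)]
    return np.hstack(cols)

# Usage sketch: discriminant functions eta(x) = B^T h(x) are now non-linear in x.
# H = quadratic_expansion(X)
# B, Theta = lda_optimal_scoring(H, y)
```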


Penalized Discriminant Analysis (Hastie et al., 1995)


PDA: Overview

• Although FDA is motivated by generalizing optimal scoring, it can also be viewed directly as a form of regularized discriminant analysis.

• Suppose the regression procedure used in FDA amounts to a linear regression onto a basis expansion $h(X)$, with a quadratic penalty on the coefficients:

$$\mathrm{ASR}\left(\{\theta_l, \eta_l\}_{l=1}^{L}\right) = \frac{1}{N} \sum_{l=1}^{L} \left[ \sum_{i=1}^{N} \left(\theta_l(g_i) - h^T(x_i)\beta_l\right)^2 + \lambda \beta_l^T \Omega \beta_l \right].$$

• The role of $\Omega$ is to penalize “rough” coefficient vectors.

• The steps in FDA can be viewed as a generalized form of LDA, which we call PDA.


PDA: Overview

• Enlarge the set of predictors $X$ via a basis expansion $h(X)$.

• Use (penalized) LDA in the enlarged space, where the penalized Mahalanobis distance is given by

$$D(x, \mu) = (h(x) - h(\mu))^T (\Sigma_W + \lambda\Omega)^{-1} (h(x) - h(\mu)),$$

where $\Sigma_W$ is the within-class covariance matrix of the derived variables $h(x_i)$.

• Decompose the classification subspace using a penalized metric:

$$\max_u \; u^T \Sigma_{\mathrm{Bet}} u \quad \text{subject to} \quad u^T (\Sigma_W + \lambda\Omega) u = 1,$$

where $\Sigma_{\mathrm{Bet}}$ is the between-class covariance matrix of the derived variables.
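
A small sketch of evaluating this penalized Mahalanobis distance for one point, assuming the expanded vector h(x), a class centroid in the enlarged space, Sigma_W and the penalty matrix Omega are given; lam stands for the tuning parameter lambda, and all names are illustrative.

```python
import numpy as np

def penalized_distance(hx, h_centroid, Sigma_W, Omega, lam):
    """(h(x) - h(mu))^T (Sigma_W + lam * Omega)^{-1} (h(x) - h(mu))."""
    d = hx - h_centroid
    return d @ np.linalg.solve(Sigma_W + lam * Omega, d)
```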


PDA by optimal scoring

Way to the solution

1. Initialize: Form $Y$ and $H = (h_{ij}) = (h_j(x_i))$.

2. Multivariate nonparametric regression: Fit a penalized multi-response regression of $Y$ on $H$, giving fitted values $\hat Y = S(\Omega)Y$. Here $S(\Omega) = H(H^T H + \Omega)^{-1} H^T$ is the smoother matrix of $H$ regularized by $\Omega$, and $\hat\beta = (H^T H + \Omega)^{-1} H^T Y\theta$ is the penalized least squares estimate.

3. Optimal scores: Compute the eigen-decomposition of $Y^T \hat Y = Y^T S(\Omega) Y$, where the eigenvectors $\Theta$ are normalized: $\Theta^T D_p \Theta = I_K$.

4. Update: Update the coefficients $\hat\beta$ to reflect the optimal scores.
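
A sketch of the penalized fit in step 2: a ridge-type regression of Y on the basis matrix H with smoother S(Omega) = H (H^T H + Omega)^{-1} H^T. As on the slide, Omega is taken to already absorb the tuning parameter lambda; names are illustrative.

```python
import numpy as np

def penalized_fit(H, Y, Omega):
    A = H.T @ H + Omega
    beta = np.linalg.solve(A, H.T @ Y)   # penalized least squares coefficients
    Yhat = H @ beta                      # equals S(Omega) @ Y
    return beta, Yhat
```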


Mixture Discriminant Analysis (Hastie and Tibshirani, 1996)


MDA: Overview

• Linear discriminant analysis can be viewed as a prototype classifier. Each class is represented by its centroid, and we classify to the closest using an appropriate metric.

• In many situations a single prototype is not sufficient to represent inhomogeneous classes, and mixture models are more appropriate.


MDA: Overview

• A Gaussian mixture model for the $k$th class has density

$$P(X \mid G = k) = \sum_{r=1}^{R_k} \pi_{kr}\, \phi(X; \mu_{kr}, \Sigma),$$

where the mixing proportions $\pi_{kr}$ sum to one and $R_k$ is the number of prototypes for the $k$th class.

• The class posterior probabilities are given by

$$P(G = k \mid X = x) = \frac{\sum_{r=1}^{R_k} \pi_{kr}\, \phi(x; \mu_{kr}, \Sigma)\, \Pi_k}{\sum_{l=1}^{K} \sum_{r=1}^{R_l} \pi_{lr}\, \phi(x; \mu_{lr}, \Sigma)\, \Pi_l},$$

where the $\Pi_k$ represent the class prior probabilities.
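
A minimal sketch of these two formulas, assuming a shared covariance Sigma, per-class lists of subclass centroids and mixing proportions, and class priors Pi; all names are illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

def class_density(x, mus_k, pis_k, Sigma):
    """P(X = x | G = k) under the class-k Gaussian mixture."""
    return sum(pi * multivariate_normal.pdf(x, mean=mu, cov=Sigma)
               for pi, mu in zip(pis_k, mus_k))

def class_posteriors(x, mus, pis, Pi, Sigma):
    """P(G = k | X = x) for all classes k."""
    unnorm = np.array([Pi[k] * class_density(x, mus[k], pis[k], Sigma)
                       for k in range(len(mus))])
    return unnorm / unnorm.sum()
```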


MDA: Estimation

• We estimate the parameters by maximum likelihood, using the joint log-likelihood based on $P(G, X)$:

$$\sum_{k=1}^{K} \sum_{g_i = k} \log\left[ \sum_{r=1}^{R_k} \pi_{kr}\, \phi(x_i; \mu_{kr}, \Sigma)\, \Pi_k \right].$$

• We maximize this likelihood with the EM algorithm.


MDA: Estimation

• E-step: Given the current parameters, compute the responsibility of subclass $c_{kr}$ within class $k$ for each of the class-$k$ observations ($g_i = k$):

$$p(c_{kr} \mid x_i, g_i) = \frac{\pi_{kr}\, \phi(x_i; \mu_{kr}, \Sigma)}{\sum_{r'=1}^{R_k} \pi_{kr'}\, \phi(x_i; \mu_{kr'}, \Sigma)}.$$

• M-step: Compute the weighted MLEs for the parameters of each of the component Gaussians within each of the classes, using the weights from the E-step.
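
A short sketch of the E-step for one class-k observation, assuming the current subclass centroids, mixing proportions and shared covariance are available; the M-step would then reuse the usual weighted mean and covariance formulas with these responsibilities as weights. Names are illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(xi, mus_k, pis_k, Sigma):
    """p(c_kr | x_i, g_i = k) for r = 1, ..., R_k."""
    w = np.array([pi * multivariate_normal.pdf(xi, mean=mu, cov=Sigma)
                  for pi, mu in zip(pis_k, mus_k)])
    return w / w.sum()
```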


MDA: Estimation

• The M-step is a weighted version of LDA, with $R = \sum_{k=1}^{K} R_k$ classes and $\sum_{k=1}^{K} N_k R_k$ observations.

• We can use optimal scoring as before to solve the weighted LDA problem, which allows us to use a weighted version of FDA or PDA at this stage.


MDA: Estimation

• The indicator matrix $Y_{N \times K}$ collapses in this case to a blurred response matrix $Z_{N \times R}$.

• For example,

            c11   c12   c13   c21   c22   c23   c31   c32   c33
  g1 = 2     0     0     0    0.3   0.5   0.2    0     0     0
  g2 = 1    0.9   0.1   0.0    0     0     0     0     0     0
  g3 = 1    0.1   0.8   0.1    0     0     0     0     0     0
  g4 = 3     0     0     0     0     0     0    0.5   0.4   0.1
  ...
  gN = 3     0     0     0     0     0     0    0.5   0.4   0.1

  where the entries in a class-$k$ row correspond to $p(c_{kr} \mid x_i, g_i)$.
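
A small sketch of assembling this blurred matrix, assuming y holds integer class labels 0, ..., K-1, Rk lists the number of subclasses per class, and resp[i] is the current responsibility vector (of length R_{y[i]}) for observation i; all names are illustrative.

```python
import numpy as np

def blurred_matrix(y, resp, Rk):
    """N x R matrix Z: class-k block of row i holds p(c_kr | x_i, g_i), zeros elsewhere."""
    offsets = np.concatenate(([0], np.cumsum(Rk)))
    N, R = len(y), offsets[-1]
    Z = np.zeros((N, R))
    for i, k in enumerate(y):
        Z[i, offsets[k]:offsets[k + 1]] = resp[i]
    return Z
```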


MDA: Estimation by optimal scoring

Optimal scoring within the EM steps of MDA:

1. Initialize: Start with a set of $R_k$ subclasses $c_{kr}$ for each class and associated subclass probabilities $p(c_{kr} \mid x_i, g_i)$.

2. The blurred matrix: If $g_i = k$, fill the $k$th block of $R_k$ entries in the $i$th row of $Z$ with the values $p(c_{kr} \mid x_i, g_i)$, and the rest with 0s.

3. Multivariate nonparametric regression: Fit a multi-response adaptive nonparametric regression of $Z$ on $X$, giving fitted values $\hat Z$. Let $\eta^*(x)$ be the vector of fitted regression functions.

4. Optimal scores: Let $\Theta$ be the matrix of the largest $K$ non-trivial eigenvectors of $Z^T \hat Z$, with normalization $\Theta^T D_p \Theta = I_K$.

5. Update: Update the final model from step 3 using the optimal scores: $\eta(x) \leftarrow \Theta^T \eta^*(x)$, and update $p(c_{kr} \mid x_i, g_i)$ and $\pi_{kr}$.


Performance
