
Linear Discriminant Analysis and Its Generalization
Chapters 4 and 12 of The Elements of Statistical Learning

Presented by Ilsang Ohn

Department of Statistics, Seoul National University

September 3, 2014


Contents

1 Linear Discriminant Analysis

2 Flexible Discriminant Analysis

3 Penalized Discriminant Analysis

4 Mixture Discriminant Analysis


Review of Linear Discriminant Analysis


LDA: Overview

• Linear discriminant analysis (LDA) does classification by assuming that the data within each class are normally distributed:

$$f_k(x) = P(X = x \mid G = k) = N(\mu_k, \Sigma).$$

• We allow each class to have its own mean $\mu_k \in \mathbb{R}^p$, but we assume a common variance matrix $\Sigma \in \mathbb{R}^{p \times p}$. Thus

$$f_k(x) = \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}} \exp\left\{-\frac{1}{2}(x - \mu_k)^T \Sigma^{-1} (x - \mu_k)\right\}.$$

• We want to find $k$ so that $P(G = k \mid X = x) \propto f_k(x)\pi_k$ is the largest.


LDA: Overview

• The linear discriminant functions are derived from the relation

$$\begin{aligned}
\log(f_k(x)\pi_k) &= -\frac{1}{2}(x - \mu_k)^T \Sigma^{-1} (x - \mu_k) + \log(\pi_k) + C \\
&= x^T \Sigma^{-1} \mu_k - \frac{1}{2}\mu_k^T \Sigma^{-1} \mu_k + \log(\pi_k) + C',
\end{aligned}$$

and we denote

$$\delta_k(x) = x^T \Sigma^{-1} \mu_k - \frac{1}{2}\mu_k^T \Sigma^{-1} \mu_k + \log(\pi_k).$$

• The decision rule is $G(x) = \arg\max_k \delta_k(x)$.

• The Bayes classifier is a linear classifier.
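
A minimal numerical sketch of evaluating these discriminant functions, assuming the class means, the common covariance and the priors are already known; the variable names (means, cov, priors) are illustrative and not from the slides.

```python
import numpy as np

def lda_discriminants(x, means, cov, priors):
    """delta_k(x) = x^T Sigma^{-1} mu_k - (1/2) mu_k^T Sigma^{-1} mu_k + log pi_k."""
    cov_inv = np.linalg.inv(cov)
    return np.array([x @ cov_inv @ mu - 0.5 * mu @ cov_inv @ mu + np.log(pi)
                     for mu, pi in zip(means, priors)])

# Toy example with two classes in R^2: classify x to argmax_k delta_k(x).
means = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
cov = np.array([[1.0, 0.3], [0.3, 1.0]])
priors = [0.6, 0.4]
x = np.array([1.5, 0.5])
print(np.argmax(lda_discriminants(x, means, cov, priors)))
```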


LDA: Overview

• We need to estimate the parameters based on the training data $x_i \in \mathbb{R}^p$ and $y_i \in \{1, \dots, K\}$ by

  - $\hat\pi_k = N_k / N$
  - $\hat\mu_k = N_k^{-1} \sum_{y_i = k} x_i$, the centroid of class $k$
  - $\hat\Sigma = \frac{1}{N - K} \sum_{k=1}^{K} \sum_{y_i = k} (x_i - \hat\mu_k)(x_i - \hat\mu_k)^T$, the pooled sample variance matrix

• The decision boundary between each pair of classes $k$ and $l$ is given by

$$\{x : \delta_k(x) = \delta_l(x)\},$$

which is equivalent to

$$(\hat\mu_k - \hat\mu_l)^T \hat\Sigma^{-1} x = \frac{1}{2}(\hat\mu_k + \hat\mu_l)^T \hat\Sigma^{-1} (\hat\mu_k - \hat\mu_l) - \log(\hat\pi_k / \hat\pi_l).$$
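
A small sketch of these plug-in estimates, assuming X is an N x p numpy array and y an integer label array with values 0, ..., K-1; the function and variable names are illustrative.

```python
import numpy as np

def lda_fit(X, y):
    """Class proportions, centroids and the pooled within-class covariance."""
    y = np.asarray(y)
    N, p = X.shape
    classes = np.unique(y)
    K = len(classes)
    priors = np.array([np.mean(y == k) for k in classes])        # pi_hat_k = N_k / N
    means = np.array([X[y == k].mean(axis=0) for k in classes])  # centroids mu_hat_k
    Sigma = np.zeros((p, p))
    for k, mu in zip(classes, means):
        R = X[y == k] - mu
        Sigma += R.T @ R
    Sigma /= (N - K)                                             # pooled sample variance matrix
    return priors, means, Sigma
```

With these estimates the discriminant functions of the previous slide can be evaluated directly, and the boundary between classes $k$ and $l$ is the hyperplane with normal $\hat\Sigma^{-1}(\hat\mu_k - \hat\mu_l)$.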


Fisher’s discriminant analysis

• Fisher's idea is to find a direction $v$ such that

$$\max_v \; \frac{v^T B v}{v^T W v},$$

where

  - $B = \sum_{k=1}^{K} (\bar x_k - \bar x)(\bar x_k - \bar x)^T$: between-class covariance matrix
  - $W = \sum_{k=1}^{K} \sum_{y_i = k} (x_i - \bar x_k)(x_i - \bar x_k)^T$: within-class covariance matrix, previously denoted by $(N - K)\hat\Sigma$

• This ratio is maximized by $v_1 = e_1$, the eigenvector of $W^{-1}B$ with the largest eigenvalue. The linear combination $v_1^T X$ is called the first discriminant. Similarly, one can find the next direction $v_2$ orthogonal in $W$ to $v_1$.

• Fisher's canonical discriminant analysis finds $L \le K - 1$ canonical coordinates (or a rank-$L$ subspace) that best separate the categories.
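
A sketch of computing Fisher's directions as the leading generalized eigenvectors of $(B, W)$, i.e. the top eigenvectors of $W^{-1}B$. The between-class matrix below follows the slide's unweighted form, W is assumed positive definite, and the names (X, y, n_components) are illustrative.

```python
import numpy as np
from scipy.linalg import eigh

def fisher_directions(X, y, n_components):
    y = np.asarray(y)
    classes = np.unique(y)
    p = X.shape[1]
    grand_mean = X.mean(axis=0)
    B = np.zeros((p, p))
    W = np.zeros((p, p))
    for k in classes:
        Xk = X[y == k]
        dk = Xk.mean(axis=0) - grand_mean
        B += np.outer(dk, dk)          # between-class covariance (unweighted, as on the slide)
        R = Xk - Xk.mean(axis=0)
        W += R.T @ R                   # within-class covariance, (N - K) * Sigma_hat
    evals, evecs = eigh(B, W)          # generalized symmetric eigenproblem B v = lambda W v
    order = np.argsort(evals)[::-1]
    return evecs[:, order[:n_components]]   # columns v_1, ..., v_L
```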


Fisher’s discriminant analysis

• Consequently, we have $v_1, \dots, v_L$, $L \le K - 1$, the eigenvectors of $W^{-1}B$ with non-zero eigenvalues.

• Fisher's discriminant rule assigns $x$ to the class closest in Mahalanobis distance, so the rule is given by

$$\begin{aligned}
G'(x) &= \arg\min_k \sum_{l=1}^{L} [v_l^T (x - \bar x_k)]^2 \\
&= \arg\min_k (x - \bar x_k)^T \hat\Sigma^{-1} (x - \bar x_k) \\
&= \arg\min_k \left(-2\delta_k(x) + x^T \hat\Sigma^{-1} x + 2\log \hat\pi_k\right) \\
&= \arg\max_k \left(\delta_k(x) - \log \hat\pi_k\right).
\end{aligned}$$

• Thus Fisher's rule is equivalent to the Gaussian classification rule with equal prior probabilities.


LDA by optimal scoring

• The standard way of carrying out a (Fisher's) canonical discriminant analysis is by way of a suitable SVD.

• There is a somewhat different approach: optimal scoring.

• This method performs LDA using linear regression on derived responses.


LDA by optimal scoring

• Recall $\mathcal{G} = \{1, \dots, K\}$.

• $\theta : \mathcal{G} \to \mathbb{R}$ is a function that assigns scores to the classes such that the transformed class labels are optimally predicted by linear regression on $X$.

• We find $L \le K - 1$ sets of independent scorings for the class labels, $\theta_1, \dots, \theta_L$, and $L$ corresponding linear maps $\eta_l(X) = X^T \beta_l$ chosen to be optimal for multiple regression in $\mathbb{R}^p$.

• $\theta_l$ and $\beta_l$ are chosen to minimize

$$\mathrm{ASR} = \frac{1}{N} \sum_{l=1}^{L} \left[ \sum_{i=1}^{N} \left(\theta_l(g_i) - x_i^T \beta_l\right)^2 \right].$$


LDA by optimal scoring

Notation

• $Y$: $N \times K$ indicator matrix

• $P_X = X(X^T X)^{-1} X^T$: projection matrix onto the column space of the predictors

• $\Theta$: $K \times L$ matrix of $L$ score vectors for the $K$ classes

• $\Theta^* = Y\Theta$: $N \times L$ matrix with $\Theta^*_{ij} = \theta_j(g_i)$


LDA by optimal scoring

Problem

• Minimize ASR by regressing $\Theta^*$ on $X$; that is, find $\Theta$ that minimizes

$$\mathrm{ASR}(\Theta) = \mathrm{tr}\left(\Theta^{*T} (I - P_X) \Theta^*\right)/N = \mathrm{tr}\left(\Theta^T Y^T (I - P_X) Y \Theta\right)/N.$$

• $\mathrm{ASR}(\Theta)$ is minimized by taking $\Theta$ to be the $L$ largest eigenvectors of $Y^T P_X Y$, with normalization $\Theta^T D_p \Theta = I_L$.

• Here $D_p = Y^T Y / N$ is a diagonal matrix of the sample class proportions $N_j / N$.


LDA by optimal scoring

Way to the solution

1. Initialize: Form the $N \times K$ indicator matrix $Y$.

2. Multivariate regression: Set $\hat Y = P_X Y$ and denote the $p \times K$ coefficient matrix by $B$: $\hat Y = XB$.

3. Optimal scores: Obtain the eigenvector matrix $\Theta$ of $Y^T \hat Y = Y^T P_X Y$ with normalization $\Theta^T D_p \Theta = I$.

4. Update: Update the coefficient matrix in step 2 to reflect the optimal scores: $B \leftarrow B\Theta$. The final optimally scaled regression fit is the $(K-1)$-vector function $\eta(x) = B^T x$.
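
A compact sketch of these four steps, assuming X is an N x p numpy array with full column rank, y an integer label array, and no intercept handling; the trivial constant score is not removed, and all names are illustrative rather than a reference implementation.

```python
import numpy as np
from scipy.linalg import eigh

def lda_optimal_scoring(X, y, L=None):
    y = np.asarray(y)
    N, p = X.shape
    classes = np.unique(y)
    K = len(classes)
    L = K - 1 if L is None else L
    # 1. indicator response matrix Y (N x K)
    Y = (y[:, None] == classes[None, :]).astype(float)
    # 2. multivariate regression: B = (X^T X)^{-1} X^T Y, Yhat = X B = P_X Y
    B = np.linalg.solve(X.T @ X, X.T @ Y)
    Yhat = X @ B
    # 3. optimal scores: generalized eigenvectors of Y^T P_X Y w.r.t. D_p,
    #    normalized so that Theta^T D_p Theta = I
    Dp = np.diag(Y.mean(axis=0))
    M = (Y.T @ Yhat) / N
    evals, Theta = eigh(M, Dp)
    order = np.argsort(evals)[::-1]
    Theta = Theta[:, order[:L]]
    # 4. update the coefficients: eta(x) = (B Theta)^T x
    B = B @ Theta
    return B, Theta
```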


LDA by optimal scoring

• The sequence of discriminant vectors $\nu_l$ in LDA is identical to the sequence $\beta_l$ up to a constant.

• That is, the coefficient matrix $B$ is, up to a diagonal scale matrix, the same as the discriminant analysis coefficient matrix,

$$V^T x = D B^T x = D\eta(x),$$

where $D_{ll} = 1/[\alpha_l^2 (1 - \alpha_l^2)]$ and $x$ is a test point. Here $\alpha_l^2$ is the $l$th largest eigenvalue in the computation of $\Theta$.

• Then the Mahalanobis distance is given by

$$\delta_J(x, \hat\mu_k) = \sum_{l=1}^{K-1} w_l \left(\eta_l(x) - \bar\eta_l^k\right)^2 + D(x),$$

where $\bar\eta_l^k = N_k^{-1} \sum_{g_i = k} \eta_l(x_i)$ and $w_l = 1/[\alpha_l^2 (1 - \alpha_l^2)]$.


Generalization of LDA

• FDA: Allow non-linear decision boundaries

• PDA: Expand the predictors into a large basis set, and then penalize its coefficients to be smooth

• MDA: Model each class by a mixture of two or more Gaussians with different centroids but the same covariance, rather than a single Gaussian distribution as in LDA


Flexible Discriminant Analysis (Hastie et al., 1994)


FDA: Overview

• The optimal scoring method provides a starting point for generalizing LDA to a nonparametric version.

• We replace the linear projection operator $P_X$ by a nonparametric regression procedure, which we denote by the linear operator $S$.

• One simple and effective approach toward this end is to expand $X$ into a larger set of basis variables $h(X)$ and then simply use $S = P_{h(X)}$ in place of $P_X$.


FDA: Overview

• These regression problems are defined via the criterion

$$\mathrm{ASR}\left(\{\theta_l, \eta_l\}_{l=1}^{L}\right) = \frac{1}{N} \sum_{l=1}^{L} \left[ \sum_{i=1}^{N} \left(\theta_l(g_i) - \eta_l(x_i)\right)^2 + \lambda J(\eta_l) \right],$$

where $J$ is a regularizer appropriate for some forms of nonparametric regression (e.g., smoothing splines, additive splines and lower-order ANOVA models).


FDA by optimal scoring

Way to the solution

1. Initialize: Form the $N \times K$ indicator matrix $Y$.

2. Multivariate nonparametric regression: Fit a multi-response adaptive nonparametric regression of $Y$ on $X$, giving fitted values $\hat Y$. Let $S_\lambda$ be the linear operator that fits the final chosen model and let $\eta^*(x)$ be the vector of fitted regression functions.

3. Optimal scores: Compute the eigen-decomposition of $Y^T \hat Y = Y^T S_\lambda Y$, where the eigenvectors $\Theta$ are normalized: $\Theta^T D_p \Theta = I_K$.

4. Update: Update the final model from step 2 using the optimal scores: $\eta(x) \leftarrow \Theta^T \eta^*(x)$.
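
Under the basis-expansion view of the previous slides, one minimal way to sketch this is to expand X into h(X) and reuse the linear optimal-scoring recipe, so that P_X becomes P_{h(X)}. The quadratic expansion below is just one illustrative choice of h, and lda_optimal_scoring refers to the sketch given after the LDA-by-optimal-scoring slide.

```python
import numpy as np

def quadratic_expansion(X):
    """h(X): intercept, original coordinates, squares and pairwise products."""
    N, p = X.shape
    cols = [np.ones((N, 1)), X, X ** 2]
    cols += [(X[:, i] * X[:, j])[:, None] for i in range(p) for j in range(i + 1, p)]
    return np.hstack(cols)

# Usage sketch: discriminant functions eta(x) = B^T h(x) are now non-linear in x.
# H = quadratic_expansion(X)
# B, Theta = lda_optimal_scoring(H, y)
```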


Penalized Discriminant Analysis (Hastie et al., 1995)


PDA: Overview

• Although FDA is motivated by generalizing optimal scoring, it can also be viewed directly as a form of regularized discriminant analysis.

• Suppose the regression procedure used in FDA amounts to a linear regression onto a basis expansion $h(X)$, with a quadratic penalty on the coefficients:

$$\mathrm{ASR}\left(\{\theta_l, \eta_l\}_{l=1}^{L}\right) = \frac{1}{N} \sum_{l=1}^{L} \left[ \sum_{i=1}^{N} \left(\theta_l(g_i) - h^T(x_i)\beta_l\right)^2 + \lambda \beta_l^T \Omega \beta_l \right].$$

• The role of $\Omega$ is to penalize “rough” coefficient vectors.

• The steps in FDA can be viewed as a generalized form of LDA, which we call PDA.


PDA: Overview

• Enlarge the set of predictors $X$ via a basis expansion $h(X)$.

• Use (penalized) LDA in the enlarged space, where the penalized Mahalanobis distance is given by

$$D(x, \mu) = (h(x) - h(\mu))^T (\Sigma_W + \lambda\Omega)^{-1} (h(x) - h(\mu)),$$

where $\Sigma_W$ is the within-class covariance matrix of the derived variables $h(x_i)$.

• Decompose the classification subspace using a penalized metric:

$$\max_u \; u^T \Sigma_{\mathrm{Bet}} u \quad \text{subject to} \quad u^T (\Sigma_W + \lambda\Omega) u = 1,$$

where $\Sigma_{\mathrm{Bet}}$ is the between-class covariance matrix of the derived variables.
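
A small sketch of evaluating this penalized Mahalanobis distance for one point, assuming the expanded vector h(x), a class centroid in the enlarged space, Sigma_W and the penalty matrix Omega are given; lam stands for the tuning parameter lambda, and all names are illustrative.

```python
import numpy as np

def penalized_distance(hx, h_centroid, Sigma_W, Omega, lam):
    """(h(x) - h(mu))^T (Sigma_W + lam * Omega)^{-1} (h(x) - h(mu))."""
    d = hx - h_centroid
    return d @ np.linalg.solve(Sigma_W + lam * Omega, d)
```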


PDA by optimal scoring

Way to the solution

1. Initialize: Form $Y$ and $H = (h_{ij}) = (h_j(x_i))$.

2. Multivariate nonparametric regression: Fit a penalized multi-response regression of $Y$ on $H$, giving fitted values $\hat Y = S(\Omega)Y$. Here $S(\Omega) = H(H^T H + \Omega)^{-1} H^T$ is the smoother matrix of $H$ regularized by $\Omega$, and $\hat\beta = (H^T H + \Omega)^{-1} H^T Y\theta$ is the penalized least squares estimate.

3. Optimal scores: Compute the eigen-decomposition of $Y^T \hat Y = Y^T S(\Omega) Y$, where the eigenvectors $\Theta$ are normalized: $\Theta^T D_p \Theta = I_K$.

4. Update: Update the coefficients $\hat\beta$ to reflect the optimal scores.
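
A sketch of the penalized fit in step 2: a ridge-type regression of Y on the basis matrix H with smoother S(Omega) = H (H^T H + Omega)^{-1} H^T. As on the slide, Omega is taken to already absorb the tuning parameter lambda; names are illustrative.

```python
import numpy as np

def penalized_fit(H, Y, Omega):
    A = H.T @ H + Omega
    beta = np.linalg.solve(A, H.T @ Y)   # penalized least squares coefficients
    Yhat = H @ beta                      # equals S(Omega) @ Y
    return beta, Yhat
```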


Mixture Discriminant Analysis (Hastie and Tibshirani, 1996)


MDA: Overview

• Linear discriminant analysis can be viewed as a prototype classifier. Each class is represented by its centroid, and we classify to the closest using an appropriate metric.

• In many situations a single prototype is not sufficient to represent inhomogeneous classes, and mixture models are more appropriate.


MDA: Overview

• A Gaussian mixture model for the $k$th class has density

$$P(X \mid G = k) = \sum_{r=1}^{R_k} \pi_{kr}\, \phi(X; \mu_{kr}, \Sigma),$$

where the mixing proportions $\pi_{kr}$ sum to one and $R_k$ is the number of prototypes for the $k$th class.

• The class posterior probabilities are given by

$$P(G = k \mid X = x) = \frac{\sum_{r=1}^{R_k} \pi_{kr}\, \phi(x; \mu_{kr}, \Sigma)\, \Pi_k}{\sum_{l=1}^{K} \sum_{r=1}^{R_l} \pi_{lr}\, \phi(x; \mu_{lr}, \Sigma)\, \Pi_l},$$

where the $\Pi_k$ represent the class prior probabilities.
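
A minimal sketch of these two formulas, assuming a shared covariance Sigma, per-class lists of subclass centroids and mixing proportions, and class priors Pi; all names are illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

def class_density(x, mus_k, pis_k, Sigma):
    """P(X = x | G = k) under the class-k Gaussian mixture."""
    return sum(pi * multivariate_normal.pdf(x, mean=mu, cov=Sigma)
               for pi, mu in zip(pis_k, mus_k))

def class_posteriors(x, mus, pis, Pi, Sigma):
    """P(G = k | X = x) for all classes k."""
    unnorm = np.array([Pi[k] * class_density(x, mus[k], pis[k], Sigma)
                       for k in range(len(mus))])
    return unnorm / unnorm.sum()
```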


MDA: Estimation

• We estimate the parameters by maximum likelihood, using the joint log-likelihood based on $P(G, X)$:

$$\sum_{k=1}^{K} \sum_{g_i = k} \log\left[ \sum_{r=1}^{R_k} \pi_{kr}\, \phi(x_i; \mu_{kr}, \Sigma)\, \Pi_k \right].$$

• We maximize this likelihood with the EM algorithm.


MDA: Estimation

• E-step: Given the current parameters, compute the responsibility of subclass $c_{kr}$ within class $k$ for each of the class-$k$ observations ($g_i = k$):

$$p(c_{kr} \mid x_i, g_i) = \frac{\pi_{kr}\, \phi(x_i; \mu_{kr}, \Sigma)}{\sum_{r'=1}^{R_k} \pi_{kr'}\, \phi(x_i; \mu_{kr'}, \Sigma)}.$$

• M-step: Compute the weighted MLEs for the parameters of each of the component Gaussians within each of the classes, using the weights from the E-step.
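
A short sketch of the E-step for one class-k observation, assuming the current subclass centroids, mixing proportions and shared covariance are available; the M-step would then reuse the usual weighted mean and covariance formulas with these responsibilities as weights. Names are illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(xi, mus_k, pis_k, Sigma):
    """p(c_kr | x_i, g_i = k) for r = 1, ..., R_k."""
    w = np.array([pi * multivariate_normal.pdf(xi, mean=mu, cov=Sigma)
                  for pi, mu in zip(pis_k, mus_k)])
    return w / w.sum()
```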


MDA: Estimation

• The M-step is a weighted version of LDA, with $R = \sum_{k=1}^{K} R_k$ classes and $\sum_{k=1}^{K} N_k R_k$ observations.

• We can use optimal scoring as before to solve the weighted LDA problem, which allows us to use a weighted version of FDA or PDA at this stage.


MDA: Estimation

• The indicator matrix $Y_{N \times K}$ collapses in this case to a blurred response matrix $Z_{N \times R}$.

• For example,

            c11   c12   c13   c21   c22   c23   c31   c32   c33
  g1 = 2     0     0     0    0.3   0.5   0.2    0     0     0
  g2 = 1    0.9   0.1   0.0    0     0     0     0     0     0
  g3 = 1    0.1   0.8   0.1    0     0     0     0     0     0
  g4 = 3     0     0     0     0     0     0    0.5   0.4   0.1
  ...
  gN = 3     0     0     0     0     0     0    0.5   0.4   0.1

  where the entries in a class-$k$ row correspond to $p(c_{kr} \mid x_i, g_i)$.
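
A small sketch of assembling this blurred matrix, assuming y holds integer class labels 0, ..., K-1, Rk lists the number of subclasses per class, and resp[i] is the current responsibility vector (of length R_{y[i]}) for observation i; all names are illustrative.

```python
import numpy as np

def blurred_matrix(y, resp, Rk):
    """N x R matrix Z: class-k block of row i holds p(c_kr | x_i, g_i), zeros elsewhere."""
    offsets = np.concatenate(([0], np.cumsum(Rk)))
    N, R = len(y), offsets[-1]
    Z = np.zeros((N, R))
    for i, k in enumerate(y):
        Z[i, offsets[k]:offsets[k + 1]] = resp[i]
    return Z
```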


MDA: Estimation by optimal scoring

Optimal scoring within the EM steps of MDA:

1. Initialize: Start with a set of $R_k$ subclasses $c_{kr}$ for each class and associated subclass probabilities $p(c_{kr} \mid x_i, g_i)$.

2. The blurred matrix: If $g_i = k$, fill the $k$th block of $R_k$ entries in the $i$th row of $Z$ with the values $p(c_{kr} \mid x_i, g_i)$, and the rest with 0s.

3. Multivariate nonparametric regression: Fit a multi-response adaptive nonparametric regression of $Z$ on $X$, giving fitted values $\hat Z$. Let $\eta^*(x)$ be the vector of fitted regression functions.

4. Optimal scores: Let $\Theta$ be the matrix of the largest $K$ non-trivial eigenvectors of $Z^T \hat Z$, with normalization $\Theta^T D_p \Theta = I_K$.

5. Update: Update the final model from step 3 using the optimal scores: $\eta(x) \leftarrow \Theta^T \eta^*(x)$, and update $p(c_{kr} \mid x_i, g_i)$ and $\pi_{kr}$.


Performance
