
Principal Components Analysis

BMTRY 726, 3/27/14

Uses

Goal: Explain the variability of a set of variables using a "small" set of linear combinations of those variables.

Why: There are several reasons we may want to do this:

(1) Dimension reduction (use k of p components). Note: the total variability still requires all p components.

(2) Identify "hidden" underlying relationships (i.e., patterns in the data) and use these relationships in further analyses.

(3) Select subsets of variables.

"Exact" Principal Components

Our data X consist of p random measurements taken on j = 1, 2, …, n subjects; we can represent these data using linear combinations of the p measurements.

"Exact" Principal Components

Principal components are the linear combinations $Y_1, Y_2, \ldots, Y_p$ that are:

(1) Uncorrelated
(2) Of variance as large as possible
(3) Subject to:

1st PC: $Y_1 = \mathbf{a}_1'\mathbf{X}$ maximizes $\operatorname{Var}(\mathbf{a}_1'\mathbf{X})$ subject to $\mathbf{a}_1'\mathbf{a}_1 = 1$

2nd PC: $Y_2 = \mathbf{a}_2'\mathbf{X}$ maximizes $\operatorname{Var}(\mathbf{a}_2'\mathbf{X})$ subject to $\mathbf{a}_2'\mathbf{a}_2 = 1$ and $\operatorname{Cov}(\mathbf{a}_1'\mathbf{X}, \mathbf{a}_2'\mathbf{X}) = 0$

p-th PC: $Y_p = \mathbf{a}_p'\mathbf{X}$ maximizes $\operatorname{Var}(\mathbf{a}_p'\mathbf{X})$ subject to $\mathbf{a}_p'\mathbf{a}_p = 1$ and $\operatorname{Cov}(\mathbf{a}_i'\mathbf{X}, \mathbf{a}_p'\mathbf{X}) = 0$ for $i < p$

Finding PC's Under Constraints

• So how do we find PC's that meet the constraints we just discussed?

• We want to maximize $\operatorname{Var}(Y_i) = \operatorname{Var}(\mathbf{a}_i'\mathbf{X}) = \mathbf{a}_i'\boldsymbol{\Sigma}\mathbf{a}_i$ subject to the constraint that $\mathbf{a}_i'\mathbf{a}_i = 1$

• This constrained maximization problem can be done using the method of Lagrange multipliers

• Thus we want to maximize the function

$$\mathbf{a}_i'\boldsymbol{\Sigma}\mathbf{a}_i - \lambda_i(\mathbf{a}_i'\mathbf{a}_i - 1)$$

Finding PC's Under Constraints

• Differentiate with respect to $\mathbf{a}_i$ and set the result to zero:

$$\frac{\partial}{\partial \mathbf{a}_i}\left[\mathbf{a}_i'\boldsymbol{\Sigma}\mathbf{a}_i - \lambda_i(\mathbf{a}_i'\mathbf{a}_i - 1)\right] = 2\boldsymbol{\Sigma}\mathbf{a}_i - 2\lambda_i\mathbf{a}_i = \mathbf{0} \;\Rightarrow\; (\boldsymbol{\Sigma} - \lambda_i\mathbf{I})\mathbf{a}_i = \mathbf{0}$$

so $\mathbf{a}_i$ must be an eigenvector of $\boldsymbol{\Sigma}$ with eigenvalue $\lambda_i$.

Finding PC's Under Constraints

• But how do we choose our eigenvector (i.e., which eigenvector corresponds to which PC)?

• We can see that what we want to maximize is

$$\mathbf{a}_i'\boldsymbol{\Sigma}\mathbf{a}_i = \lambda_i\mathbf{a}_i'\mathbf{a}_i = \lambda_i$$

• So we choose $\lambda_i$ to be as large as possible

• If $\lambda_1$ is our largest eigenvalue with corresponding eigenvector $\mathbf{e}_1$, then the solution for our maximum is $\mathbf{a}_1 = \mathbf{e}_1$

Finding PC's Under Constraints

• Recall we had a second constraint:

$$\operatorname{Cov}(Y_i, Y_k) = \operatorname{Cov}(\mathbf{a}_i'\mathbf{X}, \mathbf{a}_k'\mathbf{X}) = 0$$

• We could conduct a second Lagrangian maximization to find our second PC

• However, we already know that the eigenvectors of $\boldsymbol{\Sigma}$ are orthogonal, so this constraint is automatically met

• We choose the order of the PCs by the magnitude of the eigenvalues

"Exact" Principal Components

So we can compute the PCs from the variance matrix of $\mathbf{X}$, $\boldsymbol{\Sigma}$:

1. $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \ge 0$ are the eigenvalues of $\boldsymbol{\Sigma}$

2. $\mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_p$ are the corresponding eigenvectors of $\boldsymbol{\Sigma}$ such that

$$\boldsymbol{\Sigma}\mathbf{e}_i = \lambda_i\mathbf{e}_i, \qquad \mathbf{e}_i'\mathbf{e}_i = 1, \qquad \mathbf{e}_i'\mathbf{e}_j = 0 \text{ for } i \ne j$$

This yields our $i$-th principal component:

$$Y_i = \mathbf{e}_i'\mathbf{X} = e_{i1}X_1 + e_{i2}X_2 + \cdots + e_{ip}X_p$$
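To make steps 1 and 2 concrete, here is a minimal numpy sketch (not from the original slides) that extracts the ordered eigenvalue/eigenvector pairs of a covariance matrix and forms the PC combinations. The 2×2 matrix is the Σ used later in the standardized/non-standardized comparison; the observation x is just an arbitrary illustration.

```python
import numpy as np

# A small example covariance matrix (the same 2x2 Sigma used later
# in the standardized vs. non-standardized comparison).
Sigma = np.array([[1.0, 4.0],
                  [4.0, 100.0]])

# eigh returns eigenvalues in ascending order for symmetric matrices;
# reverse so that lambda_1 >= lambda_2 >= ... >= lambda_p.
eigvals, eigvecs = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]
lam = eigvals[order]              # lambda_1, ..., lambda_p
E = eigvecs[:, order]             # columns are e_1, ..., e_p

print(lam)       # approx [100.16, 0.84]
print(E[:, 0])   # first eigenvector, approx +/-(0.04, 0.999)

# The i-th principal component of an observation x is y_i = e_i' x
x = np.array([1.0, 2.0])
y = E.T @ x
```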

Properties

We can also find the moments of our PC's.

First PC: $Y_1 = \mathbf{e}_1'\mathbf{X} = e_{11}X_1 + e_{12}X_2 + \cdots + e_{1p}X_p$

$$E(Y_1) = \mathbf{e}_1'\boldsymbol{\mu}, \qquad \operatorname{Var}(Y_1) = \mathbf{e}_1'\boldsymbol{\Sigma}\mathbf{e}_1 = \lambda_1$$

Properties

We can also find the moments of our PC's.

$k$-th PC: $Y_k = \mathbf{e}_k'\mathbf{X} = e_{k1}X_1 + e_{k2}X_2 + \cdots + e_{kp}X_p$

$$E(Y_k) = \mathbf{e}_k'\boldsymbol{\mu}, \qquad \operatorname{Var}(Y_k) = \mathbf{e}_k'\boldsymbol{\Sigma}\mathbf{e}_k = \lambda_k, \qquad \operatorname{Cov}(Y_i, Y_k) = \mathbf{e}_i'\boldsymbol{\Sigma}\mathbf{e}_k = 0 \;\;(i \ne k)$$

Properties

A normality assumption is not required to find the PC's. If $\mathbf{X}_j \sim N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ then:

$$\mathbf{Y}_j = \boldsymbol{\Gamma}'\mathbf{X}_j = (\mathbf{e}_1'\mathbf{X}_j, \mathbf{e}_2'\mathbf{X}_j, \ldots, \mathbf{e}_p'\mathbf{X}_j)' \sim N_p(\boldsymbol{\Gamma}'\boldsymbol{\mu}, \boldsymbol{\Lambda})$$

where $\boldsymbol{\Gamma} = [\mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_p]$ and $\boldsymbol{\Lambda} = \operatorname{diag}(\lambda_1, \lambda_2, \ldots, \lambda_p)$, and $Y_1, Y_2, \ldots, Y_p$ are independent.

Total variance:

$$\operatorname{trace}(\boldsymbol{\Sigma}) = \operatorname{Var}(X_1) + \operatorname{Var}(X_2) + \cdots + \operatorname{Var}(X_p) = \lambda_1 + \lambda_2 + \cdots + \lambda_p = \operatorname{Var}(Y_1) + \operatorname{Var}(Y_2) + \cdots + \operatorname{Var}(Y_p)$$

and the proportion of total variance accounted for by the $k$-th component is

$$\frac{\lambda_k}{\lambda_1 + \lambda_2 + \cdots + \lambda_p}$$
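Continuing the numpy sketch above, the total-variance identity and the per-component proportions are easy to check numerically (same illustrative 2×2 Σ; not part of the slides):

```python
import numpy as np

Sigma = np.array([[1.0, 4.0],
                  [4.0, 100.0]])
lam = np.sort(np.linalg.eigvalsh(Sigma))[::-1]

# trace(Sigma) = sum of the Var(X_i) = sum of the eigenvalues
print(np.trace(Sigma), lam.sum())        # both 101.0

# proportion of total variance accounted for by each component
print(lam / lam.sum())                   # approx [0.992, 0.008]
```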

Principal Components

Consider data with p random measures on j = 1, 2, …, n subjects. For the j-th subject we then have the random vector

$$\mathbf{X}_j = \begin{pmatrix} X_{1j} \\ X_{2j} \\ \vdots \\ X_{pj} \end{pmatrix}, \qquad j = 1, 2, \ldots, n$$

Suppose $\mathbf{X}_j \sim N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$; if we set p = 2, we know what the density of $\mathbf{X}$ looks like.

[Figure: contours of a bivariate normal density in the $(X_1, X_2)$ plane, centered at $(\mu_1, \mu_2)$.]

Graphic Representation

Suppose $\mathbf{X}_j = (X_{1j}, X_{2j}, \ldots, X_{pj})' \sim N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$. The density of $\mathbf{X}$ is constant on the ellipsoid

$$(\mathbf{x} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu}) = c^2$$

Recall the spectral decomposition $\boldsymbol{\Sigma} = \mathbf{P}\boldsymbol{\Lambda}\mathbf{P}'$, where $\mathbf{P} = [\mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_p]$, $\boldsymbol{\Lambda} = \operatorname{diag}(\lambda_1, \ldots, \lambda_p)$, and $\mathbf{P}'\mathbf{P} = \mathbf{P}\mathbf{P}' = \mathbf{I}$, so that

$$\boldsymbol{\Sigma}^{-1} = \mathbf{P}\boldsymbol{\Lambda}^{-1}\mathbf{P}' = \sum_{i=1}^{p}\frac{1}{\lambda_i}\mathbf{e}_i\mathbf{e}_i'$$

Note (taking $\boldsymbol{\mu} = \mathbf{0}$ for simplicity):

$$c^2 = \mathbf{x}'\boldsymbol{\Sigma}^{-1}\mathbf{x} = \sum_{i=1}^{p}\frac{1}{\lambda_i}(\mathbf{e}_i'\mathbf{x})^2 = \frac{y_1^2}{\lambda_1} + \frac{y_2^2}{\lambda_2} + \cdots + \frac{y_p^2}{\lambda_p}, \qquad \text{where } y_i = \mathbf{e}_i'\mathbf{x}$$

So the principal components $y_1, \ldots, y_p$ lie along the axes of this ellipsoid.

Graphic Representation

Now suppose $X_1, X_2 \sim N_2(\boldsymbol{\mu}, \boldsymbol{\Sigma})$.

The $Y_1$ axis is selected to maximize the variation in the scores, $\sum_{j=1}^{n}(y_{1j} - \bar{y}_1)^2$.

The $Y_2$ axis must be orthogonal to $Y_1$ and maximize the variation in the scores, $\sum_{j=1}^{n}(y_{2j} - \bar{y}_2)^2$.

[Figure: rotated axes $Y_1$ and $Y_2$ overlaid on the $(X_1, X_2)$ ellipse.]

Dimension Reduction

The proportion of total variance accounted for by the first k components is

$$\frac{\sum_{i=1}^{k}\lambda_i}{\sum_{i=1}^{p}\lambda_i}$$

If the proportion of variance accounted for by the first k principal components is large, we might want to restrict our attention to only these first k components.

Keep in mind, components are simply linear combinations of the original p measurements.

Ideally we look for meaningful interpretations of our chosen k components.
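A minimal sketch of this rule, assuming a vector of eigenvalues `lam` ordered largest-first (here the illustrative 2×2 example from earlier): pick the smallest k whose cumulative proportion meets a chosen threshold.

```python
import numpy as np

lam = np.array([100.16, 0.84])      # eigenvalues, largest first
prop = lam / lam.sum()              # proportion per component
cum = np.cumsum(prop)               # cumulative proportion

threshold = 0.90
k = int(np.argmax(cum >= threshold)) + 1   # smallest k meeting the threshold
print(cum, k)                              # [0.992, 1.0], k = 1
```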

PC's from Standardized Variables

We may want to standardize our variables before finding PCs:

$$Z_1 = \frac{X_1 - \mu_1}{\sqrt{\sigma_{11}}}, \qquad Z_2 = \frac{X_2 - \mu_2}{\sqrt{\sigma_{22}}}, \qquad \ldots, \qquad Z_p = \frac{X_p - \mu_p}{\sqrt{\sigma_{pp}}}$$

In matrix form: $\mathbf{Z} = \mathbf{V}^{-1/2}(\mathbf{X} - \boldsymbol{\mu})$, where $\mathbf{V}^{1/2} = \operatorname{diag}(\sqrt{\sigma_{11}}, \sqrt{\sigma_{22}}, \ldots, \sqrt{\sigma_{pp}})$. Then

$$\operatorname{Cov}(\mathbf{Z}) = \mathbf{V}^{-1/2}\boldsymbol{\Sigma}\mathbf{V}^{-1/2} = \boldsymbol{\rho}, \qquad \rho_{ij} = \frac{\sigma_{ij}}{\sqrt{\sigma_{ii}}\sqrt{\sigma_{jj}}}$$

PC's from Standardized Variables

So the covariance matrix of $\mathbf{Z}$ equals the correlation matrix of $\mathbf{X}$. We can define our PC's for $\mathbf{Z}$ the same way as before:

$$i\text{-th PC: } Y_i = \mathbf{e}_i'\mathbf{Z} = \mathbf{e}_i'\mathbf{V}^{-1/2}(\mathbf{X} - \boldsymbol{\mu})$$

but now $\lambda_i$ and $\mathbf{e}_i$ are the eigenvalues/eigenvectors of $\boldsymbol{\rho}$.

Because the variables are standardized, $\operatorname{Var}(Z_i) = 1$ and

$$\sum_{i=1}^{p}\operatorname{Var}(Y_i) = \sum_{i=1}^{p}\lambda_i = \operatorname{trace}(\boldsymbol{\rho}) = p$$
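A short numpy sketch of the standardized version, using a hypothetical data matrix X with subjects in rows: standardizing the columns and taking the covariance is the same as working with the correlation matrix ρ.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3)) * [1.0, 5.0, 20.0]   # hypothetical data, unequal scales

# Standardize: Z = V^{-1/2}(X - mean)
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# The covariance of Z is the correlation matrix of X
rho = np.corrcoef(X, rowvar=False)
print(np.allclose(np.cov(Z, rowvar=False), rho))   # True

# PCs of the standardized variables: eigen-pairs of rho
lam, E = np.linalg.eigh(rho)
lam, E = lam[::-1], E[:, ::-1]
print(lam.sum())   # equals p = 3, since trace(rho) = p
```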

Compare Standardized/Non-standardized PCs

$$\boldsymbol{\Sigma} = \begin{pmatrix} 1 & 4 \\ 4 & 100 \end{pmatrix}, \qquad \boldsymbol{\rho} = \begin{pmatrix} 1 & 0.4 \\ 0.4 & 1 \end{pmatrix}$$

Non-standardized: $\lambda_1 = 100.16$, $\mathbf{e}_1' = (0.04,\ 0.999)$; $\lambda_2 = 0.84$, $\mathbf{e}_2' = (0.999,\ -0.04)$

Standardized: $\lambda_1 = 1.4$, $\mathbf{e}_1' = (0.707,\ 0.707)$; $\lambda_2 = 0.6$, $\mathbf{e}_2' = (0.707,\ -0.707)$

Proportion of variance explained by the first PC:

Non-standardized: $100.16/101 = 0.992$; Standardized: $1.4/2 = 0.70$
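The numbers in this comparison can be reproduced with a few lines of numpy (eigenvector signs are arbitrary and may differ):

```python
import numpy as np

Sigma = np.array([[1.0, 4.0],
                  [4.0, 100.0]])
rho = np.array([[1.0, 0.4],
                [0.4, 1.0]])

for name, M in [("non-standardized", Sigma), ("standardized", rho)]:
    lam, E = np.linalg.eigh(M)
    lam, E = lam[::-1], E[:, ::-1]          # order largest eigenvalue first
    print(name, np.round(lam, 2), np.round(E[:, 0], 3),
          round(lam[0] / lam.sum(), 3))
# non-standardized: eigenvalues [100.16, 0.84], e1 ~ (0.040, 0.999), proportion 0.992
# standardized:     eigenvalues [1.4, 0.6],     e1 ~ (0.707, 0.707), proportion 0.700
```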

Estimation

In general we do not know what $\boldsymbol{\Sigma}$ is; we must estimate it from the sample. So what are our estimated principal components?

Assume we have a random sample $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n$. We can use the sample covariance matrix

$$\mathbf{S} = \frac{1}{n-1}\sum_{j=1}^{n}(\mathbf{X}_j - \bar{\mathbf{X}})(\mathbf{X}_j - \bar{\mathbf{X}})'$$

Eigenvalues of $\mathbf{S}$: $\hat\lambda_1 \ge \hat\lambda_2 \ge \cdots \ge \hat\lambda_p$ (consistent estimators of $\lambda_1, \lambda_2, \ldots, \lambda_p$)

Eigenvectors of $\mathbf{S}$: $\hat{\mathbf{e}}_1, \hat{\mathbf{e}}_2, \ldots, \hat{\mathbf{e}}_p$ (consistent estimators of $\mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_p$)

$i$-th estimated principal component: $\hat y_i = \hat{\mathbf{e}}_i'\mathbf{x}$
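A minimal sketch of the sample version, using a hypothetical simulated sample with subjects in rows; S, the estimated eigen-pairs, and the component scores follow directly:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.multivariate_normal(mean=[0, 0], cov=[[1, 4], [4, 100]], size=200)

# Sample covariance matrix S (rowvar=False: columns are variables)
S = np.cov(X, rowvar=False)

# Estimated eigenvalues / eigenvectors, ordered lambda_1 >= ... >= lambda_p
lam_hat, E_hat = np.linalg.eigh(S)
lam_hat, E_hat = lam_hat[::-1], E_hat[:, ::-1]

# Estimated principal component scores: y_hat_i = e_hat_i' x for each observation
scores = X @ E_hat
print(lam_hat, scores.shape)   # eigenvalues near (100.16, 0.84); scores is 200 x 2
```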

Sample Properties

In general we do not know what $\boldsymbol{\Sigma}$ is; we estimate it from the sample. So what are the properties of our estimated principal components?

1. Estimated variance of $\hat y_i$:

$$\widehat{\operatorname{Var}}(\hat y_i) = \frac{1}{n-1}\sum_{j=1}^{n}(\hat y_{ij} - \bar{\hat y}_i)^2 = \hat\lambda_i$$

2. Sample covariance and correlation for $\hat y_i, \hat y_k$:

$$\widehat{\operatorname{Cov}}(\hat y_i, \hat y_k) = 0, \qquad i \ne k$$

3. Proportion of total variance accounted for by $\hat y_k$:

$$\frac{\hat\lambda_k}{\hat\lambda_1 + \hat\lambda_2 + \cdots + \hat\lambda_p}$$

4. Estimated correlation between $\hat y_i$ and $x_k$:

$$r_{\hat y_i, x_k} = \frac{\sum_{j=1}^{n}(\hat y_{ij} - \bar{\hat y}_i)(x_{kj} - \bar x_k)}{\sqrt{\sum_{j=1}^{n}(\hat y_{ij} - \bar{\hat y}_i)^2}\sqrt{\sum_{j=1}^{n}(x_{kj} - \bar x_k)^2}} = \frac{\hat e_{ik}\sqrt{\hat\lambda_i}}{\sqrt{s_{kk}}}$$

Centering

We often center our observations before defining our PCs. The centered PCs are found according to:

$$\hat y_{ij} = \hat{\mathbf{e}}_i'(\mathbf{x}_j - \bar{\mathbf{x}}), \qquad i = 1, 2, \ldots, p, \quad j = 1, 2, \ldots, n$$

The sample mean of each centered component is zero:

$$\bar{\hat y}_i = \frac{1}{n}\sum_{j=1}^{n}\hat{\mathbf{e}}_i'(\mathbf{x}_j - \bar{\mathbf{x}}) = \hat{\mathbf{e}}_i'\left[\frac{1}{n}\sum_{j=1}^{n}(\mathbf{x}_j - \bar{\mathbf{x}})\right] = \hat{\mathbf{e}}_i'\mathbf{0} = 0$$
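A quick numerical check of the mean-zero property of the centered scores, again on hypothetical simulated data:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))

S = np.cov(X, rowvar=False)
lam_hat, E_hat = np.linalg.eigh(S)
E_hat = E_hat[:, ::-1]

# Centered PC scores: y_hat_ij = e_hat_i' (x_j - x_bar)
Xc = X - X.mean(axis=0)
scores = Xc @ E_hat

# Each centered component has sample mean (essentially) zero
print(np.round(scores.mean(axis=0), 12))   # ~ [0. 0. 0.]
```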

Example

Jolicoeur and Mosimann (1960) conducted a study looking at the relationship between the size and shape of painted turtle carapaces. We can develop PC's for the natural log of the length, width, and height of the female turtles' carapaces:

$$\mathbf{x}_j = \begin{pmatrix} x_{1j} \\ x_{2j} \\ x_{3j} \end{pmatrix} = \begin{pmatrix} \text{log carapace length} \\ \text{log carapace width} \\ \text{log carapace height} \end{pmatrix}, \qquad \mathbf{S} = \begin{pmatrix} .0264 & .0201 & .0249 \\ .0201 & .0162 & .0194 \\ .0249 & .0194 & .0249 \end{pmatrix}$$

$$\hat{\boldsymbol\lambda} = (.06623,\ .00077,\ .00054)$$

$$\hat{\mathbf{e}}_1 = \begin{pmatrix} .627 \\ .488 \\ .608 \end{pmatrix}, \qquad \hat{\mathbf{e}}_2 = \begin{pmatrix} .553 \\ .272 \\ -.788 \end{pmatrix}, \qquad \hat{\mathbf{e}}_3 = \begin{pmatrix} .550 \\ -.830 \\ .099 \end{pmatrix}$$

Example

The first PC is:

$$\hat y_1 = \hat{\mathbf{e}}_1'\mathbf{x} = 0.627\,\log(\text{length}) + 0.488\,\log(\text{width}) + 0.608\,\log(\text{height})$$

This might be interpreted as an overall size component: turtles with small shell dimensions have small values of $y_1$, and turtles with large shell dimensions have large values of $y_1$.

Example

The second PC is:

$$\hat y_2 = \hat{\mathbf{e}}_2'\mathbf{x} = 0.553\,\log(\text{length}) + 0.272\,\log(\text{width}) - 0.788\,\log(\text{height})$$

This emphasizes the contrast between the length and height of the shell.

Example

The third PC is:

$$\hat y_3 = \hat{\mathbf{e}}_3'\mathbf{x} = 0.550\,\log(\text{length}) - 0.830\,\log(\text{width}) + 0.099\,\log(\text{height})$$

This emphasizes the contrast between the width and length of the shell.

Example

Consider the proportion of variability accounted for by each PC:

$$\hat{\boldsymbol\lambda} = (.06623,\ .00077,\ .00054)$$
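Working through the arithmetic with these eigenvalues, the first component dominates:

$$\frac{\hat\lambda_1}{\hat\lambda_1 + \hat\lambda_2 + \hat\lambda_3} = \frac{0.06623}{0.06623 + 0.00077 + 0.00054} = \frac{0.06623}{0.06754} \approx 0.981$$

so the first PC accounts for about 98% of the total variance, while the second and third account for only about 1.1% and 0.8%, respectively.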

Example

How are the PCs correlated with each of the x's?

$$r_{\hat y_i, x_j} = \frac{\hat e_{ij}\sqrt{\hat\lambda_i}}{\sqrt{s_{jj}}}$$

Then:

Trait    $\hat y_1$    $\hat y_2$    $\hat y_3$
$x_1$     0.99          0.09         -0.08
$x_2$     0.99          0.06          0.15
$x_3$     0.99         -0.14         -0.01

For example:

$$r_{\hat y_1, x_1} = \frac{\hat e_{11}\sqrt{\hat\lambda_1}}{\sqrt{s_{11}}} = \frac{0.627\sqrt{0.06623}}{\sqrt{0.0264}} \approx 0.99$$

Interpretation of PCs

Consider data $x_1, x_2, \ldots, x_p$:

- PCs are actually projections onto the estimated eigenvectors
- The 1st PC is the one with the largest projection
- For data reduction, only use PCA if the eigenvalues vary; if the x's are uncorrelated, we can't really do data reduction

Let $\hat y_i = \hat{\mathbf{e}}_i'(\mathbf{x} - \bar{\mathbf{x}})$ and consider the contour

$$(\mathbf{x} - \bar{\mathbf{x}})'\mathbf{S}^{-1}(\mathbf{x} - \bar{\mathbf{x}}) = c^2$$

This contour mimics the density of $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, and $\hat y_i = \hat{\mathbf{e}}_i'(\mathbf{x} - \bar{\mathbf{x}})$ is the length of the projection of $\mathbf{x} - \bar{\mathbf{x}}$ in the direction of $\hat{\mathbf{e}}_i$.

Choosing Number of PCs

Often the goal of PCA is dimension reduction of the data: select a limited number of PCs that capture the majority of the variability in the data.

How do we decide how many PCs to include?

1. Scree plot: plot of $\hat\lambda_i$ versus $i$
2. Select all PCs with $\hat\lambda_i > 1$ (for standardized observations)
3. Choose some proportion of the variance you want to account for

Scree Plots

[Figure: example scree plots of $\hat\lambda_i$ versus component number $i$.]
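A minimal matplotlib sketch of a scree plot, here using the turtle-example eigenvalues (any vector of estimated eigenvalues could be substituted):

```python
import numpy as np
import matplotlib.pyplot as plt

lam_hat = np.array([0.06623, 0.00077, 0.00054])   # turtle example eigenvalues

i = np.arange(1, len(lam_hat) + 1)
plt.plot(i, lam_hat, marker="o")
plt.xticks(i)
plt.xlabel("component number i")
plt.ylabel(r"eigenvalue $\hat\lambda_i$")
plt.title("Scree plot")
plt.show()
```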

Choosing Number of PCs

Should principal components that only account for a small proportion of variance always be ignored?

Not necessarily; they may indicate near-perfect collinearities among traits.

In the turtle example this is true: very little of the variation in the shell measurements can be attributed to the 2nd and 3rd components.

Large Sample Properties

If n is large, there are nice properties we can use. Let

$$\mathbf{S} = \frac{1}{n-1}\sum_{j=1}^{n}(\mathbf{x}_j - \bar{\mathbf{x}})(\mathbf{x}_j - \bar{\mathbf{x}})'$$

with eigenvalues $\hat{\boldsymbol\lambda} = (\hat\lambda_1, \hat\lambda_2, \ldots, \hat\lambda_p)'$, and assume $\lambda_1 > \lambda_2 > \cdots > \lambda_p > 0$.

For large n:

$$\sqrt{n}(\hat{\boldsymbol\lambda} - \boldsymbol\lambda) \xrightarrow{D} N_p(\mathbf{0},\ 2\boldsymbol{\Lambda}^2), \qquad \boldsymbol{\Lambda} = \operatorname{diag}(\lambda_1, \ldots, \lambda_p)$$

where:

a. the estimated eigenvalues of $\boldsymbol{\Sigma}$ are asymptotically independent

b. the distribution of $\hat\lambda_i$ is approximately $N(\lambda_i,\ 2\lambda_i^2/n)$

c. an approximate $(1-\alpha)$ CI for $\lambda_i$ is:

$$\frac{\hat\lambda_i}{1 + z_{\alpha/2}\sqrt{2/n}} \le \lambda_i \le \frac{\hat\lambda_i}{1 - z_{\alpha/2}\sqrt{2/n}}$$

d. an alternative approximation: $\ln\hat\lambda_i \sim N(\ln\lambda_i,\ 2/n)$
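As a small illustration of the interval in (c), here is a sketch using the turtle example's largest eigenvalue; the sample size n = 24 and the 95% level are assumptions for illustration only (the slides do not state n):

```python
import numpy as np

lam1_hat = 0.06623    # largest estimated eigenvalue (turtle example)
n = 24                # assumed sample size, for illustration only
z = 1.96              # z_{alpha/2} for a 95% interval

half = z * np.sqrt(2.0 / n)
lower = lam1_hat / (1 + half)
upper = lam1_hat / (1 - half)
print(lower, upper)   # approximate 95% CI for lambda_1
```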

Large Sample Properties

Also, for our estimated eigenvectors (these results assume that $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n$ are $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$):

1. For large n:

$$\sqrt{n}(\hat{\mathbf{e}}_i - \mathbf{e}_i) \xrightarrow{D} N_p(\mathbf{0},\ \mathbf{E}_i), \qquad \text{where } \mathbf{E}_i = \lambda_i\sum_{k \ne i}\frac{\lambda_k}{(\lambda_k - \lambda_i)^2}\mathbf{e}_k\mathbf{e}_k'$$

2. For large n, $\hat\lambda_i$ is approximately independent of the distribution of $\hat{\mathbf{e}}_i$

Summary

Principal component analysis is most useful for dimensionality reduction.

It can also be used to identify collinear variables; use of PCA in a regression setting is therefore one way to handle multicollinearity.

A caveat: principal components can be difficult to interpret and should therefore be used with caution.