
Data mining and statistical learning, lecture 4

Outline

Regression on a large number of correlated inputs

A few comments about shrinkage methods, such as ridge regression

Methods using derived input directions
- Principal components regression
- Partial least squares regression (PLS)


Partitioning of the expected squared prediction error

$E\big[(y_j - \hat{y}_j)^2\big] = \underbrace{\big(E(y_j) - E(\hat{y}_j)\big)^2}_{\text{squared bias}} + \mathrm{Var}(y_j - \hat{y}_j)$

Shrinkage decreases the variance but increases the bias

Shrinkage methods are more robust to structural changes in the analysed data
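This partition is the identity $E[X^2] = (E[X])^2 + \mathrm{Var}(X)$ applied to the prediction error, a one-line consequence of the definition of variance:

$\mathrm{Var}(X) = E[X^2] - (E[X])^2 \;\Rightarrow\; E[X^2] = (E[X])^2 + \mathrm{Var}(X), \qquad X = y_j - \hat{y}_j$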


Advantages of ridge regression over OLS

The models are easier to comprehend because strongly correlated inputs tend to get similar regression coefficients

Generalization to new data sets is facilitated by a greater robustness to structural changes in the analysed data set
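Below is a minimal sketch of ridge regression in SAS via PROC REG's RIDGE= option; the data set mydata and the variables y and x1-x10 are hypothetical placeholder names:

/* Fit ridge regression for a grid of ridge constants k */
proc reg data=mydata outest=ridge_est outvif ridge=0 to 0.1 by 0.01;
   model y = x1-x10;
run;

/* ridge_est contains one row of coefficients (and VIFs) per value of k */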


Ridge regression

- a note on standardization

The principal components and the shrinkage in ridge regression are scale-dependent.

Inputs are normally standardized to mean zero and variance one prior to the regression
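A minimal sketch of this standardization step in SAS (rawdata, stddata and x1-x10 are hypothetical names); PROC STDIZE with METHOD=STD rescales each input to mean zero and variance one:

/* Standardize the inputs to mean zero and variance one */
proc stdize data=rawdata out=stddata method=std;
   var x1-x10;
run;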


Regression methods using derived input directions

Extract linear combinations of the inputs as derived features, and then model the target (response) as a linear function of these features

$z_m = \alpha_{0m} + \boldsymbol{\alpha}_m^T \mathbf{x}, \qquad m = 1, \dots, M$

[Diagram: inputs x1, x2, …, xp are combined into derived features z1, z2, …, zM, which are used to predict the response y]

$y = \beta_0 + \boldsymbol{\beta}^T \mathbf{z}$


Absorbance records for ten samples of chopped meat

[Figure: absorbance vs. channel (1-100) for Sample_1 through Sample_10; absorbance ranges from 0.0 to 5.0]

1 response variable (fat)

100 predictors (absorbance at 100 wavelengths or channels)

The predictors are strongly correlated with each other
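The strength of these correlations can be checked directly; a hedged sketch in SAS for a few of the channels, with the data set name taken from the PROC PLS example at the end of the lecture:

/* Pairwise correlations among selected channels */
proc corr data=mining.tecatorscores;
   var channel1 channel2 channel50 channel100;
run;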


Absorbance records for ten samples of chopped meat

[Figure: absorbance vs. channel (1-100) for ten selected samples (Sample_12, Sample_48, Sample_133, Sample_145, Sample_176, Sample_186, Sample_215, Sample_43, Sample_44, Sample_45); high-fat and low-fat samples are marked as separate groups]


3-D plots of absorbance records for samples of meat

- channels 1, 50 and 100

[Figure: 3D scatterplot of Channel1 vs Channel50 vs Channel100]


3-D plots of absorbance records for samples of meat

- channels 40, 50 and 60

[Figure: 3D scatterplot of Channel60 vs Channel50 vs Channel40]


3-D plot of absorbance records for samples of meat

- channels 49, 50 and 51

[Figure: 3D scatterplot of Channel49 vs Channel50 vs Channel51]


Matrix plot of absorbance records for samples of meat

- channels 1, 50 and 100

[Figure: matrix plot of Channel1, Channel50 and Channel100]


Principal Component Analysis (PCA)

• PCA is a technique for reducing the complexity of high dimensional data

• It can be used to approximate high dimensional data with a few dimensions so that important features can be visually examined


Principal Component Analysis - two inputs

[Figure: scatterplot of X1 vs X2 with the principal component directions PC1 and PC2 marked]


3-D plot of artificially generated data

- three inputs

[Figure: 3D surface plot of z vs y, x with the directions PC1 and PC2 marked]


Principal Component Analysis

The first principal component (PC1) is the direction that maximizes the variance of the projected data

The second principal component (PC2) is the direction that maximizes the variance of the projected data after the variation along PC1 has been removed

The third principal component (PC3) is the direction that maximizes the variance of the projected data after the variation along PC1 and PC2 has been removed
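In symbols (the standard formulation, stated here for reference), with the centred inputs collected in a matrix $X$:

$v_1 = \arg\max_{\|v\|=1} \mathrm{Var}(Xv), \qquad v_k = \arg\max_{\|v\|=1,\; v \perp v_1, \dots, v_{k-1}} \mathrm{Var}(Xv), \quad k = 2, 3, \dots$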


Eigenvector and eigenvalue

In this shear transformation of the Mona Lisa, the picture was deformed in such a way that its central vertical axis (red vector) was not modified, but the diagonal vector (blue) changed direction. Hence the red vector is an eigenvector of the transformation and the blue vector is not. Since the red vector was neither stretched nor compressed, its eigenvalue is 1.
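Formally, a nonzero vector $v$ is an eigenvector of a square matrix $A$, with eigenvalue $\lambda$, if

$Av = \lambda v$

that is, the transformation only rescales $v$ by the factor $\lambda$; for the red vector in the picture, $\lambda = 1$.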


Sample covariance matrix

$S = \begin{pmatrix} s_{11} & \cdots & s_{1m} \\ \vdots & \ddots & \vdots \\ s_{m1} & \cdots & s_{mm} \end{pmatrix}$

where

$s_{ij} = \frac{1}{n-1} \sum_{k=1}^{n} (x_{ki} - \bar{x}_i)(x_{kj} - \bar{x}_j), \qquad i = 1, \dots, m, \; j = 1, \dots, m$


Eigenvectors of covariance and correlation matrices

The eigenvectors of a covariance matrix provide information about the major orthogonal directions of the variation in the inputs

The eigenvalues provide information about the strength of the variation along the different eigenvectors

The eigenvectors and eigenvalues of the correlation matrix provide scale-independent information about the variation of the inputs
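This can be stated compactly (standard linear algebra, for reference): the sample covariance matrix factorizes as

$S = V \Lambda V^T, \qquad \Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_m), \quad \lambda_1 \ge \dots \ge \lambda_m \ge 0$

where the columns of $V$ are orthonormal eigenvectors, and the variance of the data projected onto the $k$-th eigenvector equals $\lambda_k$.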


Principal Component Analysis

Eigenanalysis of the Covariance Matrix

Eigenvalue 2.8162 0.3835

Proportion 0.880 0.120

Cumulative 0.880 1.000

Variable PC1 PC2

X1 0.523 0.852

X2 0.852 -0.523
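The Proportion row is each eigenvalue divided by the total variance; for PC1 above,

$\frac{2.8162}{2.8162 + 0.3835} = \frac{2.8162}{3.1997} \approx 0.880$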

[Figure: scatterplot of X1 vs X2 for the same two-input data]

The PC1 and PC2 columns of the table are the loadings.


Principal Component Analysis

[Figure: score plot of X1-X2; first component on the x-axis, second component on the y-axis]

Coordinates in the coordinate system determined by the principal components


Principal Component Analysis

Eigenanalysis of the Covariance Matrix

Eigenvalue   1.6502   0.7456   0.0075
Proportion   0.687    0.310    0.003
Cumulative   0.687    0.997    1.000

Variable   PC1      PC2      PC3
x          0.887    0.218   -0.407
y          0.034   -0.909   -0.414
z          0.460   -0.354    0.814

[Figure: 3D surface plot of z vs y, x]


Scree plot

[Figure: scree plot of x, ..., z; eigenvalue vs component number]


Principal Component Analysis - absorbance data from samples of chopped meat

Eigenanalysis of the Covariance Matrix

Eigenvalue   26.127   0.239   0.078   0.030   0.002   0.001   0.000   0.000   0.000
Proportion    0.987   0.009   0.003   0.001   0.000   0.000   0.000   0.000   0.000
Cumulative    0.987   0.996   0.999   1.000   1.000   1.000   1.000   1.000   1.000


Scree plot - absorbance data

[Figure: scree plot of Channel1, ..., Channel100; eigenvalue vs component number]

One direction is responsible for most of the variation in the inputs


Loadings of PC1, PC2 and PC3 - absorbance data

[Figure: loadings of PC1, PC2 and PC3 plotted against channel number]

The loadings define derived inputs (linear combinations of the inputs)
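For instance, with the loadings from the two-input example earlier in the lecture, the first derived input is

$z_1 = 0.523\, x_1 + 0.852\, x_2$

evaluated on the centred inputs.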


Software recommendations

Minitab 15: Stat > Multivariate > Principal Components

SAS Enterprise Miner: Princomp / Dmneural
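In base SAS/STAT, PROC PRINCOMP performs the same eigenanalysis; a minimal sketch for the absorbance data, with the data set name taken from the PROC PLS example below and pcascores a hypothetical output name:

/* Eigenanalysis of the covariance matrix of the 100 channels */
proc princomp data=mining.tecatorscores cov out=pcascores;
   var channel1-channel100;
run;

The COV option requests analysis of the covariance matrix (the default is the correlation matrix), and OUT= stores the principal component scores.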


Regression methods using derived input directions

- Partial Least Squares Regression

Extract linear combinations of the inputs as derived features, and then model the target (response) as a linear function of these features

[Diagram: inputs x1, x2, …, xp are combined into derived features z1, z2, …, zM, which are used to predict the response y]

Select the intermediates so that the covariance with the response variable is maximized

Normally, the inputs are standardized to mean zero and variance one prior to the PLS analysis
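In symbols (one standard formulation, stated here for reference), the first PLS direction solves

$\hat{\varphi}_1 = \arg\max_{\|\varphi\|=1} \mathrm{Cov}(X\varphi,\, y)$

and each subsequent direction solves the same problem after the inputs have been orthogonalized against the earlier z-vectors.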


Partial least squares regression (PLS)

Step 1: Standardize inputs to mean zero and variance one

Step 2: Compute the first derived input by setting

$z_1 = \sum_{j=1}^{p} \hat{\varphi}_{1j}\, x_j$

where the $\hat{\varphi}_{1j}$ are standardized univariate regression coefficients of the response vs each of the inputs

Repeat:
Remove the variation in the inputs along the directions determined by existing z-vectors
Compute another derived input
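The removal step can be written explicitly (the standard orthogonalization, stated for reference): each input is regressed on the current derived input and replaced by its residual,

$x_j^{(m+1)} = x_j^{(m)} - \frac{\langle z_m, x_j^{(m)} \rangle}{\langle z_m, z_m \rangle}\, z_m, \qquad j = 1, \dots, p$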


Methods using derived input directions

Principal components regression (PCR)
The derived directions are determined by the X-matrix alone, and are orthogonal

Partial least squares regression (PLS)
The derived directions are determined by the covariance of the output and linear combinations of the inputs, and are orthogonal


PLS in SAS

The following statements are available in PROC PLS. Items within the brackets < > are optional.

PROC PLS < options > ;

BY variables ;

CLASS variables < / option > ;

MODEL dependent-variables = effects < / options > ;

OUTPUT OUT= SAS-data-set < options > ;

To analyze a data set, you must use the PROC PLS and MODEL statements. You can use the other statements as needed.


proc PLS in SAS

/* Partial least squares with 10 factors */
proc pls data=mining.tecatorscores method=pls nfac=10;
   model fat=channel1-channel100;
   output out=tecatorpls predicted=predpls;
run;

/* Principal components regression with 10 components */
proc pls data=mining.tecatorscores method=pcr nfac=10;
   model fat=channel1-channel100;
   output out=tecatorpcr predicted=predpcr;
run;
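The number of factors need not be fixed in advance; a hedged sketch using PROC PLS's built-in leave-one-out cross-validation (CV=ONE) to choose it:

/* Let cross-validation pick the number of PLS factors */
proc pls data=mining.tecatorscores method=pls cv=one;
   model fat=channel1-channel100;
run;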