PATTERN RECOGNITION : PRINCIPAL COMPONENTS ANALYSIS

Richard Brereton

r.g.brereton@bris.ac.uk


NEED FOR PATTERN RECOGNITION

•Exploratory data analysis

e.g. PCA

•Unsupervised pattern recognition

e.g. Cluster analysis

•Supervised pattern recognition

e.g. Classification


Case study

Coupled chromatography in HPLC : profile

[Figure: HPLC profile, horizontal axis 1 to 26, vertical axis 0 to 14]


MULTIVARIATE DATA

Time : rows

Wavelength : columns


DATA MATRICES

The rows do not need to correspond to elution times in chromatography: they can be any type of sample

• Blood sample

• Wood

• Chromatograms

• Samples from a reaction mixture

• Chromatographic columns


The columns do not need to correspond to spectral wavelengths: they can be any type of measurement

• NMR peak heights

• Atomic spectroscopy measurements of elements

• Chromatographic intensities

• Concentrations of compounds in a mixture

• Results of chromatographic tests


Return to example of chromatography.

Rows : elution times

Columns : wavelengths


[Diagram: the data matrix X drawn as the product of C and S, plus the error matrix E]

Chemical factors : X = C.S + E
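As a rough illustration of X = C.S + E, the sketch below simulates a small two-compound dataset. Everything numerical here is an assumption made for the example: the Gaussian peak shapes, the peak positions and widths, the noise level, and the choice of two compounds. The 30 rows and 28 columns mirror the elution times and wavelengths of the case study.

```python
import numpy as np

def gaussian(x, centre, width):
    """Gaussian peak of unit height (an assumed peak shape)."""
    return np.exp(-0.5 * ((x - centre) / width) ** 2)

rng = np.random.default_rng(0)
times = np.arange(1, 31)                  # 30 elution times (rows)
wavelengths = np.linspace(220, 349, 28)   # 28 wavelengths (columns)

# C: elution profiles, one column per compound (30 x 2), hypothetical peaks
C = np.column_stack([gaussian(times, 12, 3), gaussian(times, 17, 3)])
# S: spectra, one row per compound (2 x 28), hypothetical maxima
S = np.vstack([gaussian(wavelengths, 225, 20), gaussian(wavelengths, 301, 20)])
# E: small random measurement error (30 x 28), assumed noise level
E = 0.01 * rng.standard_normal((30, 28))

X = C @ S + E
print(X.shape)
```

Without the error term, the simulated matrix has rank 2, one for each compound.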


It would be nice to look at the chemical factors underlying the chromatogram. We can use mathematical methods to do this.


ABSTRACT FACTORS : PRINCIPAL COMPONENTS

[Diagram: the PCA TRANSFORMATION turns the CHROMATOGRAM into SCORES and LOADINGS, shown alongside the ELUTION PROFILES and SPECTRA]


X = T . P + E = C . S + E

T are called scores : these correspond to elution profiles

P are called loadings : these correspond to spectra

Ideally the “size” of T and P equals the number of compounds in the mixture.

This “size” equals the number of principal components, e.g. 1, 2, 3 etc.

Each PC has an associated scores vector (column of T), and loadings vector (row of P).


[Diagram: PCA decomposes the data X (I rows, J columns) into scores T (I rows, A columns) and loadings P (A rows, J columns)]


Hence if the original data matrix has dimensions 30 × 28 (or I × J) (= 30 elution times and 28 wavelengths - or 30 blood samples and 28 compound concentrations - or 30 chromatographic columns and 28 tests) and if the number of PCs is denoted by A, then

•the dimensions of T will be 30 × A, and

•the dimensions of P will be A × 28.
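A minimal sketch of how T and P might be computed and their dimensions checked. The slides do not name an algorithm, so the use of a truncated singular value decomposition is an assumption, as are the helper name `pca`, the random test data, and the decision not to centre the columns.

```python
import numpy as np

def pca(X, A):
    """Return scores T (I x A) and loadings P (A x J) from a truncated SVD.

    Hypothetical helper: SVD is one common route to PCA, not necessarily
    the one used to produce the plots in these slides.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    T = U[:, :A] * s[:A]   # scores: left singular vectors scaled by singular values
    P = Vt[:A, :]          # loadings: right singular vectors, one row per PC
    return T, P

rng = np.random.default_rng(1)
# Hypothetical 30 x 28 data: two underlying factors plus small noise
X = rng.random((30, 2)) @ rng.random((2, 28)) + 0.001 * rng.standard_normal((30, 28))

T, P = pca(X, A=2)
E = X - T @ P              # residual left after two PCs
print(T.shape, P.shape)
print(np.linalg.norm(E))   # small compared with the size of X
```

With A = 2 the product T.P reproduces X up to roughly the level of the added noise, which is the sense in which X = T.P + E.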


[Diagram: PCA transforms a samples × variables matrix into a samples × scores matrix]


A major reason for performing PCA is data simplification.

Datasets are often very complex: it is possible to make many measurements, but there are only a few underlying factors.

“See the wood for the trees”. We will look at this in more detail later.


SCORES AND LOADINGS HAVE SPECIAL MATHEMATICAL PROPERTIES

•Scores and loadings are orthogonal.

What does this mean?

•Loadings are normalised.

What does this mean?

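Scores are orthogonal: $\sum_{i=1}^{I} t_{ia} t_{ib} = 0$ for $a \neq b$

Loadings are orthogonal: $\sum_{j=1}^{J} p_{aj} p_{bj} = 0$ for $a \neq b$

Loadings are normalised: $\sum_{j=1}^{J} p_{aj}^{2} = 1$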
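These properties can be checked numerically. The sketch below assumes scores and loadings obtained from a singular value decomposition of an arbitrary random matrix; the data and the choice of A = 3 are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.random((30, 28))                 # hypothetical data matrix
U, s, Vt = np.linalg.svd(X, full_matrices=False)

A = 3
T = U[:, :A] * s[:A]                     # scores, 30 x A
P = Vt[:A, :]                            # loadings, A x 28

TtT = T.T @ T    # entry (a, b) is the sum over i of t_ia * t_ib
PPt = P @ P.T    # entry (a, b) is the sum over j of p_aj * p_bj

off_diagonal = ~np.eye(A, dtype=bool)
print(np.allclose(TtT[off_diagonal], 0))   # scores orthogonal
print(np.allclose(PPt, np.eye(A)))         # loadings orthogonal and normalised
```

The diagonal of `TtT` is not 1: scores are orthogonal but not normalised, and their sums of squares carry the size of each PC.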


PCA is an abstract concept.

Theory. Non-mathematical

Spectrum recorded at different concentrations and several wavelengths; wavelength 6 versus 9 : six spectra.

[Figure: the six spectra plotted over wavelengths 1 to 20]


[Figure: the six spectra plotted as points, wavelength 6 versus wavelength 9]


Each spectrum becomes ONE POINT IN 2 DIMENSIONAL SPACE

(2D = 2 wavelengths)

Spectra

•Fall on a straight line which is the FIRST PRINCIPAL COMPONENT

•The line has a DIRECTION often called the LOADINGS corresponding to the SPECTRAL CHARACTERISTICS

•Each spectrum has a DISTANCE along the line often called the SCORES corresponding to CONCENTRATION
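The geometric picture can be reproduced with a few lines of code. In this sketch the six concentrations and the absorbances at the two wavelengths are invented, the data are error-free, and the line is taken to pass through the origin (no centring), all of which are assumptions for illustration.

```python
import numpy as np

conc = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])  # six hypothetical concentrations
s = np.array([0.6, 0.8])       # hypothetical absorbances at the two wavelengths (unit length)
X = np.outer(conc, s)          # six spectra, each one point in 2D space

U, sv, Vt = np.linalg.svd(X, full_matrices=False)
loadings = Vt[0]               # direction of the line
scores = U[:, 0] * sv[0]       # distance of each spectrum along the line

# The sign of a PC is arbitrary; flip it so the direction points to positive values
if loadings.sum() < 0:
    loadings, scores = -loadings, -scores

print(loadings)
print(scores)
```

Because the assumed spectral direction has unit length, the scores equal the concentrations and the loadings equal the absorbances; the second singular value is zero to within rounding, so a single PC describes the data.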


EXTENSIONS TO THE IDEA

 

1. Measurement error

2. Several wavelengths

3. Several compounds


Measurement error

[Figure: the points of the two-wavelength plot scattered about a straight line]

•Best fit straight line - statistics

•Two PCs - the second relates to the error around the straight line
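A sketch of the effect of measurement error, assuming the same kind of invented two-wavelength data with Gaussian noise of an arbitrary size added. The first PC still follows the underlying line; the second picks up only the scatter around it.

```python
import numpy as np

rng = np.random.default_rng(3)
conc = np.arange(1.0, 7.0)               # six hypothetical concentrations
s = np.array([0.6, 0.8])                 # hypothetical spectral direction
noise = 0.05 * rng.standard_normal((6, 2))   # assumed measurement error
X = np.outer(conc, s) + noise

U, sv, Vt = np.linalg.svd(X, full_matrices=False)
eigenvalues = sv ** 2
fraction = eigenvalues / eigenvalues.sum()

print(fraction)            # share of the sum of squares carried by PC1 and PC2
print(abs(Vt[0] @ s))      # close to 1: PC1 is nearly parallel to the true direction
```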


Several wavelengths

 

•Now no longer a point in 2 dimensional space.

•Typical spectrum. Several thousand wavelengths

• The number of dimensions equals the number of wavelengths.

•The spectra still fall (roughly) on a straight line.

•A point in 1000 dimensional space.


Several compounds

Two compounds, two wavelengths.

[Figure: two-wavelength plot with the compounds labelled A and B]
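With two compounds the points no longer fall on one line. The sketch below uses invented absorbances for compounds A and B and invented mixture concentrations; it only illustrates that two PCs are then needed.

```python
import numpy as np

# Hypothetical absorbances: rows are compounds A and B, columns the two wavelengths
S = np.array([[0.9, 0.2],
              [0.3, 0.8]])
# Hypothetical concentrations of A and B in six samples
C = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 1.0],
              [0.0, 2.0], [1.0, 1.0], [2.0, 1.0]])
X = C @ S

sv = np.linalg.svd(X, compute_uv=False)
print(np.linalg.matrix_rank(X))   # 2
print(sv[1] / sv[0])              # the second PC is far from negligible
```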


RANK AND EIGENVALUE

How many PCs describe a dataset?

Often unknown

•How many compounds in a series of mixtures?

•How many sources of pollution?

•How many compounds in a reaction mixture?

•Sometimes just statistical concept.

•Sometimes mixture of physical and chemical factors, e.g. a reaction mixture : compounds, temperature etc.


EVERY PRINCIPAL COMPONENT HAS A CORRESPONDING EIGENVALUE

•The eigenvalue of each PC equals the sum of squares of its scores vector.

•The more important the PC the bigger the eigenvalue.

•The sum of the eigenvalues should never exceed the sum of squares of the original matrix.

•The sum of the eigenvalues of all significant PCs should approximate to the sum of squares of the original matrix.
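These statements about eigenvalues can be checked on simulated data. The sketch assumes three underlying factors, a small amount of noise, and an SVD-based PCA without centring; none of these choices comes from the slides.

```python
import numpy as np

rng = np.random.default_rng(4)
# Hypothetical data: three underlying factors plus small noise
X = rng.random((30, 3)) @ rng.random((3, 28)) + 0.01 * rng.standard_normal((30, 28))

U, sv, Vt = np.linalg.svd(X, full_matrices=False)
T = U * sv                               # scores for all PCs
eigenvalues = (T ** 2).sum(axis=0)       # sum of squares of each scores vector
total_ss = (X ** 2).sum()                # sum of squares of the original matrix

print(np.allclose(eigenvalues, sv ** 2))   # eigenvalue = squared singular value
print(eigenvalues[:3].sum() / total_ss)    # the three significant PCs account for almost everything
print(eigenvalues.sum() / total_ss)        # all PCs together account for the whole sum of squares
```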


RESIDUAL SUM OF SQUARES : decreases as the number of eigenvalues increases.

[Figure: log eigenvalue versus component number, components 1 to 7]

Cut off?
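A sketch of the quantities behind such a plot, assuming simulated data with two real factors and an arbitrary noise level. The residual sum of squares falls as PCs are added, and the log eigenvalue drops sharply after the last significant component, which is one informal way to look for a cut off.

```python
import numpy as np

rng = np.random.default_rng(5)
# Hypothetical data: two underlying factors plus small noise
X = rng.random((30, 2)) @ rng.random((2, 28)) + 0.01 * rng.standard_normal((30, 28))

sv = np.linalg.svd(X, compute_uv=False)
eigenvalues = sv ** 2
total_ss = (X ** 2).sum()
residual_ss = total_ss - np.cumsum(eigenvalues)   # what is left after 1, 2, 3, ... PCs
log_eig = np.log10(eigenvalues)

# Component number, log eigenvalue, residual sum of squares
for a in range(7):
    print(a + 1, round(log_eig[a], 2), round(residual_ss[a], 4))
```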


SEVERAL OTHER APPROACHES FOR DETERMINING THE NUMBER OF SIGNIFICANT EIGENVALUES.


SUMMARY SO FAR

PCA

• Principal components – how many?

• Scores

• Loadings

• Eigenvalues


GRAPHIC DISPLAY OF PCS

SCORES PLOT

PC2 VERSUS PC1

[Figure: scores plot of PC2 against PC1, with points labelled by elution time number]


SCORES AGAINST TIME

PC1 AND PC2 VERSUS TIME

[Figure: scores of PC1 and PC2 plotted against time]


LOADINGS PLOT

PC2 VERSUS PC1

[Figure: loadings plot of PC2 against PC1, with points labelled by wavelength from 220 to 349]


FOR REFERENCE : pure spectra

[Figure: pure spectra over the wavelength range 220 to 340, with 225 and 301 marked]


LOADINGS AGAINST WAVELENGTH

PC1 AND PC2 VERSUS WAVELENGTH

[Figure: loadings of PC1 and PC2 plotted against wavelength, 220 to 340]


BIPLOTS : SUPERIMPOSING SCORES AND LOADINGS PLOTS

[Figure: biplot with scores labelled by elution time number and loadings labelled by wavelength from 220 to 349]


MANY OTHER PLOTS

•Not only PC2 versus 1, also PC3 versus 1, PC3 versus 2 etc.

•3D PC plots, 3 axes, rotation etc.

•Loadings and scores are sometimes presented as bar graphs, since the samples or variables do not always have a sequential meaning.

•Plots of eigenvalues against component number


DATA SCALING AND PREPROCESSING

Influences appearance of plots

• Column centring – common in traditional statistics

• Standardisation of columns – subtract mean and divide by standard deviation.

If the data are of different types or on different absolute scales, this is an essential technique

• Row scaling – to constant total
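The three preprocessing steps listed above can be sketched as follows. The small matrix is invented, and dividing by the population standard deviation (NumPy's default, `ddof=0`) rather than the sample standard deviation is an assumed choice.

```python
import numpy as np

# Hypothetical data: 3 samples (rows) by 3 variables (columns)
X = np.array([[1.0, 10.0, 100.0],
              [2.0, 30.0, 200.0],
              [3.0, 20.0, 600.0]])

# Column centring: subtract each column mean
centred = X - X.mean(axis=0)

# Standardisation: subtract the column mean and divide by the column standard deviation
standardised = (X - X.mean(axis=0)) / X.std(axis=0)   # population sd, an assumed choice

# Row scaling to constant total: each row sums to 1
row_scaled = X / X.sum(axis=1, keepdims=True)

print(centred.mean(axis=0))
print(standardised.std(axis=0))
print(row_scaled.sum(axis=1))
```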


ANOTHER EXAMPLE

Grouping of elements from fundamental properties using PCA.


Step 1 : standardise the data.

Why? The variables are on different scales.
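A sketch of why the scale matters, using two invented variables rather than the element data: one of order units and one of order thousands. Without standardisation the first PC is almost entirely the large-scale variable; after standardisation both variables contribute equally. The helper name `first_loadings` and all numbers are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(6)
# Two hypothetical variables on very different scales, 20 samples
small = rng.normal(2.0, 0.5, size=20)        # of order units
large = rng.normal(1500.0, 800.0, size=20)   # of order thousands
X = np.column_stack([small, large])

def first_loadings(M):
    """Absolute loadings of PC1 after column centring (hypothetical helper)."""
    centred = M - M.mean(axis=0)
    _, _, Vt = np.linalg.svd(centred, full_matrices=False)
    return np.abs(Vt[0])

raw = first_loadings(X)
std = first_loadings((X - X.mean(axis=0)) / X.std(axis=0))

print(raw)   # dominated by the large-scale variable
print(std)   # both variables weighted equally
```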


PERFORM PCA : Choose the first two PCs

Scores plot

[Figure: scores plot of the first two PCs, with points labelled by element: Ti, Pb, Bi, Ni, Mn, Fe, Cu, Co, Zn, Xe, Kr, Ar, Ne, He, I, Br, Cl, F, Sr, Ca, Mg, Be, Cs, Rb, K, Na, Li]


Loadings plot

[Figure: loadings plot with points labelled by variable: electronegativity, oxidation number, density, boiling point, melting point]


SUMMARY

•Many types of plot from PCA.

•Interpretation of the plots.

•Preprocessing important.