35
Extracting information from spectral data. Nicole Labbé, University of Tennessee SWST, Advanced analytical tools for the wood industry June 10, 2007

Extracting information from spectral data

Embed Size (px)

Citation preview

Page 1: Extracting information from spectral data

Extracting information from spectral data.

Nicole Labbé, University of Tennessee

SWST, Advanced analytical tools for the wood industry June 10, 2007

Page 2: Extracting information from spectral data

Data collection

Near Infrared spectra2150 data points, 350-2500

nm, 1 nm resolution, 8 scans

Mid Infrared spectra3400 data points, 4000-600 cm-¹,

1 cm-¹ resolution, 4 scans

Laser Induced Breakdown spectra30000 data points, 200-800 nm,

0.02 nm resolution, 10 scans

W avenumber (cm-1)

8001600240032004000

W a v e le n g th (n m )

4 0 0 8 0 0 1 2 0 0 1 6 0 0 2 0 0 0 2 4 0 0

Wavelength (nm)

200 300 400 500 600 700 800

Page 3: Extracting information from spectral data

Wavelength

Abs

orba

nce

Prior to the extraction of the information.Signal processing is used to transform spectral data prior to analysis

Data pretreatment

-Local filters

-Smoothing

-Derivatives

-Baseline correction

-Multiplicative Scatter Correction (MSC)

-Orthogonal Scatter Correction (OSC)

W avelength (nm)

Sig

nal

Noisy signal of an absorption bandSmoothed with a moving average filter

Wavelength

Abso

rban

ce

Raw spectral data

Corrected spectral data after MSC

Page 4: Extracting information from spectral data

1. Qualitative informationGrouping and classification of spectral objects from samples into supervised and non-supervised learning methods.

2. Quantitative informationRelationships between spectral data and parameter(s) of interest

How to extract the information?1. Multivariate analysis (MVA)

Principal Component Analysis (PCA), Projection to Latent Structures (PLS), PLS-Discriminant Analysis (PLS-DA), …

2. Two dimensional correlation spectroscopyHomo-correlation, Hetero-correlation

What type of information?

Page 5: Extracting information from spectral data

Separating the signal from the noise in data and presenting the results as easily interpretable plots. Why are multivariate methods needed?

Multivariate data analysis

-Large data sets-Problem of selectivityRelationship between two variables: very simple, but does not always workMany problems where several predictor variables used in combination give better results

Wavelength (nm)

800 1200 1600 2000 2400

Abso

rban

ce

0.0

0.5

1.0

1.5

2.0

2.5

3.0

375 x-variables

Near infrared spectra collected on 70 pine samples

Two approaches:-Univariate analysis-Multivariate analysis

Page 6: Extracting information from spectral data

Measured cellulose content (%)

25 30 35 40 45 50

Pred

icte

d C

ellu

lose

con

tent

(%)

25

30

35

40

45

50

Measured cellulose content versus predicted cellulose content using one

variable (1530 nm) as a predictor (R2 = 0.12)

Measured cellulose content versus predicted cellulose content using multivariate method based on all variables from the spectral data

(R2= 0.87)

Univariate analysis

Multivariate analysis

Measured cellulose content (%)

25 30 35 40 45 50

Pre

dict

ed c

ellu

lose

con

tent

(%)

25

30

35

40

45

50

Page 7: Extracting information from spectral data

Qualitative informationPrincipal Components Analysis (PCA)

Recognize patterns in data: outliers, trends, groups…

PC1 (89%)-6 -4 -2 0 2

PC2

(8%

)

-0.8

-0.4

0.0

0.4

0.8

Page 8: Extracting information from spectral data

Biomass species and near infrared spectra

Spectral datax variables

n Samples

Page 9: Extracting information from spectral data

Wavelength (nm)800 1200 1600 2000 2400

Inte

nsity

(A.U

.)

-0.06

0.00

0.06

0.12

0.18PC1PC2

Transformation of the numbers into pictures by using principal component analysis (scores)

PC1-30 -20 -10 0 10 20

PC2

-30

-20

-10

0

10

20

30

Red oak

Hichory

Bagasse

Corn stover

Switchgrass

Yellow poplar

Page 10: Extracting information from spectral data

Transformation of the numbers into pictures by using principal component analysis (loadings)

PC1-30 -20 -10 0 10 20

PC2

-30

-20

-10

0

10

20

30

Red oak

Hichory

Bagasse

Corn stover

Switchgrass

Yellow poplar

Wavelength (nm)800 1200 1600 2000 2400

Inte

nsity

(A.U

.)

-0.06

0.00

0.06

0.12

0.18PC1PC2

Page 11: Extracting information from spectral data

PCA is a projection method, it decomposes the spectral data into a “structure” part and a “noise” part

X is an n samples (observations) by x variables (spectral variables) matrix

Principal Components Analysis (PCA)

x10

1 variable = 1 dimension

Page 12: Extracting information from spectral data

X is an n samples (observations) by x variables (spectral variables) matrix

Principal Components Analysis (PCA)

x2

x1

2 variables = 2 dimensions

x1

x2

x3

3 variables = 3 dimensional space

Page 13: Extracting information from spectral data

Principal Components Analysis (PCA)

Wavelength (nm)

800 1200 1600 2000 2400 2800

Abso

rban

ce

0.0

0.5

1.0

1.5

2.0

2.5

3.0

375 x-variables

Beyond 3 dimensions, it is very difficult to visualize what’s going on.

Each sample is represented as a co-ordinate axis in 375-dimensional space

18 Near infrared spectra of wood samples

Page 14: Extracting information from spectral data

Principal Components Analysis (PCA)

The first principal component

New co-ordinate axis representing the direction of maximum variation through the data.

X has only 3 variables (wavelengths x1, x2 and x3)The sample (n = 18) are represented in a 3D space

Page 15: Extracting information from spectral data

Higher-order Principal Components (PC2, PC3,…)

After PC1, next best direction for approximating the original data

The second PC lies along a direction orthogonal to the first PC

PC3 will be orthogonal to both PC1 and PC2 while simultaneously lying along the direction of the third largest variation.

The new variables (PCs) are uncorrelated with each other (orthogonal)

Page 16: Extracting information from spectral data

Scores (T) = Coordinates of samples in the PC space

There is a set of scores for each PC (score vector)

Original variable space

PC space

Representation of the samples in the PC space

Loadings (P) = Relations between X and PCs

Relationship between the original variable space and the new PCs spaceThere is a set of loadings for each PC (loading vector)

Page 17: Extracting information from spectral data

Transformation of the numbers into pictures by using PCA

PC1-30 -20 -10 0 10 20

PC2

-30

-20

-10

0

10

20

30

Red oak

Hichory

Bagasse

Corn stover

Switchgrass

Yellow poplar

Wavelength (nm)800 1200 1600 2000 2400

Inte

nsity

(A.U

.)

-0.06

0.00

0.06

0.12

0.18PC1PC2

Page 18: Extracting information from spectral data

Establish relationships between input and output variables, creating predictive models.

Quantitative informationProjection to Latent Structures or Partial Least Squares Regression (PLS)

PLS can be seen as two interconnected PCA analyses, PCA(X) and PCA(Y)

PLS uses the Y-data structure (variation) to guide the decomposition of X

The X-data structure also influences the decomposition of Y

X + Y ⇒ Model

Establishing calibration model from known X and Y data

X Model ⇒

Using calibration model to predict new Y-values from new set of X-

measurement

+ Ŷ

Page 19: Extracting information from spectral data

Biomass composition and near infrared spectra

Spectral data x variables y variables

Sam

ples

If y variables are not correlated ⇒ PLS1

Page 20: Extracting information from spectral data

Biomass composition and near infrared spectra

Spectral data x variables y variables

Sam

ples

If y variables are correlated ⇒ PLS2

Page 21: Extracting information from spectral data

Biomass composition and near infrared spectraSa

mpl

es

2/3 of the samples for calibration model1/3 of the samples for validation model

Random selection

Page 22: Extracting information from spectral data

Measured cellulose content (%)

25 30 35 40 45 50

Pred

icte

d C

ellu

lose

con

tent

(%)

25

30

35

40

45

50

Wavelength (nm)

1000 1200 1400 1600 1800 2000 2200 2400

Inte

nsity

-12

-8

-4

0

4

8

12

Calibration model to predict cellulose content

in pine

r = 0.95RMSEC = 1.6

Page 23: Extracting information from spectral data

Validation of the model to predict cellulose

content in pine

Measured cellulose content (%)

25 30 35 40 45 50

Pre

dict

ed c

ellu

lose

con

tent

(%)

25

30

35

40

45

50

r = 0.95RMSEP = 1.55

Page 24: Extracting information from spectral data

A powerful method for classification. The aim is to create a predictive model which can accurately classify future unknown samples.

PLS-Discriminant Analysis (PLS-DA)

y variable Spectral datax variables

Page 25: Extracting information from spectral data

PLS-Discriminant Analysis (PLS-DA)

Development of PLS-DA calibration model

Page 26: Extracting information from spectral data

Measured Y variable

0.0 0.2 0.4 0.6 0.8 1.0 1.2

Pre

dict

ed Y

var

iabl

e

-0.2

0.0

0.2

0.4

0.6

0.8

1.0

1.2

PC1 (92%)

-0.6 -0.4 -0.2 0.0 0.2 0.4 0.6

PC

2 (6

%)

-2

-1

0

1

2

Yellow poplar

Hickory

PLS-DA Calibration model

r = 0.99RMSEC = 0.04

Page 27: Extracting information from spectral data

PLS-Discriminant Analysis (PLS-DA)

Validation of PLS-DA model

Page 28: Extracting information from spectral data

PLS-DA Validation model

Y reference

0.0 0.2 0.4 0.6 0.8 1.0 1.2

Pred

icte

d Y

-0.2

0.0

0.2

0.4

0.6

0.8

1.0

1.2r = 0.99RMSEP = 0.04

Yellow poplar

Hickory

Y-reference Predicted YSpectrum 00008 0.0000 -0.0338Spectrum 00009 0.0000 0.0270Spectrum 00015 1.0000 0.9340Spectrum 00016 1.0000 1.0220

Page 29: Extracting information from spectral data

2D correlation tools spread spectral peaks over a

second dimension

simplify the visualization of complex spectra

Two-dimensional correlation spectroscopy

Electro-magneticProbe (eg, IR, UV, LIBS,…)

System2D

correlation maps

PerturbationMechanical, electrical,chemical, magneticoptical, thermal, …

Page 30: Extracting information from spectral data

Generation of orthogonal components: synchronous and asynchronous 2D correlation

intensities

Spectral data collected under perturbation

Wavelength (nm)1000 1200 1400 1600 1800 2000 2200 2400

Inte

nsity

(A.U

.)

0.5

1.0

1.5

2.0

2.5

Wavelength (nm)

1000 1200 1400 1600 1800 2000 2200 2400

Inte

nsity

(A.U

.)

-0.10

-0.05

0.00

0.05

0.10

0.15

Dynamic spectra [DYN]

[S] : m samples by n variables (wavelength) matrix(mxn)

Synchronous matrix

Asynchronous matrix

Reference spectrum

Noda-Hilbert matrix

[ ]( ) [ ] [ ]( )nmnnm SSDYN ×

→×× −= 1

1)1(

[ ]( ) [ ]( ) [ ]( )nmTmnnn DYNDYNSYNC ××× ×=

[ ]( ) [ ]( ) [ ] [ ]( )nmmmT

mnnn DYNNDYNASYN ×××× ××= )(

Page 31: Extracting information from spectral data

Homo-correlationNIR/NIR

Hetero-correlationNIR/MBMS

Perturbation: cellulose content

Page 32: Extracting information from spectral data

Other techniques to extract information

•Soft Independence Modeling of Class Analogy (SIMCA)

• Kernel Principal Component Analysis (k-PCA)

Classification

• Artificial Neural Networks (ANN)

non-linear data

Schölkopf B., Smola AJ., Müller K. (1998) Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation 10: 1299-1319

Demuth H., Beale M. and Hagan M., Neural Network Toolbox 5 User’s Guide – Matlab

The Unscrambler, User manual. CAMO,1998

Page 33: Extracting information from spectral data

Other techniques to extract information

• Artificial Neural Networks (ANN)Demuth H., Beale M. and Hagan M., Neural Network Toolbox 5 User’s Guide – Matlab

•Kernel Projection to Latent Structures (k-PLS)Rosipal R., Trejo L.J. (2001) Kernel partial least squares regression in reproducing Kernel Hilbert space. J. Machine Learning Res. 2: 97–123

Regression

• Supervised Probabilistic Principal Component Analysis (SPPCA)

Yu S., Yu K., Tresp V., Kriegel H., Wu M. (2006) Supervised probabilistic principal component analysis. Proceedings of the 12th International Conference on Knowledge Discovery and Data Mining (SIGKDD):464-473

non-linear data

Trygg J., Wold S. (2002) Orthogonal projections to latent structures (O-PLS). J. Chemo. 16:119-128• Orthogonal Projections to Latent Structures (O-PLS)

Page 34: Extracting information from spectral data

Softwares

•A user-friendly guide to Multivariate Calibration and Classification; T. Næs, T. Isaksson, t. Fearn, T. Davies, NIR Publications, Chichester, UK, 2002

•Multivariate calibration, H. Martens and T. Næs, John Wiley & Sons, Chichester, UK, 1989

•Chemometric techniques for quantitative analysis, Marcel Dekker, New York, 1998

•Two-dimensional correlation spectroscopy, I. Noda and Y. Ozaki, John Wiley & Sons, Chichester, UK, 2004

•Neural Network Toolbox 5 User’s Guide – Matlab, H. Demuth, M. Beale and M. Hagan.

References

• www.camo.com• www.umetrics.com• www.infometrix.com• www.mathworks.com

Page 35: Extracting information from spectral data

Questions?