Multivariate Data Analysis - courses.pbsci.ucsc.edu Handouts... · Presentation of Multivariate Data • Hard to visualize complex (more than 3 dimensions) multivariate datasets –For

Multivariate Data Analysis: a survey of data reduction and data association techniques

Principal Components Analysis

For example

• Data reduction approaches

– Cluster analysis

– Principal components analysis

– Principal coordinates analysis

– Multidimensional scaling

• Hypothesis testing approaches

– Discriminant analysis

– MANOVA

– ANOSIM

– Canonical correlation

– PERMANOVA

Objects

• Things we wish to compare

– sampling or experimental units

– e.g. quadrats, animals, plants, cages etc.

Variables

• Characteristics measured from each object

– usually continuous variables

– e.g. counts of species, size of body parts etc.

Ecological data

• Objects:

– sampling units (SUs, e.g. quadrats, plots etc.)

• Variables:

– species abundances and/or environmental data

• Common in community ecology

Wisconsin forests (Peet & Loucks 1977)

• Plots (quadrats) in Wisconsin forests

• Number of individuals of each species of tree recorded in each quadrat

• Objects:

– quadrats

• Variables:

– abundances of each tree species

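Data

| Plot | Bur oak | Black oak | White oak | Red oak | etc. |
|------|---------|-----------|-----------|---------|------|
| 1    | 9       | 8         | 5         | 3       |      |
| 2    | 8       | 9         | 4         | 4       |      |
| 3    | 3       | 8         | 9         | 0       |      |
| 4    | 5       | 7         | 9         | 6       |      |
| 5    | 6       | 0         | 7         | 9       |      |
| 6    | 0       | 0         | 7         | 8       |      |
| etc. |         |           |           |         |      |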

Garroch Head dumping ground (Clarke & Ainsworth 1993)

• Sewage sludge dumping ground in bay

• Transect across dumping ground

• Core of mud at each of 10 stations along transect

• Objects:

– stations

• Variables:

– metal concentrations in ppm

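Data (metal concentrations in ppm)

| Station | Cu  | Mn   | Co | Ni | Zn  | Cd  | etc. |
|---------|-----|------|----|----|-----|-----|------|
| 1       | 26  | 2470 | 14 | 34 | 160 | 0   |      |
| 2       | 30  | 1170 | 15 | 32 | 156 | 0.2 |      |
| 3       | 37  | 394  | 12 | 38 | 182 | 0.2 |      |
| 4       | 74  | 349  | 12 | 41 | 227 | 0.5 |      |
| 5       | 115 | 317  | 10 | 37 | 329 | 2.2 |      |
| etc.    |     |      |    |    |     |     |      |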

Morphological data

• Objects: – usually organisms or specimens

• Variables: – morphological measurements

Morphological data

• Morphological variation between dog species/types

• Objects:

– dog types (7)

• Variables:

– sizes of 6 different parts of mandible

– mandible breadth, mandible height, etc.

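Data

| Dog type        | Variable 1 | Variable 2 | Variable 3 | Variable 4 | Variable 5 | Variable 6 |
|-----------------|------------|------------|------------|------------|------------|------------|
| Modern dog      | 9.7        | 21.0       | 19.4       | 7.7        | 32.0       | 36.5       |
| Jackal          | 8.1        | 16.7       | 18.3       | 7.0        | 30.3       | 32.9       |
| Chinese wolf    | 13.5       | 27.3       | 26.8       | 10.6       | 41.9       | 48.1       |
| Indian wolf     | 11.5       | 24.3       | 24.5       | 9.3        | 40.0       | 44.6       |
| Cuon            | 10.7       | 23.5       | 21.4       | 8.5        | 28.8       | 37.6       |
| Dingo           | 9.6        | 22.6       | 21.1       | 8.3        | 34.4       | 43.1       |
| Prehistoric dog | 10.3       | 22.1       | 19.1       | 8.1        | 32.3       | 25.0       |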

Presentation of Multivariate Data

• Hard to visualize complex (more than 3 dimensions) multivariate datasets

– For example, how do you visualize 7 attributes of a dog skull?

• Easier to visualize relationships between objects (e.g. similarity, dissimilarity, correlation, scaled distance)

Presentation of Multivariate Data

[Figure: a raw data matrix (objects O1 ... Op as rows, variables V1 ... Vn as columns) is converted to a resemblance matrix (objects O1 ... Op by objects O1 ... Op), created using correlations, covariances or dissimilarity indices. The resemblance matrix is then used for ordination or classification.]

Principal Components Analysis

• Aims to reduce a large number of variables to a smaller number of summary variables called Principal Components (or factors) that explain most of the variation in the data.

• Is basically a rotation of axes after centering on the means of the variables, the rotated axes being the Principal Components.

• Is usually carried out using a matrix algebra technique called eigenanalysis.

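Regression

Least squares (OLS) estimation allows the best prediction of Y given X (it minimizes the distance to the line in the y direction)

[Figure: scatterplot of Y against X with the least squares regression line. For each point $(x_i, y_i)$ the residual is $y_i - \hat{y}_i$, the observed y minus the predicted y.]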

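PCA: association among variables (minimizes the distance to the line in both the x and y directions, i.e. the perpendicular distance)

[Figure: scatterplot of Y against X with a line fitted through the observed points $(x_i, y_i)$]

Comparison

[Figure: the same scatterplot showing both the regression line (Y on X) and Component 1 (Factor 1)]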
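The contrast between the regression line and Component 1 can be sketched numerically. The example below is a minimal illustration on made-up synthetic data (not data from these slides): it fits the OLS slope of Y on X and the slope of the first principal component of the covariance matrix. The variable names are hypothetical.

```python
import numpy as np

# Synthetic, made-up data purely for illustration (not from the slides).
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.8 * x + rng.normal(scale=0.5, size=200)

# Centre both variables on their means.
xc = x - x.mean()
yc = y - y.mean()

# OLS slope of Y on X: minimizes squared distances in the y direction only.
ols_slope = (xc @ yc) / (xc @ xc)

# Component 1: eigenvector of the covariance matrix with the largest
# eigenvalue. It minimizes squared perpendicular distances to the line.
cov = np.cov(np.vstack([xc, yc]))
eigenvalues, eigenvectors = np.linalg.eigh(cov)  # eigenvalues in ascending order
pc1 = eigenvectors[:, -1]
pc1_slope = pc1[1] / pc1[0]

print(f"OLS slope (Y on X): {ols_slope:.3f}")
print(f"Component 1 slope:  {pc1_slope:.3f}")
```

With correlated data such as these, the Component 1 line is steeper than the regression of Y on X, because it treats scatter in x and y symmetrically rather than attributing all of it to Y.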

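PCA: association among variables (minimizes the distance to the line in both the x and y directions, i.e. the perpendicular distance)

[Figure: scatterplot of Y against X with Principal component 1 (Factor 1) drawn through the points, and the PC1 and PC2 axes marked]

Can be done in N dimensions

Maximum number of PCs = number of original variables (or number of objects - 1, if that is smaller)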

Steps in PCA

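1) From the raw data matrix, calculate the correlation matrix (which is the same as the covariance matrix of the standardized variables)

[Figure: raw data matrix with sites (Site 1, Site 2, Site 3, ...) as rows and variables (NO3, Total Organic N, Total N, ...) as columns]

Correlation matrix (TON = total organic N, TN = total N):

|     | NO3  | TON  | TN |
|-----|------|------|----|
| NO3 | 1    |      |    |
| TON | 0.37 | 1    |    |
| TN  | 0.84 | 0.13 | 1  |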

Steps in PCA

2) Calculate eigenvectors (weightings of each original variable on each component) and eigenvalues (= "latent roots"; relative measures of the variation explained by each component)
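As a rough sketch of step 2, the eigenanalysis can be run on the small correlation matrix shown in step 1 (NO3, TON, TN). The use of NumPy and the variable names are assumptions made for illustration; the slides do not say which software was used.

```python
import numpy as np

# Correlation matrix from step 1 (NO3, total organic N, total N).
variables = ["NO3", "TON", "TN"]
R = np.array([
    [1.00, 0.37, 0.84],
    [0.37, 1.00, 0.13],
    [0.84, 0.13, 1.00],
])

# eigh is intended for symmetric matrices; eigenvalues come back ascending.
eigenvalues, eigenvectors = np.linalg.eigh(R)

# Re-order so that component 1 has the largest eigenvalue.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Each column of `eigenvectors` holds the weights of the original
# variables on one component.
for k, value in enumerate(eigenvalues, start=1):
    weights = ", ".join(
        f"{name}: {w:+.3f}" for name, w in zip(variables, eigenvectors[:, k - 1])
    )
    print(f"Component {k}: eigenvalue = {value:.3f}; weights = {weights}")
```

The eigenvalues of a correlation matrix sum to the number of variables (here 3), which is why each one can be read as a share of the total variation. The sign of an eigenvector is arbitrary, so other software may report the same weights with all signs flipped.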

Eigenvectors

$$z_{ik} = c_1 y_{i1} + c_2 y_{i2} + \dots + c_j y_{ij} + \dots + c_p y_{ip}$$

Where $z_{ik}$ = score for component k for object i

$y_{ij}$ = value of original variable j for object i

$c_j$ = factor score coefficient (weight) of variable j for component k

Example: soil chemistry in a forest

zik = c1(NO3) + c2(total organic N) + c3(total N) + ..

• the objects are sampling sites

• the variables are chemical measurements, e.g. total N

Steps in PCA - continued

3) Decide how many components to retain (scree plot of eigenvalues)

[Figure: scree plot of Eigenvalue (0 to 5) against Factor number (1 to 8)]

For a PCA on the correlation matrix, an eigenvalue of 1 means the Factor explains as much variation in the dataset as an original variable. Values greater than 1 indicate useful Factors.
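The retention rule can be sketched in a few lines. This example reuses the three-variable correlation matrix from step 1 and applies the "eigenvalue greater than 1" rule, alongside the proportion of variation explained by each component; the variable names are hypothetical and NumPy is an assumed tool.

```python
import numpy as np

# Correlation matrix from step 1 (NO3, total organic N, total N).
R = np.array([
    [1.00, 0.37, 0.84],
    [0.37, 1.00, 0.13],
    [0.84, 0.13, 1.00],
])

# Eigenvalues sorted from largest to smallest, as on a scree plot.
eigenvalues = np.sort(np.linalg.eigvalsh(R))[::-1]

# Share of the total variation explained by each component, and the running total.
proportion = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(proportion)

# Rule of thumb from the slide: keep components with eigenvalue > 1.
keep = eigenvalues > 1.0
n_retained = int(keep.sum())

for k, (ev, prop, cum, flag) in enumerate(
    zip(eigenvalues, proportion, cumulative, keep), start=1
):
    print(f"Factor {k}: eigenvalue={ev:.3f}  explained={prop:.1%}  "
          f"cumulative={cum:.1%}  retain={flag}")
print(f"Components retained by the eigenvalue > 1 rule: {n_retained}")
```

The eigenvalue rule and the scree plot are both heuristics, so it is reasonable to check the cumulative proportion as well before settling on the number of components.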

Steps in PCA

4) Using the factor score coefficients, calculate the factor score for each object on each component:

factor score = sum over all variables of coefficient x (standardized) variable

Steps in PCA

5) Position objects on a scatterplot, using factor scores on the first two (or three) Principal Components

[Figure: scatterplot of FACTOR(2) against FACTOR(1), with Site 1, Site 2 and Site 3 positioned by their factor scores]
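Steps 4 and 5 can be sketched as follows. The data are made up (10 hypothetical sites by 3 variables), and the unit-length eigenvectors are used directly as the coefficients. That is an assumption: some statistical packages rescale the coefficients, which changes the scale of the scores but not the relative positions of the objects along each axis.

```python
import numpy as np

# Made-up raw data: 10 hypothetical sites (rows) by 3 variables (columns).
rng = np.random.default_rng(1)
raw = rng.normal(loc=[5.0, 20.0, 30.0], scale=[1.0, 4.0, 6.0], size=(10, 3))

# Standardize each variable to mean 0 and standard deviation 1.
z = (raw - raw.mean(axis=0)) / raw.std(axis=0, ddof=1)

# Steps 1-2: correlation matrix and its eigenanalysis, largest eigenvalue first.
R = np.corrcoef(raw, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(R)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Step 4: score = sum over variables of coefficient x standardized value.
scores = z @ eigenvectors

# Step 5: the first two columns are the plotting coordinates of each site.
for site, (f1, f2) in enumerate(scores[:, :2], start=1):
    print(f"Site {site}: FACTOR(1) = {f1:+.2f}, FACTOR(2) = {f2:+.2f}")
```

With this scaling, the variance of the scores on each component equals that component's eigenvalue, and the scores on different components are uncorrelated.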

What are loadings?

• Correlations between the original variables and the Factors (r's)

– For example, the correlation between variable X and Factor 1

– Correlations range from +1 to -1

– +1 indicates a perfect positive relationship with no scatter around the line

– -1 indicates a perfect negative relationship with no scatter around the line

Interpretation of r (correlation coefficient)

[Figure: six scatterplots of an Original Variable against Factor 1, showing r = 1 (r² = 1), r = 0.77 (r² = 0.59), r = 0 (r² = 0, two panels), r = -0.77 (r² = 0.59) and r = -1 (r² = 1)]
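Loadings can be checked directly against this definition. The sketch below uses made-up data (50 hypothetical objects, 4 variables) and computes each loading as the correlation between an original variable and a component's scores. For a PCA on the correlation matrix this should match the eigenvector weight multiplied by the square root of the eigenvalue. Names and data are assumptions for illustration only.

```python
import numpy as np

# Made-up data: 50 hypothetical objects by 4 variables, two of them correlated.
rng = np.random.default_rng(2)
raw = rng.normal(size=(50, 4))
raw[:, 1] += 0.8 * raw[:, 0]

# Standardize, then run the eigenanalysis on the correlation matrix.
z = (raw - raw.mean(axis=0)) / raw.std(axis=0, ddof=1)
R = np.corrcoef(z, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(R)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
scores = z @ eigenvectors

# Loadings by definition: correlation of variable j with the scores on factor k.
p = z.shape[1]
loadings = np.array([
    [np.corrcoef(z[:, j], scores[:, k])[0, 1] for k in range(p)]
    for j in range(p)
])

# Shortcut for correlation-matrix PCA: eigenvector weight x sqrt(eigenvalue).
shortcut = eigenvectors * np.sqrt(eigenvalues)

print(np.round(loadings, 3))
```

Rows correspond to the original variables and columns to the factors; squaring a loading gives the proportion of that variable's variation accounted for by the factor (the r² in the figure).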

Worked example

• Using ourworld

• Variables sampled are population in 1983, 1986 and 1990, military spending, Gross National Product, birth rate in 1982 and death rate in 1982 (7 total)

• Can these variables be reduced to fewer composite factors?

Raw Data

| Case | POP83 | POP86 | POP90    | Birth82 | Death82 | GNP  | Mil      |
|------|-------|-------|----------|---------|---------|------|----------|
| 1    | 3.4   | 3.6   | 3.500212 | 20      | 9       | 5150 | 95.83333 |
| 2    | 7.5   | 7.6   | 7.644275 | 12      | 12      | 9880 | 127.2368 |

Factor Coefficients: multiply Raw Data by coefficients to get factor scores

Case 1, Factor 1 = 3.4 (0.560) + 3.6 (0.564) + 3.5 (0.566) + 20 (0.114) + 9 (0.086) + 5150 (-0.130) + 95.83 (-0.092)

Case 1, Factor 2 = 3.4 (0.141) + 3.6 (0.123) + 3.5 (0.104) + 20 (-0.520) + 9 (-0.326) + 5150 (0.574) + 95.83 (0.495)

Determine how many components (composite factors) to retain

~80% of variance explained by 2 (of 7) components

Using PCA

• Run simple PCA, no rotation

• Examine loadings: correlations between factors and original variables

Rotation - Varimax

PCA - ourworld

• What have we found out?

– The seven variables examined can be reduced to 2 components and still retain ~80% of the original information

• What have we not found out?

– Any relationships with predictor variables

• Remember: PCA is a data reduction technique, NOT a hypothesis testing technique

• Can it be used to examine hypotheses?

– Overlay predictor groups on Factor Plots

– For example, is there a relationship between the Factor scores and Urban (rural or city) or Group (Europe, Islamic or New World)?

[Figure: scatterplot of FACTOR(2) against FACTOR(1), with points coded by GROUP (Europe, Islamic, NewWorld). Any contribution of Factor 1?]

[Figure: scatterplot of FACTOR(2) against FACTOR(1), with points coded by URBAN (rural, city). Any contribution of Factor 1?]