Multivariate Data Analysis Handouts... · Presentation of Multivariate Data • Hard to visualize complex (more than 3 dimensions) multivariate datasets – For example, how do you

Multivariate Data Analysis: a survey of data reduction and data association techniques

For example

• Data reduction approaches
– Cluster analysis
– Principal components analysis
– Principal coordinates analysis
– Multidimensional scaling

• Hypothesis testing approaches
– Discriminant analysis
– MANOVA
– ANOSIM
– Canonical correlation
– PERMANOVA

Objects

• Things we wish to compare
– sampling or experimental units
– e.g. quadrats, animals, plants, cages etc.

Variables

• Characteristics measured from each object
– usually continuous variables
– e.g. counts of species, size of body parts etc.

Ecological data

• Objects:
– sampling units (SUs, e.g. quadrats, plots etc.)
• Variables:
– species abundances and/or environmental data
• Common in community ecology

Wisconsin forests (Peet & Loucks 1977)

• Plots (quadrats) in Wisconsin forests
• Number of individuals of each species of tree recorded in each quadrat
• Objects:
– quadrats
• Variables:
– abundances of each tree species

Data

Plot  Bur oak  Black oak  White oak  Red oak  etc.
1     9        8          5          3
2     8        9          4          4
3     3        8          9          0
4     5        7          9          6
5     6        0          7          9
6     0        0          7          8
etc.

Garroch Head dumping ground (Clarke & Ainsworth 1993)

• Sewage sludge dumping ground in bay
• Transect across dumping ground
• Core of mud at each of 10 stations along transect
• Objects:
– stations
• Variables:
– metal concentrations in ppm

Data

Station  Cu   Mn    Co  Ni  Zn   Cd   etc.
1        26   2470  14  34  160  0
2        30   1170  15  32  156  0.2
3        37   394   12  38  182  0.2
4        74   349   12  41  227  0.5
5        115  317   10  37  329  2.2
etc.

Morphological data

• Objects:
– usually organisms or specimens
• Variables:
– morphological measurements

Morphological data

• Morphological variation between dog species/types
• Objects:
– dog types (7)
• Variables:
– sizes of 6 different parts of mandible
– mandible breadth, mandible height, etc.

Data

                 Variable
Dog type         1     2     3     4     5     6
Modern dog       9.7   21.0  19.4  7.7   32.0  36.5
Jackal           8.1   16.7  18.3  7.0   30.3  32.9
Chinese wolf     13.5  27.3  26.8  10.6  41.9  48.1
Indian wolf      11.5  24.3  24.5  9.3   40.0  44.6
Cuon             10.7  23.5  21.4  8.5   28.8  37.6
Dingo            9.6   22.6  21.1  8.3   34.4  43.1
Prehistoric dog  10.3  22.1  19.1  8.1   32.3  25.0

Presentation of Multivariate Data

• Hard to visualize complex (more than 3 dimensions) multivariate datasets
– For example, how do you visualize 7 attributes of a dog skull?
• Easier to visualize relationships between objects (e.g. similarity, dissimilarity, correlation, scaled distance)

Presentation of Multivariate Data

[Diagram: a raw data matrix (objects O1 ... Op in rows, variables V1 ... Vn in columns) and a resemblance matrix (objects O1 ... Op by objects O1 ... Op), created using correlations, covariances or dissimilarity indices. The diagram also carries the labels Ordination and Classification.]

Data Standardization

Adjusting of data so that means and/or variances or totals are the same for each variable.

examples:

– 1) centering + standardizing

$x_i' = \frac{x_i - \bar{x}}{s}$

– 2) rescaling relative to the maximum

$x_i' = \frac{x_i}{x_{max}}$
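A minimal Python sketch of the two standardizations, applied to the Cu column of the Garroch Head data. It assumes s is the sample standard deviation (n - 1 denominator), which the slide does not specify; the function names are illustrative only.

```python
from statistics import mean, stdev

def center_and_standardize(values):
    """Return (x - mean) / s for each x; s is the sample standard deviation (assumed)."""
    m = mean(values)
    s = stdev(values)
    return [(x - m) / s for x in values]

def rescale_to_max(values):
    """Return x / max(values) for each x."""
    top = max(values)
    return [x / top for x in values]

# Cu concentrations (ppm) at stations 1 to 5 of the Garroch Head data
cu = [26, 30, 37, 74, 115]
z = center_and_standardize(cu)   # mean 0, standard deviation 1
r = rescale_to_max(cu)           # largest value becomes 1
```

After centering and standardizing, every variable has mean 0 and unit variance, so variables measured on very different scales (e.g. Mn and Cd) contribute comparably.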

Principal Components Analysis

• Aims to reduce a large number of variables to a smaller number of summary variables called Principal Components (or factors) that explain most of the variation in the data.

• Is basically a rotation of axes after centering to the means of the variables, the rotated axes being the Principal Components.

• Is usually carried out using a matrix algebra technique called eigenanalysis.

Regression

Least squares (OLS) estimation allows best prediction of Y given X (minimize distance in y direction to line)

[Figure: scatterplot of Y against X with the least squares regression line through the means of x and y; for a point at x_i, the residual is the observed y_i minus the predicted y_i.]

PCA

Association among variables (minimize distance to line in both x and y directions)

[Figure: the same scatterplot, with distances from each observed point to the line measured in both directions.]

Comparison

[Figure: the regression line (Y on X) and Component 1 (Factor 1) drawn on the same scatterplot.]

Rotation

Now rotate axes (rotation is centered on $(\bar{x}, \bar{y})$)

[Figure: rotated axes showing Component 2 (Factor 2).]

Can be done in N dimensions

[Figure: three axes labeled PC1, PC2 and PC3.]
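The difference between the regression line and Component 1 can be sketched numerically. The data below are hypothetical and chosen only for illustration; the sketch assumes NumPy and takes Component 1 to be the eigenvector of the covariance matrix with the largest eigenvalue.

```python
import numpy as np

# Hypothetical data, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 1.9, 3.4, 3.6, 5.3, 5.6])

# Center both variables on their means (the rotation is centered here)
xc, yc = x - x.mean(), y - y.mean()

# OLS slope of Y on X: minimizes squared distances in the y direction only
ols_slope = (xc @ yc) / (xc @ xc)

# Component 1: eigenvector of the covariance matrix with the largest eigenvalue
cov = np.cov(np.vstack([xc, yc]))
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
pc1 = eigvecs[:, -1]
pc1_slope = pc1[1] / pc1[0]

print(f"OLS slope: {ols_slope:.3f}, Component 1 slope: {pc1_slope:.3f}")
```

For positively correlated data with some scatter, the Component 1 line is steeper than the Y on X regression line, because it treats both variables symmetrically.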

Steps in PCA

1) From raw data matrix, calculate correlation matrix, or covariance matrix on standardized variables

Raw data matrix (values not shown):

        NO3  Total Organic N  Total N  . . . .
Site 1
Site 2
Site 3
:
:

Correlation matrix:

      NO3   TON   TN
NO3   1
TON   0.37  1
TN    0.84  0.13  1
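Step 1 can be sketched as follows, assuming NumPy. The site values are hypothetical (the slide does not show them), so the resulting correlations will not match the matrix above; the point is only that the correlation matrix equals the covariance matrix of the standardized variables.

```python
import numpy as np

# Hypothetical raw data matrix: rows are sites, columns are NO3, TON, TN
X = np.array([
    [0.8, 2.1, 3.0],
    [1.5, 2.4, 4.1],
    [0.4, 3.0, 3.3],
    [2.2, 2.8, 5.2],
    [1.1, 1.9, 3.1],
])

# Correlation matrix among the variables (columns)
R = np.corrcoef(X, rowvar=False)

# Equivalent route: covariance matrix of the standardized variables
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
C = np.cov(Z, rowvar=False)

print(np.round(R, 2))
```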

Steps in PCA

2) Calculate eigenvectors (weightings of each original variable on each component) and eigenvalues (= "latent roots"; relative measures of the variation explained by each component)

Eigenvectors

$z_{ik} = c_1 y_{i1} + c_2 y_{i2} + \dots + c_j y_{ij} + \dots + c_p y_{ip}$

Where
$z_{ik}$ = score for component k for object i
$y_{ij}$ = value of original variable j for object i
$c_j$ = factor score coefficient (weight) of variable j for component k

Example: soil chemistry in a forest

$z_{ik} = c_1(\mathrm{NO_3}) + c_2(\text{total organic N}) + c_3(\text{total N}) + \dots$

• the objects are sampling sites
• the variables are chemical measurements, e.g. total N
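A sketch of step 2 under stated assumptions: NumPy, hypothetical site data, and unit-length eigenvectors of the correlation matrix used directly as the weights c. Statistical packages may rescale these weights, so coefficients reported elsewhere can differ by a constant per component.

```python
import numpy as np

# Hypothetical data: rows are sampling sites, columns are NO3, TON, TN
X = np.array([
    [0.8, 2.1, 3.0],
    [1.5, 2.4, 4.1],
    [0.4, 3.0, 3.3],
    [2.2, 2.8, 5.2],
    [1.1, 1.9, 3.1],
])

Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardized variables
R = np.corrcoef(X, rowvar=False)

eigvals, eigvecs = np.linalg.eigh(R)               # ascending order
order = np.argsort(eigvals)[::-1]                  # largest eigenvalue first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# z_ik = sum over j of (weight of variable j on component k) * y_ij
scores = Z @ eigvecs

print("eigenvalues:", np.round(eigvals, 3))
```

With a correlation matrix the eigenvalues sum to the number of variables, and the variance of the scores on each component equals that component's eigenvalue.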

Steps in PCA - continued

3) Decide how many components to retain (scree plot of eigenvalues)

[Figure: scree plot of eigenvalue (0 to 5) against factor number (1 to 8).]
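The quantities read off a scree plot can be tabulated directly. The eigenvalues below are hypothetical, chosen only so that they sum to 8 (as they would for 8 standardized variables); they are not taken from the slide.

```python
# Hypothetical eigenvalues for 8 standardized variables (they sum to 8)
eigenvalues = [3.9, 2.5, 0.7, 0.4, 0.2, 0.15, 0.1, 0.05]

total = sum(eigenvalues)
proportion = [ev / total for ev in eigenvalues]

# Running total of the proportion of variation explained
cumulative = []
running = 0.0
for p in proportion:
    running += p
    cumulative.append(running)

for k, (ev, p, c) in enumerate(zip(eigenvalues, proportion, cumulative), start=1):
    print(f"PC{k}: eigenvalue={ev:.2f}  explained={p:.1%}  cumulative={c:.1%}")
```

In a scree plot one looks for the point where the eigenvalues level off; the cumulative column shows how much variation is kept if only the components before that point are retained.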

Steps in PCA

4) Using factor score coefficients, calculate factor scores:

factor score = sum over variables of coefficient x (standardized) variable

Steps in PCA

5) Position objects on scatterplot, using factor scores on first two (or three) Principal Components

[Figure: scatterplot of FACTOR(2) against FACTOR(1), with points for Site 1, Site 2 and Site 3.]

What are loadings?

• Correlations of original data and Factors (r's)
– For example the correlation between X and Factor 1
– Correlations range from +1 to -1
– +1 indicates a perfect positive relationship with no scatter around the line
– -1 indicates a perfect negative relationship with no scatter around the line

Interpretation of r (correlation coefficient)

[Figure: six scatterplots of an original variable against Factor 1, with r = 1 (r2 = 1), r = .77 (r2 = .59), r = 0 (r2 = 0), r = -1 (r2 = 1), r = -.77 (r2 = .59) and r = 0 (r2 = 0).]
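Loadings can be computed literally as correlations between each original variable and each component's scores. This sketch assumes NumPy, hypothetical data, and a PCA of the correlation matrix with unit-length eigenvectors; under those assumptions the loadings also equal each eigenvector scaled by the square root of its eigenvalue.

```python
import numpy as np

# Hypothetical data: 6 objects by 3 variables
X = np.array([
    [2.0, 1.0, 5.0],
    [3.0, 2.5, 4.0],
    [4.5, 2.0, 3.5],
    [5.0, 4.0, 2.0],
    [6.5, 4.5, 2.5],
    [7.0, 6.0, 1.0],
])

Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
eigvals, eigvecs = np.linalg.eigh(np.corrcoef(X, rowvar=False))
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # largest eigenvalue first
scores = Z @ eigvecs

# Loading = correlation between original variable j and component k
p = X.shape[1]
loadings = np.array([
    [np.corrcoef(X[:, j], scores[:, k])[0, 1] for k in range(p)]
    for j in range(p)
])

print(np.round(loadings, 3))
```

Each row of `loadings` describes one variable; a value near +1 or -1 in a column means that variable is almost fully represented by that component.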

Worked example

• Using ourworld
• Variables sampled are Population in 1983, 1986 and 1990, military spending, Gross National Product, birth rate in 1982, death rate in 1982 (7 total)
• Can these variables be reduced into fewer composite factors?

Raw Data

Case  POP83  POP86  POP90     Birth82  Death82  GNP   Mil
1     3.4    3.6    3.500212  20       9        5150  95.83333
2     7.5    7.6    7.644275  12       12       9880  127.2368

Factor Coefficients

Case 1, Factor 1 = 3.4 (.516) + 3.6 (.564) + 3.5 (.566) + 20 (.114) + 9 (.086) + 5150 (-.130) + 95.83 (-.092)

Case 1, Factor 2 = 3.4 (.141) + 3.6 (.123) + 3.5 (.104) + 20 (-.520) + 9 (-.326) + 5150 (.574) + 95.83 (.495)

Multiply Raw Data by coefficients to get factor scores
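The arithmetic for Case 1 can be written as a weighted sum. The values and coefficients are copied from the slide as written; the helper name `weighted_sum` is illustrative only.

```python
# Case 1 values and factor coefficients as written on the slide
variables = ["POP83", "POP86", "POP90", "Birth82", "Death82", "GNP", "Mil"]
case1 = [3.4, 3.6, 3.5, 20, 9, 5150, 95.83]
coef_factor1 = [0.516, 0.564, 0.566, 0.114, 0.086, -0.130, -0.092]
coef_factor2 = [0.141, 0.123, 0.104, -0.520, -0.326, 0.574, 0.495]

def weighted_sum(values, coefficients):
    """Sum of coefficient * value over all variables."""
    return sum(c * v for c, v in zip(coefficients, values))

factor1 = weighted_sum(case1, coef_factor1)
factor2 = weighted_sum(case1, coef_factor2)

# Show the contribution of each variable to Factor 1
for name, v, c in zip(variables, case1, coef_factor1):
    print(f"{name:8s} {v:>8} x {c:+.3f} = {v * c:+.3f}")
```

Applied to the raw values, the GNP term dominates both sums (Factor 1 comes to about -669.5 and Factor 2 to about 2991.5). This is why step 4 of the procedure applies the coefficients to standardized variables.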

Determine how many components (composite factors) to retain

~80% of variance explained by 2 (of 7) components

Using PCA

• Run simple PCA, no rotation
• Examine loadings – correlations between factors and original variables

Rotation - Varimax

Rotated Factor Loading

           Factor 1   Factor 2
Pop_1983   0.994571   0.028407
Pop_1986   0.997697   -0.001704
Pop_1990   0.998593   -0.030743
Pop_2020   0.962739   -0.187789
Birth_82   0.048807   -0.839114
Death_82   0.038237   -0.529927
Gnp_82     -0.043234  0.922798
Mil        -0.004336  0.792275

[Figure: factor loadings plot, Factor 2 (31.3 %) against Factor 1 (48.9 %), with labeled points including Pop_2020, Birth_82, Death_82, Mil and Gnp_82.]

PCA - ourworld

• What have we found out?
– The seven examined variables can be reduced to 2 and still retain ~80% of original information
• What we have not found out
– Any relationships with predictor variables
• Remember PCA is a data reduction NOT hypothesis testing technique
• Can it be used to examine hypotheses?
– Overlay predictor groups on Factor Plots
– For example is there a relationship between the Factor scores and Urban (rural, city) or Group (Europe, Islamic or New World)?

[Figure: factor scores, FACTOR(2) against FACTOR(1), with points labeled by GROUP (NewWorld, Islamic, Europe). Any contribution of Factor 1?]

[Figure: factor scores, FACTOR(2) against FACTOR(1), with points labeled by URBAN (rural, city). Any contribution of Factor 1?]

PCA Regression

• What do you do if a multiple regression analysis indicates collinearity of predictor variables?
• For example the relationship between a metric of Urbanization and Population in 1983, 1986 and 1990, military spending, Gross National Product, birth rate in 1982, death rate in 1982
• Perhaps PCA Regression
– Factors are independent
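A minimal sketch of the idea, assuming NumPy and simulated data (not the ourworld data): three predictors are near copies of one signal, so they are strongly collinear. The response is then regressed on the scores of the first two components, which are uncorrelated by construction. Keeping two components is an arbitrary choice made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated collinear predictors: three noisy copies of one signal plus one independent variable
n = 60
base = rng.normal(size=n)
X = np.column_stack([
    base + 0.05 * rng.normal(size=n),
    base + 0.05 * rng.normal(size=n),
    base + 0.05 * rng.normal(size=n),
    rng.normal(size=n),
])
y = 2.0 * base + X[:, 3] + 0.1 * rng.normal(size=n)   # simulated response

# PCA on the standardized predictors
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
_, eigvecs = np.linalg.eigh(np.corrcoef(X, rowvar=False))
eigvecs = eigvecs[:, ::-1]             # largest eigenvalue first
scores = Z @ eigvecs[:, :2]            # keep the first two components

# Ordinary least squares of y on the retained component scores
design = np.column_stack([np.ones(n), scores])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)

print("correlation between retained components:",
      round(float(np.corrcoef(scores[:, 0], scores[:, 1])[0, 1]), 12))
```

The regression coefficients now refer to components rather than to the original variables, so interpretation relies on the loadings.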

Results of multiple regression

[Figure: scatterplot matrix of the predictors Pop_1983, Pop_1986, Pop_1990, Birth_82, Death_82, Gnp_82 and Mil.]

Rotation - Varimax

[Rotated factor loading table and loadings plot, Factor 2 (31.3 %) against Factor 1 (48.9 %), repeated from the PCA worked example.]

Save Principal Components

Results of PCA regression

[Figure: Urban Metric plotted against Factor 2; the Factor 2 axis is annotated with "Birth 82, Death 82" and "GNP 82, Mil". The rotated factor loading table is repeated alongside.]

Dissimilarity Indices

• Dissimilarity indices:
– measure how different objects are in terms of their variable values
– how different sampling units are in species composition
– how different organisms are in morphological structure

Dissimilarity Indices

• Dissimilarity:
– calculated for each pair of objects in data set
– dissimilarity between 2 quadrats in terms of species composition
– dissimilarity between 2 dogs in terms of morphological structure

Dissimilarity

• Consider 2 objects j and k (e.g. 2 quadrats)
• Let $y_{ij}$ and $y_{ik}$ be values for variable i in objects j and k:

Quadrat  Sp1  Sp2  Sp3   (i = 1 to 3)
j        3    6    9
k        6    12   18

• For sp1, $y_{1j}$ = 3 and $y_{1k}$ = 6
• For sp2, $y_{2j}$ = 6 and $y_{2k}$ = 12
• For sp3, $y_{3j}$ = 9 and $y_{3k}$ = 18

Euclidean Distance

$\sqrt{\sum_i (y_{ij} - y_{ik})^2}$

$\sqrt{(3-6)^2 + (6-12)^2 + (9-18)^2} = \sqrt{126} = 11.2$
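The calculation for quadrats j and k can be checked with a few lines of Python; the function name `euclidean` is illustrative only.

```python
from math import sqrt

def euclidean(a, b):
    """Euclidean distance between two objects measured on the same variables."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

quadrat_j = [3, 6, 9]     # abundances of Sp1, Sp2, Sp3
quadrat_k = [6, 12, 18]

d = euclidean(quadrat_j, quadrat_k)   # sqrt(9 + 36 + 81) = sqrt(126)
print(round(d, 1))
```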

Euclidean Distance

• Distance between objects when plotted in multidimensional (multivariable) space

[Figure: Quadrat 1 and Quadrat 2 plotted by abundance of species 1 (0 to 100) and abundance of species 2 (0 to 100), with the Euclidean distance marked between them.]

Bray-Curtis

$1 - \frac{2 \sum_i \min(y_{ij}, y_{ik})}{\sum_i (y_{ij} + y_{ik})} = \frac{\sum_i |y_{ij} - y_{ik}|}{\sum_i (y_{ij} + y_{ik})}$

- where $\sum_i \min(y_{ij}, y_{ik})$ = sum of lesser abundance of each species when it occurs in both sampling units
- note summation over species

$1 - \frac{2(3+6+9)}{9+18+27} = \frac{3+6+9}{9+18+27} = 0.33$
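Both forms of the Bray-Curtis formula can be coded and compared on quadrats j = (3, 6, 9) and k = (6, 12, 18). The function names are illustrative only.

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity: sum|a_i - b_i| / sum(a_i + b_i)."""
    return sum(abs(x - y) for x, y in zip(a, b)) / sum(x + y for x, y in zip(a, b))

def bray_curtis_min_form(a, b):
    """Equivalent form: 1 - 2 * sum(min(a_i, b_i)) / sum(a_i + b_i)."""
    shared = sum(min(x, y) for x, y in zip(a, b))
    return 1 - 2 * shared / sum(x + y for x, y in zip(a, b))

quadrat_j = [3, 6, 9]
quadrat_k = [6, 12, 18]

bc_abs = bray_curtis(quadrat_j, quadrat_k)           # 18 / 54
bc_min = bray_curtis_min_form(quadrat_j, quadrat_k)  # 1 - 36 / 54
print(round(bc_abs, 2), round(bc_min, 2))
```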

Dissimilarities in ecology

• reach maximum value (e.g. 1) when quadrats have no species in common

Quadrat  Sp1  Sp2  Sp3
1        0    3    0
2        2    0    4

Euclidean = 5.4
Bray-Curtis = 1

• equal 0 when quadrats are identical in species abundances

Quadrat  Sp1  Sp2  Sp3
1        2    4    7
2        2    4    7

Euclidean = 0
Bray-Curtis = 0
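The two boundary cases can be reproduced directly. This is a small self-contained sketch with illustrative function names, using the quadrat values from the two examples.

```python
from math import sqrt

def euclidean(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def bray_curtis(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / sum(x + y for x, y in zip(a, b))

# No species in common: Bray-Curtis reaches its maximum of 1
no_overlap = ([0, 3, 0], [2, 0, 4])
# Identical abundances: both indices equal 0
identical = ([2, 4, 7], [2, 4, 7])

for label, (q1, q2) in [("no species in common", no_overlap), ("identical", identical)]:
    print(f"{label}: Euclidean = {euclidean(q1, q2):.1f}, Bray-Curtis = {bray_curtis(q1, q2):.0f}")
```

Bray-Curtis is bounded at 1 whenever no species are shared, whereas the Euclidean value for such a pair depends on the abundances involved.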

Preferred dissimilarity indices

• Species abundance data:
– zeros common
– max. value when quadrats have no species in common
– Bray-Curtis preferred
• Measurement data:
– zeros uncommon
– Euclidean OK

Worked example - Rockfish species at three sites

Rockfish.syd

Rockfish  TerracePt  Hopkins  PtLobos
Blue      60         80       120
Black     10         30       54
Kelp      24         50       80
B&Y       3          8        12
Gopher    3          8        12
Copper    0          4        7
Olive     10         20       26
Tree      0          2        2

Bray-Curtis dissimilarity coefficients

           TERRACEPT  HOPKINS  PTLOBOS
TERRACEPT  0.000
HOPKINS    0.295      0.000
PTLOBOS    0.480      0.216    0.000

Dissimilarities generated and compared

[Figure: matrix of plots with rows and columns labeled TERRACEPT, HOPKINS and PTLOBOS.]
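The Bray-Curtis coefficients for the rockfish counts can be recomputed from the table. This sketch uses only the Python standard library; the dictionary layout and function name are illustrative choices.

```python
from itertools import combinations

# Counts per site, in the order Blue, Black, Kelp, B&Y, Gopher, Copper, Olive, Tree
counts = {
    "TERRACEPT": [60, 10, 24, 3, 3, 0, 10, 0],
    "HOPKINS":   [80, 30, 50, 8, 8, 4, 20, 2],
    "PTLOBOS":   [120, 54, 80, 12, 12, 7, 26, 2],
}

def bray_curtis(a, b):
    """Bray-Curtis dissimilarity: sum|a_i - b_i| / sum(a_i + b_i)."""
    return sum(abs(x - y) for x, y in zip(a, b)) / sum(x + y for x, y in zip(a, b))

# One coefficient per pair of sites
matrix = {
    (s, t): bray_curtis(counts[s], counts[t])
    for s, t in combinations(counts, 2)
}

for (s, t), value in matrix.items():
    print(f"{s:10s} {t:10s} {value:.3f}")
```

The three values agree with the coefficients in the matrix: Hopkins is closer in composition to Pt Lobos than to Terrace Pt, and Terrace Pt and Pt Lobos are the most dissimilar pair.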

Distance Matrix

    A   B   C   D   E
A   -
B   2   -
C   6   5   -
D   10  9   4   -
E   9   8   5   3   -

Cluster Analysis

Average Linkage (UPGMA)

• Unweighted Pair-Group Method of Arithmetic Averaging
• Distance measured using the average distance of a point to a cluster

1) Shortest distance is 2, between A and B

    A   B   C   D   E
A   -
B   2   -
C   6   5   -
D   10  9   4   -
E   9   8   5   3   -

From above, dist(AC) = 6, dist(BC) = 5
In new matrix, group AB is (6 + 5)/2 = 5.5 from C

2) Shortest distance is now 3, between D and E

     A/B  C   D   E
A/B  -
C    5.5  -
D    9.5  4   -
E    8.5  5   3   -

From Step 2, dist(CD) = 4, dist(CE) = 5
In new matrix, group DE is (4 + 5)/2 = 4.5 from C

3)
     A/B  C    D/E
A/B  -
C    5.5  -
D/E  9    4.5  -

4)
       A/B   C/D/E
A/B    -
C/D/E  7.83  -
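The four steps can be reproduced with a short pure-Python sketch. It assumes the distance between two clusters is the mean of all pairwise distances between their members, which gives the same values as the step-by-step averages in this example; function names are illustrative.

```python
from itertools import combinations

# Distance matrix from the slides
dist = {
    ("A", "B"): 2, ("A", "C"): 6, ("A", "D"): 10, ("A", "E"): 9,
    ("B", "C"): 5, ("B", "D"): 9, ("B", "E"): 8,
    ("C", "D"): 4, ("C", "E"): 5,
    ("D", "E"): 3,
}

def point_distance(p, q):
    """Look up a distance regardless of the order of the two labels."""
    return dist[(p, q)] if (p, q) in dist else dist[(q, p)]

def average_linkage(c1, c2):
    """Mean of all pairwise distances between members of two clusters."""
    return sum(point_distance(p, q) for p in c1 for q in c2) / (len(c1) * len(c2))

def upgma(labels):
    """Repeatedly merge the two closest clusters; return (cluster, cluster, distance) per merge."""
    clusters = [(label,) for label in labels]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: average_linkage(*pair))
        merges.append((c1, c2, average_linkage(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges

merges = upgma("ABCDE")
for c1, c2, d in merges:
    print(f"merge {'/'.join(c1)} with {'/'.join(c2)} at distance {d:.2f}")
```

The merge distances (2, 3, 4.5 and 7.83) are the linkage values used to draw the dendrogram.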

Dendrograms

Linkage values can be used to construct a dendrogram

Distance  Groups
0         A, B, C, D, E
2         (A, B), C, D, E
3         (A, B), C, (D, E)
4.5       (A, B), (C, D, E)
7.8       (A, B, C, D, E)

[Figure: dendrogram of A, B, C, D and E against a Distance axis marked 2 to 8.]

Other Linkage Methods

Single Linkage (Nearest Neighbor)

• distance measured to closest point in cluster

Complete Linkage (Furthest Neighbor)

• distance between two clusters defined as the furthest distance between any two points in them
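The three linkage rules differ only in how the pairwise distances between two clusters are summarized. A small sketch, using the distances from C to A and to B from the earlier distance matrix; the function name and `method` strings are illustrative.

```python
# Distances from the earlier distance matrix (only those needed here)
dist = {("A", "C"): 6, ("B", "C"): 5}

def cluster_distance(c1, c2, method):
    """Distance between two clusters under a given linkage rule."""
    pairwise = [dist[(p, q)] for p in c1 for q in c2]
    if method == "single":      # nearest neighbor: closest pair of points
        return min(pairwise)
    if method == "complete":    # furthest neighbor: most distant pair of points
        return max(pairwise)
    if method == "average":     # UPGMA: mean over all pairs
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown method: {method}")

ab, c = ("A", "B"), ("C",)
for method in ("single", "complete", "average"):
    print(method, cluster_distance(ab, c, method))
```

For group A/B and C, single linkage gives 5, complete linkage gives 6, and average linkage gives 5.5.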

Worked Example – compare to PCA

• Use cluster analysis to examine relationships among countries using
– Population in 1983, 1986 and 1990, military spending, Gross National Product, birth rate in 1982, death rate in 1982 (7 total) as variables
• Use average linkage, Euclidean distance

Cases are clustered – usually not informative

[Figure: dendrogram of Case 1 to Case 57 against a Distances axis from 0 to 3000.]

Use ID Variable

[Figure: the same dendrogram with each case labeled city or rural, against a Distances axis from 0 to 3000.]

Cluster analysis compared to PCA

[Figure: the city/rural dendrogram (Distances axis from 0 to 3000) shown with the PCA factor score plot, FACTOR(2) against FACTOR(1), with points labeled by URBAN (rural, city).]
