Multivariate Data Analysis: a survey of data reduction and data association techniques
For example
• Data reduction approaches
– Cluster analysis
– Principal components analysis
– Principal coordinates analysis
– Multidimensional scaling
• Hypothesis testing approaches
– Discriminant analysis
– MANOVA
– ANOSIM
– Canonical correlation
– PERMANOVA
Objects
• Things we wish to compare
– sampling or experimental units
– e.g. quadrats, animals, plants, cages etc.
Variables
• Characteristics measured from each object
– usually continuous variables
– e.g. counts of species, size of body parts etc.
Ecological data
• Objects:
– sampling units (SUs, e.g. quadrats, plots etc.)
• Variables:
– species abundances and/or environmental data
• Common in community ecology
Wisconsin forests (Peet & Loucks 1977)
• Plots (quadrats) in Wisconsin forests
• Number of individuals of each species of tree recorded in each quadrat
• Objects:
– quadrats
• Variables:
– abundances of each tree species
Data
Plot   Bur oak   Black oak   White oak   Red oak   etc.
1      9         8           5           3
2      8         9           4           4
3      3         8           9           0
4      5         7           9           6
5      6         0           7           9
6      0         0           7           8
etc.
Garroch Head dumping ground (Clarke & Ainsworth 1993)
• Sewage sludge dumping ground in bay
• Transect across dumping ground
• Core of mud at each of 10 stations along transect
• Objects:
– stations
• Variables:
– metal concentrations in ppm
Data
Station   Cu    Mn     Co   Ni   Zn    Cd    etc.
1         26    2470   14   34   160   0
2         30    1170   15   32   156   0.2
3         37    394    12   38   182   0.2
4         74    349    12   41   227   0.5
5         115   317    10   37   329   2.2
etc.
Morphological data
• Objects:
– usually organisms or specimens
• Variables:
– morphological measurements
Morphological data
• Morphological variation between dog species/types
• Objects:
– dog types (7)
• Variables:
– sizes of 6 different parts of mandible
– mandible breadth, mandible height, etc.
Data
                  Variable
Dog type          1      2      3      4      5      6
Modern dog        9.7    21.0   19.4   7.7    32.0   36.5
Jackal            8.1    16.7   18.3   7.0    30.3   32.9
Chinese wolf      13.5   27.3   26.8   10.6   41.9   48.1
Indian wolf       11.5   24.3   24.5   9.3    40.0   44.6
Cuon              10.7   23.5   21.4   8.5    28.8   37.6
Dingo             9.6    22.6   21.1   8.3    34.4   43.1
Prehistoric dog   10.3   22.1   19.1   8.1    32.3   25.0
Presentation of Multivariate Data
• Hard to visualize complex multivariate datasets (more than 3 dimensions)
– For example, how do you visualize 7 attributes of a dog skull?
• Easier to visualize relationships between objects (e.g. similarity, dissimilarity, correlation, scaled distance)
Presentation of Multivariate Data
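[Figure: a raw data matrix (objects O1 ... Op as rows, variables V1 ... Vn as columns) is converted into a resemblance matrix (objects O1 ... Op against objects O1 ... Op), created using correlations, covariances or dissimilarity indices. The resemblance matrix is then used for ordination or classification.]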
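Data Standardization
Adjusting of data so that means and/or variances or totals are the same for each variable.
Examples:
– 1) centering + standardizing: $x_i' = \frac{x_i - \bar{x}}{s}$, where $\bar{x}$ is the mean and $s$ the standard deviation of the variable
– 2) rescaling relative to the maximum: $x_i' = \frac{x_i}{x_{max}}$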
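The two standardizations above can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, the sample standard deviation (n - 1 denominator) is an assumption, and the input is the Cu column from the Garroch Head table.

```python
from statistics import mean, stdev


def center_and_standardize(values):
    """Return (x - mean) / s for each x; s is the sample standard deviation (assumed)."""
    m = mean(values)
    s = stdev(values)
    return [(x - m) / s for x in values]


def rescale_to_max(values):
    """Return x / max(values) for each x."""
    top = max(values)
    return [x / top for x in values]


# Cu concentrations (ppm) at stations 1-5 of the Garroch Head example.
cu = [26, 30, 37, 74, 115]
cu_standardized = center_and_standardize(cu)  # mean 0, standard deviation 1
cu_rescaled = rescale_to_max(cu)              # largest value becomes 1
```

After centering and standardizing, every variable has mean 0 and standard deviation 1, so variables measured on very different scales (e.g. Mn and Cd) contribute comparably.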
Principal Components Analysis
• Aims to reduce a large number of variables to a smaller number of summary variables, called Principal Components (or factors), that explain most of the variation in the data.
• Is basically a rotation of axes after centering on the means of the variables; the rotated axes are the Principal Components.
• Is usually carried out using a matrix algebra technique called eigenanalysis.
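Regression
Least squares (OLS) estimation allows best prediction of Y given X (minimizes distance to the line in the y direction)
[Figure: scatterplot of Y against X with the least squares regression line and the means of x and y marked. For a point at x_i, the residual is the vertical gap between the observed y_i and the predicted y_i.]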
PCA: association among variables (minimizes distance to the line in both the x and y directions)
[Figure: scatterplot of Y against X with the means of x and y marked and an observed point (x_i, y_i).]
Comparison
[Figure: the same scatterplot of Y against X drawn twice, once with the regression line (Y on X) and once with Component 1 (Factor 1); the means of x and y are marked on both.]
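A small numeric sketch can make the comparison concrete. The data below are invented for illustration. The OLS slope comes from the usual covariance ratio, and the slope of Component 1 comes from the closed-form leading eigenvector of the 2 x 2 covariance matrix (valid when the covariance is non-zero).

```python
from math import sqrt
from statistics import mean

# Made-up data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]

n = len(x)
mx, my = mean(x), mean(y)
sxx = sum((a - mx) ** 2 for a in x) / (n - 1)                     # variance of x
syy = sum((b - my) ** 2 for b in y) / (n - 1)                     # variance of y
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)    # covariance

# Regression of Y on X: minimizes squared vertical distances to the line.
ols_slope = sxy / sxx

# Component 1: leading eigenvector of [[sxx, sxy], [sxy, syy]];
# minimizes squared perpendicular distances to the line.
lam1 = ((sxx + syy) + sqrt((sxx - syy) ** 2 + 4 * sxy ** 2)) / 2
pc1_slope = (lam1 - sxx) / sxy  # eigenvector is (sxy, lam1 - sxx)

print(ols_slope, pc1_slope)
```

Both lines pass through the point of means, but for these data the regression line is flatter than Component 1, because regression attributes all scatter to Y.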
Rotation
Now rotate axes (rotation is centered on $(\bar{x}, \bar{y})$)
[Figure: Component 1 (Factor 1) and Component 2 (Factor 2) drawn through the scatter of Y against X, then shown again as the new, rotated axes.]
Can be done in N dimensions
[Figure: three component axes labelled PC1, PC2 and PC3.]
Steps in PCA
1) From raw data matrix, calculate correlation matrix, or covariance matrix on standardized variables
Raw data matrix: rows are sites (Site 1, Site 2, Site 3, ...); columns are NO3, Total Organic N, Total N, ...
Correlation matrix:
       NO3    TON    TN
NO3    1
TON    0.37   1
TN     0.84   0.13   1
Steps in PCA
2) Calculate eigenvectors (weightings of each original variable on each component) and eigenvalues (= "latent roots"; relative measures of the variation explained by each component)
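As a rough sketch of what the eigenanalysis produces, the snippet below extracts eigenvalues and eigenvectors from the NO3 / TON / TN correlation matrix shown in step 1. Power iteration with deflation is used here only because it needs nothing beyond the standard library; it is one possible method, not necessarily what any statistics package does, and the function names are invented for this example.

```python
from math import sqrt


def mat_vec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def leading_eigenpair(m, iterations=500):
    """Power iteration: (eigenvalue, unit eigenvector) for the largest eigenvalue."""
    v = [1.0 / (i + 1) for i in range(len(m))]  # arbitrary non-zero start vector
    for _ in range(iterations):
        w = mat_vec(m, v)
        length = sqrt(sum(a * a for a in w))
        v = [a / length for a in w]
    eigenvalue = sum(a * b for a, b in zip(v, mat_vec(m, v)))  # Rayleigh quotient
    return eigenvalue, v


def eigen_decompose(m):
    """All eigenpairs of a symmetric positive-definite matrix, largest first."""
    p = len(m)
    work = [row[:] for row in m]
    pairs = []
    for _ in range(p):
        lam, v = leading_eigenpair(work)
        pairs.append((lam, v))
        # Deflation: remove the component just found so the next one dominates.
        work = [[work[i][j] - lam * v[i] * v[j] for j in range(p)] for i in range(p)]
    return pairs


# Correlation matrix from step 1 (lower triangle mirrored to make it symmetric).
corr = [
    [1.00, 0.37, 0.84],  # NO3
    [0.37, 1.00, 0.13],  # TON
    [0.84, 0.13, 1.00],  # TN
]
pairs = eigen_decompose(corr)
eigenvalues = [lam for lam, _ in pairs]
proportions = [lam / sum(eigenvalues) for lam in eigenvalues]
for (lam, vec), prop in zip(pairs, proportions):
    print(round(lam, 3), [round(c, 3) for c in vec], round(prop, 3))
```

The eigenvalues of a correlation matrix sum to the number of variables (3 here), so each eigenvalue divided by that sum is the proportion of total variation carried by its component.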
Eigenvectors
$z_{ik} = c_1 y_{i1} + c_2 y_{i2} + \dots + c_j y_{ij} + \dots + c_p y_{ip}$
where
– $z_{ik}$ = score for component k for object i
– $y_{ij}$ = value of original variable j for object i
– $c_j$ = factor score coefficient (weight) of variable j for component k
Example: soil chemistry in a forest
$z_{ik} = c_1(\mathrm{NO_3}) + c_2(\text{total organic N}) + c_3(\text{total N}) + \dots$
• the objects are sampling sites
• the variables are chemical measurements, e.g. total N
Steps in PCA - continued
3) Decide how many components to retain (scree plot of eigenvalues)
[Figure: scree plot of eigenvalue (axis 0 to 5) against factor number (1 to 8).]
Steps in PCA
4) Using factor score coefficients, calculate factor scores:
factor score = sum over variables of coefficient x (standardized) variable
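The calculation in this step can be sketched as follows. The site measurements and the coefficients are hypothetical numbers chosen so the arithmetic is easy to follow. They are not results from any real analysis, and the function names are invented for this example.

```python
from statistics import mean, stdev


def standardize_columns(rows):
    """Center each column on its mean and divide by its sample standard deviation."""
    cols = list(zip(*rows))
    means = [mean(c) for c in cols]
    sds = [stdev(c) for c in cols]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]


def factor_scores(rows, coefficients):
    """Score per object: sum of coefficient x standardized variable."""
    z = standardize_columns(rows)
    return [sum(c * v for c, v in zip(coefficients, zrow)) for zrow in z]


# Hypothetical measurements at three sites (columns: NO3, total organic N, total N).
sites = [
    [1.0, 4.0, 10.0],
    [2.0, 6.0, 14.0],
    [3.0, 5.0, 18.0],
]
coefficients = [0.6, 0.3, 0.7]  # hypothetical weights for one component

scores = factor_scores(sites, coefficients)
print(scores)  # one score per site; these are the coordinates plotted in step 5
```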
Steps in PCA
5) Position objects on scatterplot, using factor scores on first two (or three) Principal Components
[Figure: scatterplot of FACTOR(2) against FACTOR(1), with points for Site 1, Site 2 and Site 3.]
What are loadings?
• Correlations of original data and Factors (r's)
– For example the correlation between X and Factor 1
– Correlations range from +1 to -1
– +1 indicates strong positive relationship with no scatter around line
– -1 indicates strong negative relationship with no scatter around line
Interpretation of r (correlation coefficient)
[Figure: scatterplots of an original variable against Factor 1 for r = 1 (r² = 1), r = 0.77 (r² = 0.59), r = 0 (r² = 0), r = -0.77 (r² = 0.59) and r = -1 (r² = 1).]
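Since a loading is just a correlation between an original variable and a factor, it can be illustrated with a plain Pearson correlation. The factor scores and variables below are invented for the example; `pearson_r` is a helper written here, not a library function.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical Factor 1 scores for five objects.
factor1 = [-1.5, -0.5, 0.0, 0.5, 1.5]

rising = [2 * f + 10 for f in factor1]   # exact increasing line: loading +1
falling = [-3 * f + 4 for f in factor1]  # exact decreasing line: loading -1
noisy = [1.0, 3.0, 2.0, 2.5, 4.0]        # positive trend with scatter

loading_rising = pearson_r(factor1, rising)
loading_falling = pearson_r(factor1, falling)
loading_noisy = pearson_r(factor1, noisy)
print(loading_rising, loading_falling, loading_noisy)
```

Squaring a loading gives the share of that variable's variance accounted for by the factor, which is why the plots report r² alongside r.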
Worked example
• Using ourworld
• Variables sampled are Population in 1983, 1986 and 1990, military spending, Gross National Product, birth rate in 1982, death rate in 1982 (7 total)
• Can these variables be reduced into fewer composite factors?
Raw Data
Case   POP83   POP86   POP90      Birth82   Death82   GNP    Mil
1      3.4     3.6     3.500212   20        9         5150   95.83333
2      7.5     7.6     7.644275   12        12        9880   127.2368
Multiply Raw Data by Factor Coefficients to get factor scores
Case 1, Factor 1 = 3.4 (.516) + 3.6 (.564) + 3.5 (.566) + 20 (.114) + 9 (.086) + 5150 (-.130) + 95.83 (-.092)
Case 1, Factor 2 = 3.4 (.141) + 3.6 (.123) + 3.5 (.104) + 20 (-.520) + 9 (-.326) + 5150 (.574) + 95.83 (.495)
Determine how many components (composite factors) to retain
~80% of variance explained by 2 (of 7) components
Using PCA
• Run simple PCA, no rotation
• Examine loadings – correlations between factors and original variables
Rotation - Varimax
Rotated Factor Loading
Variable    Factor 1     Factor 2
Pop_1983     0.994571     0.028407
Pop_1986     0.997697    -0.001704
Pop_1990     0.998593    -0.030743
Pop_2020     0.962739    -0.187789
Birth_82     0.048807    -0.839114
Death_82     0.038237    -0.529927
Gnp_82      -0.043234     0.922798
Mil         -0.004336     0.792275
[Figure: factor loadings plot, Factor 2 (31.3 %) against Factor 1 (48.9 %), with points labelled Pop_2020, Birth_82, Death_82, Mil and Gnp_82.]
PCA - ourworld
• What have we found out?
– The seven examined variables can be reduced to 2 and still retain ~80% of original information
• What we have not found out
– Any relationships with predictor variables
• Remember PCA is a data reduction technique, NOT a hypothesis testing technique
• Can it be used to examine hypotheses?
– Overlay predictor groups on Factor Plots
– For example, is there a relationship between the Factor scores and Urban (Rural, City) or Group (Europe, Islamic or New World)?
[Figure: FACTOR(2) against FACTOR(1) scores, points coded by GROUP (New World, Islamic, Europe).]
Any contribution of Factor 1?
[Figure: FACTOR(2) against FACTOR(1) scores, points coded by URBAN (rural, city).]
Any contribution of Factor 1?
PCA Regression
• What do you do if a multiple regression analysis indicates collinearity of predictor variables?
• For example, the relationship between a metric of Urbanization and Population in 1983, 1986 and 1990, military spending, Gross National Product, birth rate in 1982, death rate in 1982
• Perhaps PCA Regression – Factors are independent
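The idea can be sketched with just two collinear predictors. All numbers below are invented, and the variable names only echo the example. For two standardized variables, the principal components of their correlation matrix lie along (1, 1)/√2 and (1, -1)/√2, so the component scores can be written down directly and are uncorrelated even though the original predictors are almost perfectly correlated.

```python
from math import sqrt
from statistics import mean, stdev


def standardize(values):
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def pearson_r(x, y):
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))


def slope(pc, y):
    """Least squares slope of y on a zero-mean component."""
    my = mean(y)
    return sum(p * (b - my) for p, b in zip(pc, y)) / sum(p * p for p in pc)


# Hypothetical data: two nearly collinear predictors and a response.
pop_1983 = [3.4, 7.5, 12.0, 20.0, 33.0]
pop_1990 = [3.5, 7.6, 13.5, 21.0, 37.0]
urban_metric = [20.0, 35.0, 40.0, 62.0, 80.0]

z1, z2 = standardize(pop_1983), standardize(pop_1990)
pc1 = [(a + b) / sqrt(2) for a, b in zip(z1, z2)]  # shared "population size" axis
pc2 = [(a - b) / sqrt(2) for a, b in zip(z1, z2)]  # contrast between the two years

# Because the components are uncorrelated, each slope can be estimated separately.
b1, b2 = slope(pc1, urban_metric), slope(pc2, urban_metric)
fitted = [mean(urban_metric) + b1 * p + b2 * q for p, q in zip(pc1, pc2)]

print(pearson_r(pop_1983, pop_1990))  # close to 1: collinear predictors
print(pearson_r(pc1, pc2))            # effectively 0: independent components
```

With more than two predictors the components would come from an eigenanalysis rather than this shortcut, but the regression step on the saved component scores is the same.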
Results of multiple regression
[Figure: scatterplot matrix of the predictors Pop_1983, Pop_1986, Pop_1990, Birth_82, Death_82, Gnp_82 and Mil.]
Save Principal Components
Results of PCA regression
[Figure: Urban Metric plotted against Factor 2, shown beside the rotated factor loading table; the Factor 2 axis is annotated with "Birth 82, Death 82" and "GNP 82, Mil".]
Dissimilarity Indices
• Dissimilarity indices:
– measure how different objects are in terms of their variable values
– how different sampling units are in species composition
– how different organisms are in morphological structure
Dissimilarity Indices
• Dissimilarity:
– calculated for each pair of objects in data set
– dissimilarity between 2 quadrats in terms of species composition
– dissimilarity between 2 dogs in terms of morphological structure
Dissimilarity
• Consider 2 objects j and k (e.g. 2 quadrats)
• Let $y_{ij}$ and $y_{ik}$ be values for variable i in objects j and k (i = 1 to 3):
Quadrat   Sp1   Sp2   Sp3
j         3     6     9
k         6     12    18
• For sp1, $y_{1j} = 3$ and $y_{1k} = 6$
• For sp2, $y_{2j} = 6$ and $y_{2k} = 12$
• For sp3, $y_{3j} = 9$ and $y_{3k} = 18$
Euclidean Distance
$\sqrt{\sum_i (y_{ij} - y_{ik})^2}$
$= \sqrt{(3-6)^2 + (6-12)^2 + (9-18)^2}$
$= 11.2$
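The same calculation as a short Python check, using the two quadrats from the table (the function name is just for this example):

```python
from math import sqrt


def euclidean(a, b):
    """Euclidean distance: square root of the summed squared differences."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


quadrat_j = [3, 6, 9]    # Sp1, Sp2, Sp3
quadrat_k = [6, 12, 18]

distance = euclidean(quadrat_j, quadrat_k)
print(round(distance, 1))  # 11.2
```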
Euclidean Distance
• Distance between objects when plotted in multidimensional (multivariable) space
[Figure: Quadrat 1 and Quadrat 2 plotted by abundance of species 1 (x-axis, 0 to 100) and abundance of species 2 (y-axis, 0 to 100); the Euclidean distance is the straight line joining them.]
Bray-Curtis
$1 - \frac{2\sum_i \min(y_{ij}, y_{ik})}{\sum_i (y_{ij} + y_{ik})} = \frac{\sum_i |y_{ij} - y_{ik}|}{\sum_i (y_{ij} + y_{ik})}$
- where $\sum_i \min(y_{ij}, y_{ik})$ = sum of lesser abundance of each species when it occurs in both sampling units
- note summation over species
$1 - (2)(3+6+9)/(9+18+27) = 0.33$
$(3+6+9)/(9+18+27) = 0.33$
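Both forms of the index can be checked against each other in a few lines. The function names are made up for this sketch; the quadrats are the j and k example used above.

```python
def bray_curtis_min(a, b):
    """Bray-Curtis as 1 - 2 * (sum of lesser abundances) / (sum of all abundances)."""
    shared = sum(min(x, y) for x, y in zip(a, b))
    total = sum(x + y for x, y in zip(a, b))
    return 1 - 2 * shared / total


def bray_curtis_abs(a, b):
    """Equivalent form: (sum of absolute differences) / (sum of all abundances)."""
    return sum(abs(x - y) for x, y in zip(a, b)) / sum(x + y for x, y in zip(a, b))


quadrat_j = [3, 6, 9]
quadrat_k = [6, 12, 18]

d_min = bray_curtis_min(quadrat_j, quadrat_k)
d_abs = bray_curtis_abs(quadrat_j, quadrat_k)
print(round(d_min, 2), round(d_abs, 2))  # 0.33 0.33
```

The two forms agree because, for each species, the larger abundance minus the smaller equals the sum of the two minus twice the smaller.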
Dissimilarities in ecology
• reach maximum value (e.g. 1) when quadrats have no species in common
Quadrat   Sp1   Sp2   Sp3
1         0     3     0
2         2     0     4
Euclidean = 5.4
Bray-Curtis = 1
• equal 0 when quadrats are identical in species abundances
Quadrat   Sp1   Sp2   Sp3
1         2     4     7
2         2     4     7
Euclidean = 0
Bray-Curtis = 0
Preferred dissimilarity indices
• Species abundance data:
– zeros common
– max. value when quadrats have no species in common
– Bray-Curtis preferred
• Measurement data:
– zeros uncommon
– Euclidean OK
Worked example - Rockfish species at three sites
Rockfish.syd
Rockfish   TerracePt   Hopkins   PtLobos
Blue       60          80        120
Black      10          30        54
Kelp       24          50        80
B&Y        3           8         12
Gopher     3           8         12
Copper     0           4         7
Olive      10          20        26
Tree       0           2         2
Bray-Curtis dissimilarity coefficients
TERRACEPT HOPKINS PTLOBOS
TERRACEPT 0.000
HOPKINS 0.295 0.000
PTLOBOS 0.480 0.216 0.000
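The matrix above can be reproduced from the rockfish counts with a short script. This is a sketch in plain Python rather than the software used for the slides; the dictionary layout and function name are choices made for this example.

```python
from itertools import combinations


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity: summed absolute differences over summed abundances."""
    return sum(abs(x - y) for x, y in zip(a, b)) / sum(x + y for x, y in zip(a, b))


# Counts in the order Blue, Black, Kelp, B&Y, Gopher, Copper, Olive, Tree.
sites = {
    "TERRACEPT": [60, 10, 24, 3, 3, 0, 10, 0],
    "HOPKINS": [80, 30, 50, 8, 8, 4, 20, 2],
    "PTLOBOS": [120, 54, 80, 12, 12, 7, 26, 2],
}

dissimilarity = {
    (a, b): bray_curtis(sites[a], sites[b]) for a, b in combinations(sites, 2)
}
for (a, b), d in dissimilarity.items():
    print(f"{a} vs {b}: {d:.3f}")
```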
Dissimilarities generated and compared
[Figure: the Bray-Curtis dissimilarities displayed as a matrix plot with TERRACEPT, HOPKINS and PTLOBOS on both axes.]
Distance Matrix
     A    B    C    D    E
A    -
B    2    -
C    6    5    -
D    10   9    4    -
E    9    8    5    3    -
Cluster Analysis
Average Linkage (UPGMA)
• Unweighted Pair-Group Method of Arithmetic Averaging
• Distance measured using the average distance of a point to a cluster
1) Shortest distance is 2, between A and B
     A    B    C    D    E
A    -
B    2    -
C    6    5    -
D    10   9    4    -
E    9    8    5    3    -
2) From above, dist(AC) = 6 and dist(BC) = 5. In new matrix, group AB is (6 + 5)/2 from C
Shortest distance is now 3, between D and E
       A/B   C    D    E
A/B    -
C      5.5   -
D      9.5   4    -
E      8.5   5    3    -
3) From Step 2, dist(CD) = 4 and dist(CE) = 5. In new matrix, group DE is (4 + 5)/2 from C
       A/B   C     D/E
A/B    -
C      5.5   -
D/E    9     4.5   -
4)
         A/B    C/D/E
A/B      -
C/D/E    7.83   -
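The merging sequence worked through above can be reproduced with a compact average-linkage routine. This is a minimal sketch for small inputs, not an optimized implementation: it recomputes every between-cluster average from the original distance matrix at each step, and the function name and data layout are choices made for this example.

```python
from itertools import combinations


def upgma(labels, dist):
    """Average linkage clustering; returns a list of (merged cluster, height)."""
    clusters = [(label,) for label in labels]

    def average_distance(c1, c2):
        # Mean of all pairwise distances between members of the two clusters.
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: average_distance(*pair))
        merges.append((c1 + c2, average_distance(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


# Distance matrix from the example above.
pairwise = {
    ("A", "B"): 2, ("A", "C"): 6, ("A", "D"): 10, ("A", "E"): 9,
    ("B", "C"): 5, ("B", "D"): 9, ("B", "E"): 8,
    ("C", "D"): 4, ("C", "E"): 5,
    ("D", "E"): 3,
}
dist = {frozenset(pair): d for pair, d in pairwise.items()}

merges = upgma(["A", "B", "C", "D", "E"], dist)
heights = [height for _, height in merges]
for members, height in merges:
    print(members, round(height, 2))
```

The merge heights come out as 2, 3, 4.5 and 7.83, matching the four steps; the last value is the mean of the six distances between {A, B} and {C, D, E}.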
Dendrograms
Linkage values can be used to construct a dendrogram
[Figure: dendrogram of A, B, C, D and E against Distance (axis ticks at 2, 4, 6 and 8).]
Distance   Groups
0          A, B, C, D, E
2          (A, B), C, D, E
3          (A, B), C, (D, E)
4.5        (A, B), (C, D, E)
7.8        (A, B, C, D, E)
Other Linkage Methods
Single Linkage (Nearest Neighbor)
• distance measured to closest point in cluster
Complete Linkage (Furthest Neighbor)
• distance between two clusters defined as the furthest distance between any two points in them
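To see how the linkage rules differ, the sketch below measures the distance from cluster {A, B} to C under each rule, using values from the distance matrix in the UPGMA example. The function name and the `method` strings are invented for this illustration.

```python
def cluster_distance(c1, c2, dist, method):
    """Distance between two clusters under a given linkage rule."""
    pairwise = [dist[frozenset((a, b))] for a in c1 for b in c2]
    if method == "single":      # nearest neighbor
        return min(pairwise)
    if method == "complete":    # furthest neighbor
        return max(pairwise)
    if method == "average":     # UPGMA
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown method: {method}")


# dist(AC) = 6 and dist(BC) = 5, from the distance matrix.
dist = {frozenset(("A", "C")): 6, frozenset(("B", "C")): 5}

single = cluster_distance(("A", "B"), ("C",), dist, "single")
complete = cluster_distance(("A", "B"), ("C",), dist, "complete")
average = cluster_distance(("A", "B"), ("C",), dist, "average")
print(single, complete, average)  # 5 6 5.5
```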
Worked Example – compare to PCA
• Use cluster analysis to examine relationships among countries using
– Population in 1983, 1986 and 1990, military spending, Gross National Product, birth rate in 1982, death rate in 1982 (7 total) as variables
• Use average linkage, Euclidean distance
Cases are clustered – usually not informative
[Figure: dendrogram of Case 1 to Case 57 against Distances (0 to 3000).]
Use ID Variable
[Figure: the same dendrogram against Distances (0 to 3000), with each case labelled by URBAN (city or rural).]
Cluster analysis compared to PCA
[Figure: the city/rural dendrogram shown beside the plot of FACTOR(2) against FACTOR(1) scores coded by URBAN (rural, city).]