Introduction to Multivariate Analysis
Biology 4605/7220
Chih-Lin Wei
Canadian Healthy Oceans Network Postdoc Fellow
Ocean Science Centre, MUN
My Background
• Benthic ecologist: community ecology
• How environments control macroecological patterns in the deep sea
• Interested in R but “NOT a statistician”
• Education: BS in Zoology in Taiwan; MS & PhD in Biological Oceanography, Texas A&M University
• Current project: Scaling up regional benthic diversity and standing stock patterns using ecological modeling approaches
Lecture Contents
• Visualization
• Resemblance index
• Cluster analysis
• Ordination
• Correlation
• Testing for difference
• Other stuff
Clarke & Warwick (2001)
Front Matter
• Mostly non-parametric, permutation-based techniques
• Start with graphical concepts
• Followed by examples in simple R code
• No more than 3 lines of code for each example
• Most functions are in base R or the package “vegan”
• All analyses are also available in commercial software (PRIMER-E) [demo version]
R packages
# Install and load R packages
install.packages( c("vegan", "scatterplot3d", "reshape2", "lattice", "clustsig") )
library( vegan )
library( scatterplot3d )
library( reshape2 )
library( lattice )
library( clustsig )
First things first: plot the data
[Figure: scatter plot of Assault against Murder]
# Violent Crime Rates by US State
USArrests
plot( USArrests[,1:2] )
3D Scatter Plot
scatterplot3d( USArrests[,1:3] )
[Figure: 3D scatter plot of Murder, Assault and UrbanPop]
Scatterplot Matrices
pairs( USArrests )
[Figure: scatterplot matrix of Murder, Assault, UrbanPop and Rape]
Lattice Graphs
# Melt data frame to flat (long) format
m = melt( USArrests, id.vars = "Assault" )
m
# Multipanel scatter plot
xyplot( value ~ Assault | variable, data = m )
[Figure: multipanel scatter plots of value against Assault; panels: Murder, UrbanPop, Rape]
Resemblance/distance Indices
• Euclidean distance
Clarke & Warwick (2001)
*Not good for data with many zeros (e.g. species abundance)
Resemblance/distance Indices
• Bray-Curtis dissimilarity (D)
• D = 0 if the two samples have identical species abundances
• D = 1 if the two samples have no species in common
• Better for species abundance data (with many zeros)
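The two properties above can be checked by hand. The sketch below assumes the usual form of the Bray-Curtis dissimilarity, D = Σ|x − y| / Σ(x + y); `bray_curtis()` and `euclid()` are hypothetical helpers written for this illustration, not functions from “vegan”, and the abundance vectors are made up.

```r
# Hypothetical helpers (illustration only, not part of vegan)
bray_curtis <- function(x, y) sum(abs(x - y)) / sum(x + y)
euclid      <- function(x, y) sqrt(sum((x - y)^2))

# Made-up abundances of four species in four samples
p <- c(1, 0, 0, 0)    # p and q share no species
q <- c(0, 1, 0, 0)
u <- c(10, 5, 0, 0)   # u and v share both of their species
v <- c(20, 10, 0, 0)

bray_curtis(p, p)   # 0: identical samples
bray_curtis(p, q)   # 1: no species in common
bray_curtis(u, v)   # 15 / 45 = 0.333...

# Euclidean distance ranks p and q (nothing shared) as closer than u and v
euclid(p, q)        # sqrt(2), about 1.41
euclid(u, v)        # sqrt(125), about 11.18
```

Euclidean distance treats the shared zeros as agreement, which is why it is less suited to sparse abundance data.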
Resemblance/distance Indices
# Euclidean Distance:
dist( USArrests )
# Bray-Curtis Dissimilarity
# Vegetation in lichen pastures
data( varespec )
varespec
vegdist( varespec )
[Figure: example cluster dendrogram of four samples (1, 3, 2, 4); y-axis: Dissimilarity]
Hierarchical Clustering
• Patterns in a distance or dissimilarity matrix are difficult to detect.
• Find natural groupings by successively fusing samples.
Hierarchical Clustering
Linkage Options:
• Single linkage (nearest neighbour clustering)
• Complete linkage (furthest neighbour clustering)
• Group-average linkage
• Ward’s minimum variance
[Diagram: Group 1 and Group 2 plotted on Sp 1 and Sp 2 axes, showing the single-link and complete-link distances between the groups]
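A minimal sketch of how the first three linkage options measure the distance between two groups. The two groups and their coordinates are made up for illustration; Ward's method uses a variance criterion and is not shown.

```r
# Two small hypothetical groups of samples, each measured on two species
g1 <- rbind(c(1, 1), c(2, 1))
g2 <- rbind(c(4, 1), c(6, 1))

# Euclidean distances between every member of g1 (rows) and g2 (columns)
between <- as.matrix(dist(rbind(g1, g2)))[1:2, 3:4]

min(between)    # single linkage: nearest neighbours      -> 2
max(between)    # complete linkage: furthest neighbours   -> 5
mean(between)   # group-average linkage: mean of all pairs -> 3.5
```

At each fusion step, hclust() merges the pair of groups for which the chosen measure is smallest.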
[Figure: Single Linkage dendrogram of the US states; y-axis: Dissimilarity]
[Figure: Complete Linkage dendrogram of the US states; y-axis: Dissimilarity]
[Figure: Group-Average Linkage dendrogram of the US states; y-axis: Dissimilarity]
[Figure: Ward's Minimum Variance dendrogram of the US states; y-axis: Height]
Hierarchical Clustering
# Normalization
arrest = scale( USArrests, center = FALSE )
# Euclidean Distance
d = dist( arrest )
# Dendrograms
plot( hclust( d, "single" ) )
plot( hclust( d, "complete" ) )
plot( hclust( d, "average" ) )
# "ward" was renamed "ward.D" in R 3.1.0
plot( hclust( d, "ward.D" ) )
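What the normalization step does may not be obvious: with center = FALSE, scale() divides each column by its root mean square rather than by its standard deviation, so the variables become comparable in magnitude without being centred. A minimal check, using only base R (the names `rms` and `manual` are introduced here for illustration):

```r
x <- USArrests

# Root mean square of each column, as defined in ?scale: sqrt( sum(x^2) / (n - 1) )
rms <- sqrt(colSums(x^2) / (nrow(x) - 1))

# Divide every column by its root mean square
manual <- sweep(as.matrix(x), 2, rms, "/")

# Same values as the one-liner used on the slide
arrest <- scale(x, center = FALSE)
all.equal(manual, arrest, check.attributes = FALSE)   # TRUE
```

Without this step Assault, which has by far the largest values, would dominate the Euclidean distances.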
[Figures: two Ward's-method cluster dendrograms of the US states; y-axis: Height]
Determine Numbers of Clusters
# Using Ward's method
clus = hclust( d, "ward.D" )
plot( clus )
# Cut into 3 groups
rect.hclust( clus, k = 3 )
K = 3
K = 6
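rect.hclust() only draws the boxes. To get the group membership itself, cutree() can be used on the same tree; a short sketch that rebuilds the objects from the earlier slides so it runs on its own (`groups3` and `groups6` are names chosen here):

```r
# Rebuild the normalized data and the Ward dendrogram
arrest <- scale(USArrests, center = FALSE)
clus <- hclust(dist(arrest), "ward.D")

# Membership for 3 and for 6 clusters
groups3 <- cutree(clus, k = 3)
groups6 <- cutree(clus, k = 6)

table(groups3)    # number of states in each of the 3 groups
head(groups3)     # named vector: state -> group number

# Cross-tabulation shows how the 6 groups nest inside the 3 groups
table(groups3, groups6)
```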
Determine Significant Clusters
Clarke et al. (2008, JEMBE 366:56-69)
Similarity Profile Test
# 999 permutations
# Group-average clustering
# alpha = 0.05
clus2 = simprof( arrest )
simprof.plot( clus2 )
* Colors = significant clusters
[Figure: group-average dendrogram of the US states with significant SIMPROF clusters shown in colour]
Motivations for Ordination
• Dendrograms are still difficult to interpret
• Clustering forces samples into groups even though the compositional changes may be continuous.
• Ordination reduces the dimensionality of multivariate data (the “data cloud”, so to speak)
• Preferably, most of the information is captured in two dimensions, so the multivariate patterns can be shown on a scatter plot.
Principal Component Analysis (PCA)
Clarke & Warwick (2001)
2 species example
Principal Component Analysis (PCA)
• PC1 maximizes variance of points projected on it.
• PC2 is perpendicular to PC1
• PC3 is perpendicular to PC1 and PC2
• New orthogonal axes are linear combinations of the original variables:
PC1 = 0.62 Sp1 + 0.52 Sp2 + 0.58 Sp3
PC2 = -0.73 Sp1 + 0.65 Sp2 + 0.2 Sp3
PC3 = 0.28 Sp1 + 0.55 Sp2 - 0.79 Sp3
Clarke & Warwick (2001)
3 species example
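The "linear combination" statement can be verified directly: multiplying the centred data by the loadings reproduces the PCA scores, and the loading vectors are mutually orthogonal. A sketch on three USArrests variables, standing in for the three species (this is not the Clarke & Warwick example; `centred` and `scores` are names chosen here):

```r
x <- as.matrix(USArrests[, c("Murder", "Assault", "Rape")])
pca <- princomp(x)

# Centre each variable, then take linear combinations using the loadings
centred <- sweep(x, 2, colMeans(x))
scores <- centred %*% unclass(pca$loadings)

# Same numbers as the scores returned by princomp()
all.equal(scores, pca$scores, check.attributes = FALSE)   # TRUE

# Loadings are orthonormal: their cross-product is the identity matrix
round(crossprod(unclass(pca$loadings)), 10)
```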
Principal Component Analysis (PCA)
# PCA
pca = princomp( arrest )
# New orthogonal axes
pairs( pca$scores )
[Figure: scatterplot matrix of PCA scores, Comp.1 to Comp.4]
[Figure: bar plot of the variances of Comp.1 to Comp.4]
Principal Component Analysis (PCA)
# Variable contributions
# PC1 = -0.65 Murder -0.6 Assault -0.46 Rape
pca$loadings
# Variance of PC axes
plot( pca )
# Total variance explained
summary( pca )
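The "proportion of variance" that summary() reports can be computed by hand from the component standard deviations; a short sketch (the names `var_pc` and `prop` are chosen here):

```r
arrest <- scale(USArrests, center = FALSE)
pca <- princomp(arrest)

# Variance of each PC axis = squared standard deviation
var_pc <- pca$sdev^2

# Proportion and cumulative proportion of total variance
prop <- var_pc / sum(var_pc)
round(prop, 3)
round(cumsum(prop), 3)
```

The cumulative proportion for the first two components indicates how much of the pattern a 2-D scatter plot of Comp.1 against Comp.2 retains.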
Principal Component Analysis (PCA)
[Figure: PCA ordination of the US states, Comp.1 against Comp.2, labelled by state name]
# Cut dendrogram into 6 clusters
group = cutree( clus, 6 )
plot( pca$scores, type = "n" )
text( pca$scores, names( group ), col = group )
[Figure: PCA biplot of the US states on Comp.1 and Comp.2, with arrows for Murder, Assault, UrbanPop and Rape]
Principal Component Analysis (PCA)
# Add variable contributions
biplot( pca, scale = 0 )
Non-Metric Multidimensional Scaling (nMDS)
• Ordination based on a ranked resemblance (or distance) matrix
• Robust and flexible for all kinds of resemblance indices
• Uses an iterative procedure to successively refine the locations of ordination points according to the ranked dissimilarities of samples
• Better choice than PCA for species abundance data
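The "ranked dissimilarities" idea can be seen in a small sketch. It uses isoMDS() from the “MASS” package (shipped with standard R installations) as a stand-in for metaMDS(), and Euclidean distance rather than metaMDS()'s default Bray-Curtis, so the numbers will differ from the slides.

```r
library(MASS)   # isoMDS(): Kruskal's non-metric MDS (stand-in for vegan::metaMDS)

arrest <- scale(USArrests, center = FALSE)
d <- dist(arrest)                       # observed dissimilarities

# Iteratively place the 50 states in 2 dimensions
fit <- isoMDS(d, k = 2, trace = FALSE)

# nMDS tries to preserve the rank order of dissimilarities,
# so compare ranks (Spearman) of observed and ordination distances
ord_dist <- dist(fit$points)
cor(as.vector(d), as.vector(ord_dist), method = "spearman")

fit$stress   # Kruskal's stress, in percent; lower is better
```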
[Figure: stress plot of Ordination Distance against Observed Dissimilarity; non-metric fit R2 = 0.995, linear fit R2 = 0.98]
Multidimensional Scaling (nMDS)
mds = metaMDS( arrest )
stressplot( mds )
Multidimensional Scaling (nMDS)
[Figure: nMDS ordination of the US states, MDS1 against MDS2, labelled by state name and coloured by cluster]
[Figure: nMDS biplot with variable scores for Murder, Assault, UrbanPop and Rape]
# Ordination with 6 clusters
plot( mds$points, type = "n" )
text( mds$points, names( group ), pch = group, col = group )
# Add variable scores
# Weighted average
biplot( mds$points, mds$species )
[Figure: scatter plot of veg.dist against env.dist]
Correlation between Matrices
# Vegetation and environment in lichen pastures
data( varespec )
data( varechem )
# Bray-Curtis Dissimilarity
veg.dist = vegdist( varespec )
# Euclidean distance
env.dist = dist( scale( varechem ) )
Mantel Test
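[Diagram: the sites × species matrix is converted to a sites × sites Bray-Curtis (BC) matrix and the sites × environment matrix to a sites × sites Euclidean distance (ED) matrix; both are ranked (1, 2, 3, ...) and then correlated (ρ)]
[Figure: histogram of permuted Pearson correlation (r); y-axis: Frequency; observed r = 0.3]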
# Mantel test
# Based on 999 permutations
# Pearson's correlation
man = mantel( veg.dist, env.dist )
man
# Distribution of permuted r
hist( man$perm )
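What the permutation does can be sketched in base R. `mantel_perm()` below is a hypothetical helper written for this illustration, not vegan's implementation, and it is run on two distance matrices built from USArrests (crime rates versus urban population) because the lichen data need “vegan”.

```r
# Hypothetical helper: Mantel test by permuting the sample order of one matrix
mantel_perm <- function(d1, d2, nperm = 999) {
  m1 <- as.matrix(d1)
  m2 <- as.matrix(d2)
  lower <- lower.tri(m1)                    # each pair of samples counted once
  r_obs <- cor(m1[lower], m2[lower])        # observed Pearson correlation
  n <- nrow(m1)
  r_perm <- replicate(nperm, {
    i <- sample(n)                          # shuffle sample labels
    cor(m1[i, i][lower], m2[lower])         # rows and columns permuted together
  })
  # One-sided p-value: how often a permuted r is at least as large as observed
  p <- (sum(r_perm >= r_obs) + 1) / (nperm + 1)
  list(r = r_obs, p = p, perm = r_perm)
}

set.seed(1)
crime <- dist(scale(USArrests[, c("Murder", "Assault", "Rape")]))
urban <- dist(USArrests[, "UrbanPop"])
res <- mantel_perm(crime, urban)
res$r
res$p
```

hist( res$perm ) gives the same kind of null distribution as the histogram on the slide.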
Mantel Test
Best Environmental Subsets
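[Diagram: the sites × species matrix is converted to a sites × sites Bray-Curtis (BC) matrix and the sites × environment matrix to a sites × sites Euclidean distance (ED) matrix; both are ranked (1, 2, 3, ...) and then correlated (ρ)]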
BIOENV
bioenv( varespec, varechem )
# 16383 possible subsets
# Subset of environmental variables with best correlation to community data
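Where the 16383 comes from: it is the number of non-empty subsets of the environmental variables, assuming varechem has 14 columns (which is what 16383 = 2^14 − 1 implies). A quick check in base R; the five variable names are taken from the axis label of the slide's plot.

```r
n_vars <- 14   # assumed number of environmental variables in varechem

# Subsets of size 1, 2, ..., 14 added together
sum(choose(n_vars, 1:n_vars))   # 16383
2^n_vars - 1                    # 16383, the same count

# Enumerating subsets of one size, e.g. all pairs of five variables
vars <- c("N", "P", "Al", "Mn", "Baresoil")
combn(vars, 2)                  # 10 pairs, one per column
```

The number of subsets doubles with every added variable, so an exhaustive search becomes slow for large environmental tables.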
[Figure: scatter plot of veg.dist against env.dist (N + P + Al + Mn + Baresoil)]
Testing Group Difference for Community Data
data( dune ) # Vegetation in Dutch Dune Meadows
dune
# More species (variables) than samples
# Dominance of zero values
# Violates multivariate normality and constant variance across the groups
# A robust, permutation-based test is needed for community data.
Analysis of Similarity (ANOSIM)
• R = 1: Samples within groups are more similar to each other than to any samples from other groups
• R = 0: Similarities between and within groups are the same on average
• R is an absolute measure of group separation
[Diagram: sites × species matrix → sites × sites Bray-Curtis (BC) matrix → ranks 1, 2, 3, ...]
$$R = \frac{r_B - r_W}{n(n-1)/4}$$
$r_B$ = Avg. rank between groups
$r_W$ = Avg. rank within groups
$n$ = sample size
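The R statistic is simple enough to compute by hand. `anosim_r()` below is a hypothetical base-R helper written for this illustration (not vegan's anosim()), applied to made-up one-dimensional data with two clearly separated groups.

```r
# Hypothetical helper: ANOSIM R = (rB - rW) / (n (n - 1) / 4)
anosim_r <- function(d, grouping) {
  n <- attr(d, "Size")
  r <- rank(as.vector(d))                   # rank all pairwise dissimilarities
  same <- outer(grouping, grouping, "==")   # TRUE where both samples share a group
  within <- same[lower.tri(same)]           # same pair order as the dist object
  (mean(r[!within]) - mean(r[within])) / (n * (n - 1) / 4)
}

# Made-up data: two tight groups far apart
x <- c(1, 2, 3, 101, 102, 103)
g <- c(1, 1, 1, 2, 2, 2)
r_obs <- anosim_r(dist(x), g)
r_obs   # 1: every within-group distance is smaller than every between-group distance

# Permutation test: shuffle the group labels and recompute R
set.seed(1)
r_perm <- replicate(999, anosim_r(dist(x), sample(g)))
p <- (sum(r_perm >= r_obs) + 1) / (999 + 1)
p
```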
Analysis of Similarity (ANOSIM)
# Environment factors in Dutch Dune Meadows
data( dune.env )
# Does moisture have an effect on vegetation?
Moisture = as.numeric( dune.env$Moisture )
# Run an MDS on dune vegetation
mds = metaMDS( dune )
# MDS plot seems to suggest a moisture effect
plot( mds$points, pch = 21, bg = Moisture, cex = Moisture )
[Figure: nMDS plot “Vegetation in Dutch Dune Meadows”, MDS1 against MDS2, points coloured and sized by Moisture level 1 to 4]
Analysis of Similarity (ANOSIM)
aos = anosim( dune, Moisture )
aos
# Distribution of permuted R
hist( aos$perm )
[Figure: histogram of permuted ANOSIM R statistics; y-axis: Frequency; observed R = 0.43]
Other Useful Functions
Clustering:
• pam() for clustering around medoids and clara() for clustering large data (both in “cluster”)
• pvclust() in “pvclust” for assessing the uncertainty in hierarchical cluster analysis
Ordination:
• Great PCA video explanation on YouTube
• imputePCA() in “missMDA” for handling missing data
• cca() and rda() in “vegan” for constrained ordinations
Testing difference:
• mrpp() in “vegan” for ANOSIM-type analysis that uses the original dissimilarities instead of their ranks
• adonis() in “vegan” (adonis2() in recent versions) for robust and flexible multivariate permutational analysis of variance (e.g. factorial & nested designs, mixed models, etc.)
• betadisper() in “vegan” for testing constant multivariate variance (or dispersion)