Introduction to Multivariate Analysis, Biology 4605/7220, Chih-Lin Wei, Canadian Healthy Oceans Network Postdoc Fellow, Ocean Science Centre, MUN

Introduction to Multivariate Analysis

Biology 4605/7220

Chih-Lin Wei

Canadian Healthy Oceans Network Postdoc Fellow

Ocean Science Centre, MUN

My Background

• Benthic ecologist: community ecology

• How environments control macroecological patterns in the deep sea

• Interested in R but “NOT a statistician”

• Education: BS in Zoology in Taiwan; MS & PhD in Biological Oceanography, Texas A&M University

• Current project: Scaling up regional benthic diversity and standing-stock patterns using ecological modeling approaches

Lecture Contents

• Visualization

• Resemblance index

• Cluster analysis

• Ordination

• Correlation

• Testing for difference

• Other stuff

Clarke & Warwick (2001)

Front Matter

• Mostly non-parametric, permutation-based techniques

• Start with graphical concepts

• Followed by examples in simple R code

• No more than 3 lines of code for each example

• Most functions in base R or package “vegan”

• All analyses are also available in commercial software (PRIMER-E) [demo version]

http://www.primer-e.com/demo/demo.htm

R packages

# Install and load R packages
install.packages( c("vegan", "scatterplot3d", "reshape2", "lattice", "clustsig") )
library( vegan )
library( scatterplot3d )
library( reshape2 )
library( lattice )
library( clustsig )

First things first: plot the data

[Figure: scatter plot of USArrests; x-axis: Murder, y-axis: Assault]

# Violent Crime Rates by US State

USArrests

plot( USArrests[,1:2] )

3D Scatter Plot

scatterplot3d( USArrests[,1:3] )

[Figure: 3D scatter plot of USArrests; axes: Murder, Assault, UrbanPop]

Scatterplot Matrices

pairs( USArrests )

[Figure: scatterplot matrix of Murder, Assault, UrbanPop, and Rape]

Lattice Graphs

# Melt dataframe to flat format
m = melt( USArrests, id.vars = "Assault" )
m

# Multipanel scatter plot
xyplot( value ~ Assault | variable, data = m )

[Figure: lattice scatter plots of value against Assault, one panel each for Murder, UrbanPop, and Rape]

Resemblance/distance Indices

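• Euclidean distance: $D = \sqrt{\sum_i (y_{i1} - y_{i2})^2}$, where $y_{i1}$ and $y_{i2}$ are the values of variable (species) $i$ in samples 1 and 2

* Not good for data with lots of zeros (e.g. species abundance)

• Bray-Curtis dissimilarity: $D = \sum_i |y_{i1} - y_{i2}| \,/\, \sum_i (y_{i1} + y_{i2})$

• D = 0 if the 2 samples have identical species abundances

• D = 1 if the 2 samples have no species in common

• Better for species abundance data (with lots of zeros)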
# Euclidean distance:
dist( USArrests )

# Bray-Curtis dissimilarity
# Vegetation in lichen pastures
data( varespec )
varespec

vegdist( varespec )
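The two boundary cases of the Bray-Curtis index (D = 0 and D = 1) can be checked by hand. The sketch below uses a small made-up abundance table and a hypothetical helper `bray_curtis()` written in base R; it is an illustration of the formula, not the code behind `vegdist()`.

```r
# Toy abundance table (hypothetical): rows = samples, columns = species
abund <- rbind(
  s1 = c(10, 0, 5, 0),
  s2 = c(10, 0, 5, 0),   # identical to s1
  s3 = c(0, 8, 0, 2)     # shares no species with s1
)

# Bray-Curtis dissimilarity between two abundance vectors
bray_curtis <- function(a, b) sum(abs(a - b)) / sum(a + b)

bc_same     <- bray_curtis(abund["s1", ], abund["s2", ])  # identical samples: 0
bc_disjoint <- bray_curtis(abund["s1", ], abund["s3", ])  # nothing in common: 1

# Euclidean distance on the same table, for comparison (base R)
eu <- as.matrix(dist(abund))

print(c(same = bc_same, disjoint = bc_disjoint))
print(round(eu, 2))
```

Unlike Bray-Curtis, the Euclidean distance between s1 and s3 has no fixed upper bound; it simply grows with the abundances.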

[Figure: example cluster dendrogram of four samples (1, 3, 2, 4); y-axis: Dissimilarity]

Hierarchical Clustering

• Patterns in a distance or dissimilarity matrix are difficult to detect.

• Find natural groupings by successively fusing samples

Hierarchical Clustering

Linkage Options:

• Single linkage (nearest neighbour clustering)

• Complete linkage (furthest neighbour clustering)

• Group-average linkage

• Ward’s minimum variance
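To see how the linkage rule changes the fusion levels, here is a minimal sketch on four made-up points on a line (the values 0, 1, 3, 7 are an assumption chosen so the arithmetic is easy to follow by hand).

```r
# Four hypothetical samples measured on a single variable
x <- c(a = 0, b = 1, c = 3, d = 7)
d <- dist(x)

# Heights at which groups fuse under each linkage rule
h_single   <- hclust(d, method = "single")$height    # nearest neighbour: 1, 2, 4
h_complete <- hclust(d, method = "complete")$height  # furthest neighbour: 1, 3, 7
h_average  <- hclust(d, method = "average")$height   # mean of all between-group pairs

print(rbind(single = h_single, complete = h_complete, average = h_average))
```

All three rules fuse a and b first (distance 1). They differ afterwards because the distance from the group {a, b} to c is taken as the minimum (2), the maximum (3), or the mean (2.5) of the pairwise distances.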

[Figure: Group 1 and Group 2 plotted on axes Sp 1 and Sp 2, with the single-link and complete-link distances between the groups marked]

[Figure: four dendrograms of the US states: Single Linkage, Complete Linkage, and Group-Average Linkage (y-axis: Dissimilarity), and Ward's Minimum Variance (y-axis: Height)]

Hierarchical Clustering

# Normalization: scale each variable by its root mean square (no centering)
arrest = scale( USArrests, center = FALSE )

# Euclidean distance
d = dist( arrest )

# Dendrograms
plot( hclust( d, "single" ) )
plot( hclust( d, "complete" ) )
plot( hclust( d, "average" ) )
# "ward" was renamed "ward.D" in R 3.1.0
plot( hclust( d, "ward.D" ) )
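The normalization step is worth a quick check, because `scale()` with `center = FALSE` does not divide by the standard deviation. As documented for base R, it divides each column by its root mean square, sqrt(sum(x^2) / (n - 1)). A minimal sketch that reproduces the result by hand:

```r
x <- as.matrix(USArrests)
n <- nrow(x)

# Root mean square of each column, as used by scale() when center = FALSE
rms <- sqrt(colSums(x^2) / (n - 1))

# Divide every column by its root mean square
manual <- sweep(x, 2, rms, "/")

# Compare with the built-in result used in the slides
arrest <- scale(USArrests, center = FALSE)
same <- isTRUE(all.equal(manual, arrest, check.attributes = FALSE))
print(same)
```

This puts the four variables on comparable scales so that Assault (which has the largest values) does not dominate the Euclidean distances.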

[Figure: Ward's cluster dendrogram of the US states, shown twice; y-axis: Height]

Determine the Number of Clusters

# Using Ward's method ("ward" was renamed "ward.D" in R 3.1.0)
clus = hclust( d, "ward.D" )
plot( clus )

# Cut into 3 groups
rect.hclust( clus, k = 3 )
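`rect.hclust()` only draws the boxes. To get the group membership itself, `cutree()` can be applied to the same tree. The sketch below rebuilds the tree so that it runs on its own; the choice of k = 3 and k = 6 follows the slides.

```r
arrest <- scale(USArrests, center = FALSE)
clus <- hclust(dist(arrest), method = "ward.D")

# Group membership for each state at two cut levels
group3 <- cutree(clus, k = 3)
group6 <- cutree(clus, k = 6)

# Number of states per group
sizes3 <- table(group3)
print(sizes3)

# Cross-tabulation: each of the 6 groups sits inside exactly one of the 3 groups
nested <- table(group3, group6)
print(nested)
```

Because both cuts come from the same hierarchy, the finer grouping is always nested within the coarser one.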

[Figure panels: dendrogram cut at K = 3 and at K = 6]

Determine Significant Clusters

Clarke et al. (2008, JEMBE 366:56-69)

Similarity Profile Test

# 999 permutations
# Group-average clustering
# alpha = 0.05

clus2 = simprof( arrest )
simprof.plot( clus2 )

* Colors = significant clusters

[Figure: SIMPROF dendrogram of the US states, with significant clusters shown in different colors]

Motivations for Ordination

• A dendrogram can still be difficult to interpret.

• Clustering forces samples into groups even though compositional change may be continuous.

• Ordination reduces the dimensionality of multivariate data (the "data cloud", so to speak).

• Ideally, it captures most of the information in two dimensions, so the multivariate patterns can be shown on a scatter plot.

Principal Component Analysis (PCA)


2 species example


• PC1 maximizes variance of points projected on it.

• PC2 is perpendicular to PC1

• PC3 is perpendicular to PC1 and PC2

• The new orthogonal axes are linear combinations of the original variables:

PC1 = 0.62 Sp1 + 0.52 Sp2 + 0.58 Sp3

PC2 = -0.73 Sp1 + 0.65 Sp2 + 0.2 Sp3

PC3 = 0.28 Sp1 + 0.55 Sp2 - 0.79 Sp3
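The "linear combination" statement can be verified directly: the PCA scores are the centered data multiplied by the loadings. The sketch below uses the scaled USArrests data rather than the three-species example (whose raw data are not given), so the coefficients will differ from those shown.

```r
arrest <- scale(USArrests, center = FALSE)
pca <- princomp(arrest)

# Loadings: one column of coefficients per principal component
L <- unclass(pca$loadings)

# Center the data, then form the linear combinations by matrix multiplication
centered <- sweep(arrest, 2, pca$center)
manual_scores <- centered %*% L

# The hand-computed scores should match those returned by princomp()
scores_match <- isTRUE(all.equal(manual_scores, pca$scores,
                                 check.attributes = FALSE))

# The axes are orthogonal: t(L) %*% L is the identity matrix
orthogonal <- isTRUE(all.equal(crossprod(L), diag(ncol(L)),
                               check.attributes = FALSE))
print(c(scores_match = scores_match, orthogonal = orthogonal))
```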


3 species example


# PCA
pca = princomp( arrest )

# New orthogonal axes
pairs( pca$scores )

[Figure: scatterplot matrix of the PCA scores (Comp.1 to Comp.4), and bar plot of the variances of Comp.1 to Comp.4]


# Variable contributions
# PC1 = -0.65 Murder -0.6 Assault -0.46 Rape
pca$loadings

# Variance of PC axes
plot( pca )

# Total variance explained
summary( pca )
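The proportions reported by `summary( pca )` can also be computed by hand from the component standard deviations, which makes it clearer what "variance explained" means. A minimal sketch:

```r
arrest <- scale(USArrests, center = FALSE)
pca <- princomp(arrest)

# Variance of each principal component
pc_var <- pca$sdev^2

# Proportion of the total variance carried by each component
prop <- pc_var / sum(pc_var)

# Cumulative proportion: how much the first k axes capture together
cum_prop <- cumsum(prop)

print(round(rbind(proportion = prop, cumulative = cum_prop), 3))
```

If the first two components carry most of the total, a 2D scatter plot of Comp.1 against Comp.2 is a fair summary of the data cloud.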


[Figure: PCA ordination of the US states, labelled by state name; x-axis: Comp.1, y-axis: Comp.2]

# Cut dendrogram into 6 clusters
group = cutree( clus, 6 )

plot( pca$scores, type = "n" )

text( pca$scores, names( group ), col = group )

[Figure: PCA biplot of the US states (Comp.1 vs Comp.2) with arrows for Murder, Assault, UrbanPop, and Rape]


# Add variable contributions
biplot( pca, scale = 0 )

Non-Metric Multidimensional Scaling (nMDS)

• Ordination based on the ranks of a resemblance (or distance) matrix

• Robust and flexible for all kinds of resemblance indices

• An iterative procedure successively refines the locations of the ordination points according to the ranked dissimilarities of the samples

• A better choice than PCA for species abundance data
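The rank-based idea can be tried without vegan by using `isoMDS()` from the MASS package, which ships with standard R installations. This is a swapped-in function, not the `metaMDS()` call used in the slides: it assumes Euclidean distances on the scaled USArrests data and a single starting configuration, so the result will not be identical.

```r
library(MASS)  # assumed available; provides isoMDS() (Kruskal's non-metric MDS)

arrest <- scale(USArrests, center = FALSE)
d <- dist(arrest)

# Two-dimensional non-metric MDS; trace = FALSE silences the iteration log
fit <- isoMDS(d, k = 2, trace = FALSE)

# Distances between points in the ordination
ord_d <- dist(fit$points)

# nMDS tries to preserve rank order, so compare ranks (Spearman correlation)
rank_agreement <- cor(as.vector(d), as.vector(ord_d), method = "spearman")

print(fit$stress)       # stress, reported by isoMDS() in percent
print(rank_agreement)
```

A rank correlation close to 1 means the 2D map orders the between-state distances almost the same way as the original four-variable data.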

[Figure: stress plot; x-axis: Observed Dissimilarity, y-axis: Ordination Distance; non-metric fit R2 = 0.995, linear fit R2 = 0.98]

Multidimensional Scaling (nMDS)

mds = metaMDS( arrest )
stressplot( mds )

Multidimensional Scaling (nMDS)

[Figure: nMDS ordination of the US states (MDS1 vs MDS2), labelled by state name, and biplot with variable scores for Murder, Assault, UrbanPop, and Rape]

# Ordination with 6 clusters
plot( mds$points, type = "n" )
text( mds$points, names( group ), pch = group, col = group )

# Add variable scores
# Weighted average
biplot( mds$points, mds$species )

[Figure: scatter plot of veg.dist against env.dist]

Correlation between Matrices

# Vegetation and environment in lichen pastures
data( varespec )
data( varechem )

# Bray-Curtis dissimilarity
veg.dist = vegdist( varespec )

# Euclidean distance
env.dist = dist( scale( varechem ) )

Mantel Test

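[Diagram: the sites × species matrix gives a sites × sites Bray-Curtis (BC) matrix, and the sites × environment matrix gives a sites × sites Euclidean-distance (ED) matrix; the ranked entries (1, 2, 3, ...) of the two matrices are then correlated (ρ)]

[Figure: histogram of permuted correlations; x-axis: Pearson Correlation (r), y-axis: Frequency; observed r = 0.3]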

# Mantel test
# Based on 999 permutations
# Pearson's correlation
man = mantel( veg.dist, env.dist )
man

# Distribution of permuted r
hist( man$perm )
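The permutation logic behind the Mantel test is short enough to write out in base R. The sketch below is an illustration only, not vegan's implementation: it assumes two distance matrices built from USArrests (crime variables versus UrbanPop) as stand-ins for the vegetation and environment matrices, and the names `d_crime` and `d_urban` are hypothetical.

```r
set.seed(1)  # reproducible permutations

# Two distance matrices on the same 50 states (stand-in data)
d_crime <- dist(scale(USArrests[, c("Murder", "Assault", "Rape")]))
d_urban <- dist(scale(USArrests[, "UrbanPop", drop = FALSE]))

# Observed Pearson correlation between the two sets of distances
r_obs <- cor(as.vector(d_crime), as.vector(d_urban))

# Permute the rows and columns of one matrix together, 999 times
m <- as.matrix(d_urban)
n <- nrow(m)
r_perm <- replicate(999, {
  i <- sample(n)
  cor(as.vector(d_crime), as.vector(as.dist(m[i, i])))
})

# One-sided p-value: share of permuted r at least as large as the observed r
p_value <- (sum(r_perm >= r_obs) + 1) / (length(r_perm) + 1)

print(c(r = r_obs, p = p_value))
hist(r_perm, xlab = "Pearson Correlation (r)", main = "Permuted r")
```

Shuffling rows and columns together keeps each matrix a valid distance matrix while breaking the pairing of sites between the two matrices, which is what the null hypothesis of no association implies.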

Mantel Test

Best Environmental Subsets

[Diagram: as for the Mantel test, the sites × sites Bray-Curtis (BC) matrix from the species data is rank-correlated (ρ) with the sites × sites Euclidean-distance (ED) matrix from the environmental data]

BIOENV

bioenv( varespec, varechem )
# 16383 possible subsets
# Subset of environmental variables with best correlation to community data
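The figure of 16383 is the number of non-empty subsets of 14 variables (16383 = 2^14 - 1, so 14 is inferred from the count). The short sketch below checks the arithmetic and lists the subsets for a three-variable case; the names N, P, and Al are borrowed from the figure label purely for illustration.

```r
# Number of non-empty subsets of 14 environmental variables
n_vars <- 14
n_subsets <- sum(choose(n_vars, 1:n_vars))
print(n_subsets)  # 16383, i.e. 2^14 - 1

# For a small case, every subset can be listed explicitly
vars <- c("N", "P", "Al")
subsets <- unlist(
  lapply(seq_along(vars), function(k) combn(vars, k, simplify = FALSE)),
  recursive = FALSE
)
print(sapply(subsets, paste, collapse = " + "))
```

The count doubles with every added variable, which is why an exhaustive search becomes slow for large environmental tables.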

[Figure: scatter plot of veg.dist against env.dist (N + P + Al + Mn + Baresoil)]

Testing Group Differences for Community Data

# Vegetation in Dutch Dune Meadows
data( dune )
dune

# More species (variables) than samples
# Dominance of zero values
# Violates multivariate normality and constant variance across the groups
# A robust, permutation-based test is needed for community data.

Analysis of Similarity (ANOSIM)

• R = 1: all samples within groups are more similar to each other than to any sample from a different group

• R = 0: between-group and within-group similarities are the same on average

• R is an absolute measure of group separation

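[Diagram: the sites × species matrix is converted to a sites × sites Bray-Curtis (BC) dissimilarity matrix, whose entries are ranked 1, 2, 3, ...]

$R = \dfrac{r_B - r_W}{n(n-1)/4}$

r_B = average rank between groups

r_W = average rank within groups

n = sample size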
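The R statistic can be computed by hand from this formula. The sketch below uses six made-up samples on one variable, split into two well-separated groups, so that every between-group distance is larger than every within-group distance and R should come out as exactly 1. It uses Euclidean distance from base R rather than Bray-Curtis, which does not matter for the ranking logic.

```r
# Six hypothetical samples in two clearly separated groups
x   <- c(0, 1, 2, 10, 11, 12)
grp <- c(1, 1, 1, 2, 2, 2)
n   <- length(x)

# Rank all pairwise distances (ties get average ranks)
d <- dist(x)
r <- rank(as.vector(d))

# Flag each pair as within-group or between-group,
# in the same (column-wise lower triangle) order that dist() uses
same_group <- outer(grp, grp, "==")
within <- same_group[lower.tri(same_group)]

r_W <- mean(r[within])    # average rank within groups
r_B <- mean(r[!within])   # average rank between groups

R <- (r_B - r_W) / (n * (n - 1) / 4)
print(R)  # 1: complete separation
```

Here the 6 within-group pairs take ranks 1 to 6 (mean 3.5) and the 9 between-group pairs take ranks 7 to 15 (mean 11), so R = 7.5 / 7.5 = 1.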

Analysis of Similarity (ANOSIM)

# Environmental factors in Dutch Dune Meadows
data( dune.env )

# Does moisture have an effect on vegetation?
Moisture = as.numeric( dune.env$Moisture )

# Run an MDS on dune vegetation
mds = metaMDS( dune )

# The MDS plot seems to suggest a moisture effect
plot( mds$points, pch = 21, bg = Moisture, cex = Moisture )

[Figure: "Vegetation in Dutch Dune Meadows", MDS1 vs MDS2, with point color and size showing Moisture level (1 to 4)]

Analysis of Similarity (ANOSIM)

aos = anosim( dune, Moisture )

aos

# Distribution of permuted R
hist( aos$perm )

[Figure: histogram of permuted ANOSIM R statistics; x-axis: ANOSIM R-statistics, y-axis: Frequency; observed R = 0.43]

Other Useful Functions

Clustering:

• pam() for clustering around medoids and clara() for clustering large data (both in “cluster”)

• pvclust() in “pvclust” for assessing the uncertainty in hierarchical cluster analysis

Ordination:

• Great PCA video explanation on YouTube: http://www.youtube.com/watch?v=BfTMmoDFXyE&feature=share&list=PL269700C504BC6944

• imputePCA() in “missMDA” for handling missing data

• cca() and rda() in “vegan” for constrained ordinations

Testing difference:

• mrpp() in “vegan” for ANOSIM-type analysis, but using the original dissimilarities instead of their ranks

• adonis() in “vegan” for robust and flexible multivariate permutational analysis of variance (e.g. factorial and nested designs, mixed models)

• betadisper() in “vegan” for testing constant multivariate variance (or dispersion)
