Multivariate analysis of genetic data: investigating spatial ...adegenet.r-forge.r-project.org/files/Glasgow2015/...Multivariate analysis of genetic data: investigating spatial structures

Multivariate analysis of genetic data: investigatingspatial structures

Thibaut Jombart∗

Imperial College London

MRC Centre for Outbreak Analysis and Modelling

July 30, 2015

Abstract

This practical provides an introduction to the analysis of spatial genetic structuresusing R. Using an empirical dataset of microsatellites sampled from wild goats in theBauges mountains (France), we illustrate univariate and multivariate tests of spatialstructures, and then introduce the spatial Principal Component Analysis (sPCA)for uncovering spatial genetic patterns. For a more complete overview of spatialgenetics methods, see the basics and sPCA tutorials for adegenet, which you canopen using adegenetTutorial("basics") and adegenetTutorial("spca") (with aninternet connection) or from the adegenet website.

∗[email protected]

1

Contents

1 An overview of the data 3

2 Summarising the genetic diversity 5

3 Mapping and testing PCA results 12

4 Multivariate tests of spatial structure 18

5 spatial Principal Component Analysis 19

6 To go further 24

2

The chamois (Rupicapra rupicapra) is a conserved species in France. The Baugesmountains is a protected area in which the species has been recently studied. One of themost important questions for conservation purposes relates to whether individuals from thisarea form a single reproductive unit, or whether they are structured into sub-groups, and ifso, what causes are likely to induce this structuring.

While field observations are very scarce and do not allow to answer this question, geneticdata can be used to tackle the issue, as departure from panmixia should result in geneticstructuring. The dataset rupica contains 335 georeferenced genotypes of Chamois from theBauges mountains for 9 microsatellite markers, which we propose to analyse in this exercise.

1 An overview of the data

We first load some required packages and the data:

library(spdep)

library(adehabitat)

library(ade4)

library(adegenet)

data(rupica)

rupica

## /// GENIND OBJECT /////////

##

## // 335 individuals; 9 loci; 55 alleles; size: 217.2 Kb

##

## // Basic content

## @tab: 335 x 55 matrix of allele counts

## @loc.n.all: number of alleles per locus (range: 4-10)

## @loc.fac: locus factor for the 55 columns of @tab

## @all.names: list of allele names for each locus

## @ploidy: ploidy of each individual (range: 2-2)

## @type: codom

## @call: NULL

##

## // Optional content

## @other: a list containing: xy mnt showBauges

rupica is a typical genind object, which is the class of objects storing genotypes (asopposed to population data) in adegenet. rupica also contains topographic informationabout the sampled area, which can be displayed by calling rupica$other$showBauges. Forinstance, the spatial distribution of the sampling can be displayed as follows:

3

other(rupica)$showBauges()

points(other(rupica)$xy, col="red",pch=20)

500

500

500

500

1000

1000

100

0

1000

150

0

150

0

1500

1500 1

500

1500

1500

2000

200

0

5 km N

●

●

●

●

●●

●

●●

●

●

● ●●

●● ●

●●

●

●

●

●●

●

●

●

●●

●

●

●●●●

●●

●

●

●

●

●

●

●

●●

●

●

●

●●

●

●

●

●

●

●

●●

●

●

●●●

●

●

●

●●●●

●

●

●

●

●

●

●

●

●

●

●●●

●

●

●

●

●●

●●

●

●

●

●

●

●

●

●

●●

●

●

●●

●●● ●

● ●●

●

●●

●

●●

●

●

●

●●

●

●

●

● ●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●●

●●

●

●●

●

●

●

●

●

●

●

●

●●● ●

●

●

●

●

●

●

●

●

●●●

●●

●●

● ●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●●

●

●

●

●

●

●

●

● ●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●●

●

●

●

●

●

●

●

●

●●

●

●●

●

This spatial distribution is clearly not random, but seems arranged into loose clusters.However, superimposed samples can bias our visual assessment of the spatial clustering. Usea two-dimensional kernel density estimation (function s.kde2d) to overcome this possibleissue.

other(rupica)$showBauges()

s.kde2d(other(rupica)$xy,add.plot=TRUE)

points(other(rupica)$xy, col="red",pch=20)

4

500

500

500

500

1000

1000

100

0

1000

150

0

150

0

1500

1500

150

0

1500

1500

2000

200

0

5 km N

●

●

●

●

●●

●

●●

●

●

● ●●

●● ●

●●

●

●

●

●●

●

●

●

●●

●

●

●●●●

●●

●

●

●

●

●

●

●

●●

●

●

●

●●

●

●

●

●

●

●

●●

●

●

●●●

●

●

●

●●●●

●

●

●

●

●

●

●

●

●

●

●●●

●

●

●

●

●●

●●

●

●

●

●

●

●

●

●

●●

●

●

●●

●●● ●

● ●●

●

●●

●

●●

●

●

●

●●

●

●

●

● ●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●●

●●

●

●●

●

●

●

●

●

●

●

●

●●● ●

●

●

●

●

●

●

●

●

●●●

●●

●●

● ●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●●

●

●

●

●

●

●

●

● ●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●●

●

●

●

●

●

●

●

●

●●

●

●●

●

●

●

●

●

●●

●

●●

●

●

● ●●

●● ●

●●

●

●

●

●●

●

●

●

●●

●

●

●●●●

●●

●

●

●

●

●

●

●

●●

●

●

●

●●

●

●

●

●

●

●

●●

●

●

●●●

●

●

●

●●●●

●

●

●

●

●

●

●

●

●

●

●●●

●

●

●

●

●●

●●

●

●

●

●

●

●

●

●

●●

●

●

●●

●●● ●

● ●●

●

●●

●

●●

●

●

●

●●

●

●

●

● ●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●●

●●

●

●●

●

●

●

●

●

●

●

●

●●● ●

●

●

●

●

●

●

●

●

●●●

●●

●●

● ●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●●

●

●

●

●

●

●

●

● ●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●●

●

●

●

●

●

●

●

●

●●

●

●●

●

Is geographical clustering strong enough to assign safely each individual to a group?Accordingly, shall we analyse these data at individual or group level?

2 Summarising the genetic diversity

As a prior clustering of genotypes is not known, we cannot employ usual FST -basedapproaches to detect genetic structuring. However, genetic structure could still result ina deficit of heterozygosity. Use the summary of genind objects to compare expected andobserved heterozygosity:

rupica.smry <- summary(rupica)

##

## # Total number of genotypes: 335

##

## # Population sample sizes:

## P1

## 335

5

##

## # Number of alleles per locus:

## Bm1818 Bm203 Eth225 Hel1 Bm4505 Bmc1009 Ilst030 Maf70 Sr

## 7 10 7 6 5 5 6 4 5

##

## # Number of alleles per population:

## P1

## 55

##

## # Percentage of missing data:

## [1] 0

##

## # Observed heterozygosity:

## Bm1818 Bm203 Eth225 Hel1 Bm4505 Bmc1009 Ilst030

## 0.5880597 0.6208955 0.5253731 0.7492537 0.6537313 0.5283582 0.6298507

## Maf70 Sr

## 0.5552239 0.4149254

##

## # Expected heterozygosity:

## Bm1818 Bm203 Eth225 Hel1 Bm4505 Bmc1009 Ilst030

## 0.6072158 0.6532517 0.5314591 0.7259626 0.6603405 0.5706082 0.6393235

## Maf70 Sr

## 0.5473112 0.4039120

plot(rupica.smry$Hexp, rupica.smry$Hobs, main="Observed vs expected heterozygosity")

abline(0,1,col="red")

6

●

●

●

●

●

●

●

●

●

0.40 0.45 0.50 0.55 0.60 0.65 0.70

0.45

0.50

0.55

0.60

0.65

0.70

0.75

Observed vs expected heterozygosity

rupica.smry$Hexp

rupi

ca.s

mry

$Hob

s

The red line indicate identity between both quantities. What can we say aboutheterozygosity in this population? How can this be tested? The result below can bereproduced using a standard testing procedure:

t.test(rupica.smry$Hexp, rupica.smry$Hobs,paired=TRUE,var.equal=TRUE)

##

## Paired t-test

##

## data: rupica.smry$Hexp and rupica.smry$Hobs

## t = 1.1761, df = 8, p-value = 0.2734

## alternative hypothesis: true difference in means is not equal to 0

## 95 percent confidence interval:

## -0.007869215 0.024249885

## sample estimates:

## mean of the differences

## 0.008190335

7

We can seek a global picture of the genetic diversity among genotypes using a PrincipalComponent Analysis (PCA, [4, 2], dudi.pca in ade4 package). The analysis is performed ona table of allele frequencies, obtained by tab. The function dudi.pca displays a barplot ofeigenvalues and asks for a number of retained principal components:

rupica.X <- tab(rupica, freq=TRUE, NA.method="mean")

rupica.pca1 <- dudi.pca(rupica.X, cent=TRUE, scannf=FALSE, nf=2)

barplot(rupica.pca1$eig)

0.0

0.5

1.0

1.5

2.0

2.5

3.0

The output produced by dudi.pca is a dudi object. A dudi object contains variousinformation; in the case of PCA, principal axes (loadings), principal components (syntheticvariable), and eigenvalues are respectively stored in $c1, $li, and $eig slots. Here is thecontent of the PCA:

rupica.pca1

## Duality diagramm

## class: pca dudi

8

## $call: dudi.pca(df = rupica.X, center = TRUE, scannf = FALSE, nf = 2)

##

## $nf: 2 axis-components saved

## $rank: 51

## eigen values: 3.013 2.563 2.271 2.107 2.1 ...

## vector length mode content

## 1 $cw 55 numeric column weights

## 2 $lw 335 numeric row weights

## 3 $eig 51 numeric eigen values

##

## data.frame nrow ncol content

## 1 $tab 335 55 modified array

## 2 $li 335 2 row coordinates

## 3 $l1 335 2 row normed scores

## 4 $co 55 2 column coordinates

## 5 $c1 55 2 column normed scores

## other elements: cent norm

In general, eigenvalues represent the amount of genetic diversity — as measured by themultivariate method being used — represented by each principal component (PC). Verifythat here, each eigenvalue is the variance of the corresponding PC.

head(rupica.pca1$eig)

## [1] 3.013193 2.562510 2.271093 2.107049 2.099738 1.971703

apply(rupica.pca1$li,2,var)*334/335

## Axis1 Axis2

## 3.013193 2.562510

An abrupt decrease in eigenvalues is likely to indicate the boundary between truepatterns and non-interpretable structures. In this case, how many PCs would you interprete?

Use s.label to display to two first components of the analysis. Then, use a kernel density(s.kde2d) for a better assessment of the distribution of the genotypes onto the principal axes:

s.label(rupica.pca1$li)

s.kde2d(rupica.pca1$li, add.p=TRUE, cpoint=0)

add.scatter.eig(rupica.pca1$eig,2,1,2)

9

d = 5

8

63 64 66 67 68 69

70

71 72

74 76

77 78 81

82

83

85 86

162 163

164

165

169

170

545 556 599 600

616

633 636

664

665

669

734

735

862 863 1170

1171

1429 1474 2000 2002

2004

2051

2053

2057

2158

2183

2198

2307

2309

2317

2319

2320 2321

2323

2329

2332 2333 2340

2344 2347

2350

2353

2355

2357

2358

2360

2372 2373 2374

2379 2401

2402

2406

2408

2412

2418 2443 2454

2459

2469

2485

2498 2503

2508

2514 2542 2546

2562 2564

2569

2577

2581 2588

2590

2596 2604

2609 2614

2615

2626 2628

2632

2638

2643

2645

2672

2680 2682

2689 2695

2698

2721 5572

5575

5576

5578 5579 5580 5583

5584

5852 5853

5855

5856 5862

5864

5866

5868

5869 5895

5896

5899

5904

5911 5912

5915 5919

5922 5924

5938

5939

5948

5951

5954

5956 5961 5963

5966 5969

5971 5972

5980

5983 5988

5991

5994 5995

5996

5997

6001

6004

6007

6008 6013

6014 6016 6019 6020

6021

6023

6024

6027

6029 6030

6031 6033

6034

6035

6036

6037

6043

6045

6071

6087

6100 6102

6111

6113 6132

6133

6142 6146

6147

6148

6149

6150

6151

6152 6153 6154

6155 6156 6157 6159

6161

6162 6163

6165

6166

6168 6184

6214

6216

6217 6218

6219

6220

6222 6223

6224 7072

7077 7090 7091

7094

7100

7101

7102

7104

7110

7112

7114

7120

7128

7136

7137

7141

7143 7161

7173

7174 7175

7176

7178

7179

7185

7186

7188 7189 7191

7192 7193 7194

7197

7199

7203 7206

7207

7209 7222 7224 7230

7234 7237

7239

7241

7244

7379

7380

7382 7385

7402

7435

7438

7443

7451

7456

7466

7469

7473

7474 7476 7477

7478

7480

7483

7484

7485

7496

7497

7500 7501 7502 7503 7505 7514

7516

7523 7524

7526

7535

7561

7562 7563

7566

7567 7570

7630

7634

7635

7636 7639 7641 7642 7643 7644 7647 7648

7652

7659

7666

7667

7668

6000b

6160b

6161b 7101b

7477b

7480b 7641b

Eigenvalues

What can we say about the genetic diversity among these genotypes as inferred by PCA?The function loadingplot allows to visualize the contribution of each allele, expressed assquared loadings, for a given principal component. Using this function, reproduce this figure:

loadingplot(rupica.pca1$c1^2)

10

0.00

0.05

0.10

0.15

0.20

0.25

0.30

Loading plot

Variables

Load

ings

Bm203.221

Eth225.136Eth225.142

Eth225.146

Eth225.148

Eth225.151Hel1.120Bm4505.260

Bm4505.266Bmc1009.230

Ilst030.161

Maf70.134Maf70.139 Sr.231

What do we observe? We can get back to the genotypes for the concerned markers (e.g.,Bm203) to check whether the highlighted genotypes are uncommon. tab extracts the tableof allele frequencies from a genind object:

X <- tab(rupica)

class(X)

## [1] "matrix"

dim(X)

## [1] 335 55

bm203.221 <- X[,"Bm203.221"]

table(bm203.221)

## bm203.221

## 0 1

## 331 4

11

Only 4 genotypes possess one copy of the allele 221 of marker bm203 (the second resultcorresponds to a replaced missing data). Which individuals are they?

rownames(X)[bm203.221>0]

## [1] "8" "86" "600" "7385"

Conclusion?

3 Mapping and testing PCA results

A frequent practice in spatial genetics is mapping the first principal components (PCs) ontothe geographic space. The function s.value is well-suited to do so, using black and whitesquares of variable size for positive and negative values. To give a legend for this type ofrepresentation:

s.value(cbind(1:11,rep(1,11)), -5:5, cleg=0)

text(1:11,rep(1,11), -5:5, col="red",cex=1.5)

d = 2

−5 −4 −3 −2 −1 0 1 2 3 4 5

12

Apply this graphical representation to the first two PCs of the PCA:

showBauges <- other(rupica)$showBauges

showBauges()

s.value(other(rupica)$xy, rupica.pca1$li[,1], add.p=TRUE, cleg=0.5)

title("PCA - first PC",col.main="yellow" ,line=-2, cex.main=2)

500

500

500

500

1000

1000

100

0

1000

150

0

150

0

1500

1500

150

0

1500

1500

2000

200

0

5 km N −22.5 −17.5 −12.5 −7.5 −2.5 2.5

PCA − first PC

showBauges()

s.value(other(rupica)$xy, rupica.pca1$li[,2], add.p=TRUE, csize=0.7)

title("PCA - second PC",col.main="yellow" ,line=-2, cex.main=2)

13

500

500

500

500

1000

1000

100

0

1000

150

0

150

0

1500

1500

150

0

1500

1500

2000

200

0

5 km N −5 −3 −1 1 3 5

PCA − second PC

What can we say about spatial genetic structure as inferred by PCA? This visual assessmentcan be complemented by testing the spatial autocorrelation in the first PCs of PCA. Thiscan be achieved using Moran’s I test. Use the function moran.mc in the package spdep toperform these tests. You will need first to define the spatial connectivity between the sampledindividuals. For these data, spatial connectivity is best defined as the overlap betweenhome ranges of individuals. Home ranges will be modelled as disks with a radius of 1150m.Use chooseCN to create a connection network based on distance range (“neighbourhoodby distance”). What threshold distance do you choose for individuals to be considered asneighbours?

rupica.graph <- chooseCN(other(rupica)$xy,type=5,d1=0,d2=2300, plot=FALSE,

res="listw")

The connection network should ressemble this:

rupica.graph

## Characteristics of weights list object:

## Neighbour list object:

14

## Number of regions: 335

## Number of nonzero links: 18018

## Percentage nonzero weights: 16.05525

## Average number of links: 53.78507

##

## Weights style: W

## Weights constants summary:

## n nn S0 S1 S2

## W 335 112225 335 15.04311 1352.07

plot(rupica.graph, other(rupica)$xy)

title("rupica.graph")

●

●

●●

●●

●

●●

●●

● ●●●●●

●●●

●

●

●●

●

●

●

●●

●●

●●●●

●●

●●

●

●

●

●

●

●●

●

●

●

● ●

●

●

●

●

●

●●

● ●

●

●●●●

●

●

●●●●

●

●

●

●

●

●

●

●

●

●

●●●●

●

●

●● ●●●

●

●●

●

●●

●

●

●●●●

●●

●●● ●

●●●

●●●

●

●●

●

●

●

●●●

●

●

●●

●

●●● ●

●

●

●

●

●●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●●

●

●●●

●

●

●

● ●●●

●

●●

●●

●

●

●

●

●

●

●●● ●

●

●

●

●

●●

●

●

●●●

●●

●●●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●●●

●

●

●

●●

●

●●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●●

●

●

●●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●●●

●

●

●

●

●

●

●

●

●

●●

●

●●

●

rupica.graph

Perform Moran’s test for the first two PCs, and plot the results. The first test should besignificant:

pc1.mctest <- moran.mc(rupica.pca1$li[,1], rupica.graph, 999)

plot(pc1.mctest)

15

−0.02 0.00 0.02 0.04

010

2030

4050

60

Density plot of permutation outcomes

Monte−Carlo simulation of Moran's Irupica.pca1$li[, 1]

Den

sity

Compare this result to the mapping of the first PC of PCA. What is wrong? When a testgives unexpected results, it is worth looking into the data in more details. Moran’s plot(moran.plot) plots the tested variable against its lagged vector. Use it on the first PC:

moran.plot(rupica.pca1$li[,1], rupica.graph)

16

●

●

●

●

●●

●

●

●

●

●

●

●●

●●●

●

●

●

●

●

●

●

●

●

●

●

● ●

●

●●●●

●●

●●

●

●

●

●

●●●● ●●

●●●●

●●●

●● ●●

●

●●●●●●

●●●●●

●●

● ●●

●

●●

●

●●●●

●

●

●●●● ●●● ●●●●●● ●●● ●●●● ●●●

●●●●

●●●

●●●●●●●

● ●

●

●●

●

●●●●

●●

●

●

●● ●●

●

●

●●●● ●●

●

●●

●●●

●●

●

●

●

●

●

●

●

●

●● ●●●●●●●●●

●

●●

●

●

●●●

●●●

●●

●

●●

●

●

●●

●

●

●●

●

●●●●●●●●●●●●●●

● ●●

● ●● ●

●●

●●●●●

●●

●

●

●●

●●●●

●

● ●●●●●●

●●

●●

●

●●●●●●●

●

●

●●●

●

●● ●●●●

●

●●

●

●●

●●●●●

●●●●●● ●● ●

●

●●●

●

●●

●●●●● ●●

●●●●

●●●●

●●

●

●●

●

●●

●●●●

●

●●●●●●

●

●

−20 −15 −10 −5 0

−2.

0−

1.5

−1.

0−

0.5

0.0

rupica.pca1$li[, 1]

spat

ially

lagg

ed r

upic

a.pc

a1$l

i[, 1

]1

2

3

4

56

7

9

1011

12151617

19

20

21

22

23

24

25

29

43

273

274275

276

Actual positive autocorrelation corresponds to a positive correlation between a variable andits lag vector. Is it the case here? How can we explain that Moran’s test was significant?

Repeat these analyses for the second PC. What are your conclusions?

pc2.mctest <- moran.mc(rupica.pca1$li[,2], rupica.graph, 999)

pc2.mctest

##

## Monte-Carlo simulation of Moran's I

##

## data: rupica.pca1$li[, 2]

## weights: rupica.graph

## number of simulations + 1: 1000

##

## statistic = 0.0048162, observed rank = 797, p-value = 0.203

## alternative hypothesis: greater

plot(pc2.mctest)

17

−0.02 0.00 0.02 0.04

010

2030

40

Density plot of permutation outcomes

Monte−Carlo simulation of Moran's Irupica.pca1$li[, 2]

Den

sity

4 Multivariate tests of spatial structure

So far, we have only tested the existence of spatial structures in the first two principalcomponents of a PCA of the data. Therefore, these tests only describe one fragment of thedata, and do not encompass the whole diversity in the data. As a complement, we can useMantel test (mantel.randtest) to test spatial structures in the whole data, by assessing thecorrelation between genetic distances and geographic distances. Pairwise Euclidean distancesare computed using dist. Perform Mantel test, using the scaled genetic data you used beforein PCA, and the geographic coordinates.

mtest <- mantel.randtest(dist(rupica.X), dist(other(rupica)$xy))

plot(mtest, nclass=30)

18

Histogram of sim

sim

Fre

quen

cy

−0.05 0.00 0.05

020

4060

8010

0

What is your conclusion? Shall we be looking for spatial structures? If so, how can weexplain that PCA did not reveal them?

5 spatial Principal Component Analysis

The spatial Principal Component Analysis (sPCA, function spca [3]) has been especiallydeveloped to investigate hidden or non-obvious spatial genetic patterns. Like Moran’s I test,sPCA first requires the spatial proximities between genotypes to be modeled. You will reusethe connection network defined previously using chooseCN, and pass it as the ’cn’ argumentof the function spca.

Read the documentation of spca, and apply the function to the dataset rupica. Thefunction will display a barplot of eigenvalues:

rupica.spca1 <- spca(rupica, cn=rupica.graph,scannf=FALSE, nfposi=2,nfnega=0)

barplot(rupica.spca1$eig, col=rep(c("red","grey"), c(2,1000)) )

19

−0.

005

0.00

00.

005

0.01

00.

015

0.02

00.

025

0.03

0

This figure illustrates the fundamental difference between PCA and sPCA. Like dudi.pca,spca displays a barplot of eigenvalues, but unlike in PCA, eigenvalues of sPCA can alsobe negative. This is because the criterion optimized by the analysis can have positive andnegative values, corresponding respectively to positive and negative autocorrelation. Positivespatial autocorrelation correspond to greater genetic similarity between geographically closerindividuals. Conversely, negative spatial autocorrelation corresponds to greater dissimilaritybetween neighbours. The spatial autocorrelation of a variable is measured by Moran’s I, andinterpreted as follows:

• I0 = −1/(n− 1): no spatial autocorrelation (x is randomly distributed across space)

• I > I0: positive spatial autocorrelation

• I < I0: negative spatial autocorrelation

Principal components of PCA ensure that (φ referring to one PC) var(φ) is maximum.By contrast, sPCA provides PC which decompose the quantity var(φ)I(φ). In other words,PCA focuses on variability only, while sPCA is a compromise between variability (var(φ))and spatial structure (I(φ)).

20

In this case, only the principal components associated with the two first positiveeigenvalues (in red) shall be retained. The printing of spca objects is more explicit thandudi objects, but named with the same conventions:

rupica.spca1

## ########################################

## # spatial Principal Component Analysis #

## ########################################

## class: spca

## $call: spca(obj = rupica, cn = rupica.graph, scannf = FALSE, nfposi = 2,

## nfnega = 0)

##

## $nfposi: 2 axis-components saved

## $nfnega: 0 axis-components saved

## Positive eigenvalues: 0.03001 0.01445 0.00924 0.006851 0.004485 ...

## Negative eigenvalues: -0.008643 -0.006407 -0.004488 -0.003977 -0.003248 ...

##

## vector length mode content

## 1 $eig 51 numeric eigenvalues

##

## data.frame nrow ncol

## 1 $c1 55 2

## 2 $li 335 2

## 3 $ls 335 2

## 4 $as 2 2

## content

## 1 principal axes: scaled vectors of alleles loadings

## 2 principal components: coordinates of entities ('scores')

## 3 lag vector of principal components

## 4 pca axes onto spca axes

##

## $xy: matrix of spatial coordinates

## $lw: a list of spatial weights (class 'listw')

##

## other elements: NULL

Unlike usual multivariate analyses, eigenvalues of sPCA are composite: they measure boththe genetic diversity (variance) and the spatial structure (spatial autocorrelation measuredby Moran’s I). This decomposition can also be used to choose which principal componentto interprete. The function screeplot allows to display this information graphically:

screeplot(rupica.spca1)

21

0.0 0.2 0.4 0.6 0.8 1.0

Variance

Spa

tial a

utoc

orre

latio

n (I

)

λ1

λ2λ3λ4λ5λ6λ7λ8λ9λ10λ11λ12λ13λ14λ15λ16λ17λ18λ19λ20λ21λ22λ23λ24λ25λ26λ27λ28λ29λ30λ31λ32λ33λ34 λ35λ36λ37λ38λ39λ40 λ41λ42 λ43λ44λ45 λ46λ47 λ48λ49λ50 λ51

−0.4

00

0.3

0.7

1

Spatial and variance components of the eigenvalues

While λ1 indicates with no doubt a structure, the second eigenvalue, λ2 is less clearly distinctfrom the successive values. Thus, we shall keep in mind this uncertainty when interpretingthe second principal component of the analysis.

Try visualising the sPCA results as you did before with the PCA results. To clarify thepossible spatial patterns, you can map the lagged PC ($ls) instead of the PC ($li), which area ’denoisified’ version of the PCs.

First, map the first principal component of sPCA. How would you interprete this result?How does it compare to the first PC of PCA? What inferrence can we make about the waythe landscape influences gene flow in this population of Chamois?

Do the same with the second PC of sPCA. Some field observations suggest that thispattern is not artefactual. How would you interprete this second structure?

To finish, you can try representing both structures at the same time using the color codingintroduced by [1] (?colorplot). The final figure should ressemble this (although colors maychange from one computer to another):

22

showBauges()

colorplot(other(rupica)$xy, rupica.spca1$ls, axes=1:2, transp=TRUE, add=TRUE,

cex=2)

title("sPCA - colorplot of PC 1 and 2\n(lagged scores)", col.main="yellow",

line=-2, cex=2)

500

500

500

500

1000

1000

100

0

1000

150

0

150

0

1500

1500

150

0

1500

1500

2000

200

0

5 km N

sPCA − colorplot of PC 1 and 2(lagged scores)

23

6 To go further

More spatial genetics methods and are presented in the basics tutorial as well as in thetutorial dedicated to sPCA, which you can access from the adegenet website:http://adegenet.r-forge.r-project.org/

or by typing:

adegenetTutorial("basics")

adegenetTutorial("spca")

24

http://adegenet.r-forge.r-project.org/

References

[1] L. L. Cavalli-Sforza, P. Menozzi, and A. Piazza. Demic expansions and human evolution.Science, 259:639–646, 1993.

[2] I. T. Joliffe. Principal Component Analysis. Springer Series in Statistics. Springer, secondedition, 2004.

[3] T. Jombart, S. Devillard, A.-B. Dufour, and D. Pontier. Revealing cryptic spatial patternsin genetic variability by a new multivariate method. Heredity, 101:92–103, 2008.

[4] K. Pearson. On lines and planes of closest fit to systems of points in space. PhilosophicalMagazine, 2:559–572, 1901.

25

Documents

Multivariate analysis of genetic data: investigating spatial ...adegenet.r-forge.r-project.org/files/Glasgow2015/...Multivariate analysis of genetic data: investigating spatial structures