Can we infer historical population movements from principal component analysis of genetic data?

Saharon Rosset

Cover of Science, 1.9.1978
Menozzi, P., Piazza, A. & Cavalli-Sforza, L. Synthetic Maps of Human Gene Frequencies in Europeans. Science 201, 786-792 (1978).

How was the map generated?
Menozzi et al. collected information on genetic markers in 67 European populations.
Before DNA, markers were blood types, HLA types, etc. Total: 38 markers (columns in the data matrix).
They performed a principal components analysis (PCA), describing the directions of maximum variance in the data (i.e., what genetic patterns best separate populations).
They plotted the projections of the top PCs on a map and looked for patterns.
The top PC is on the cover, and shows a clear pattern radiating out of the Middle East.

What does the map tell us?
Main conclusion of Menozzi et al.: the leading contribution to the modern European gene pool is from a migration out of the Middle East.
Obvious candidate: the Neolithic (farming) expansion, circa 6000 BC.
We have archaeological evidence of the spread of farming; is there now also genetic evidence that people moved and replaced ancestral European populations?

In 1994, the seminal book by Cavalli-Sforza et al.* performed similar analyses for all continents, and reached significant conclusions.
* For the rest of this talk, the 78 and 94 works combined will be referred to as Cavalli-Sforza et al.

Nagging 30-year-old questions

How can we differentiate statistical artifacts from real effects in such analyses?
How can we more specifically pinpoint which historical movements are responsible for the effects we see?

From Novembre & Stephens (Nature Genetics, 5/2008)
"Here, we find that gradients and waves observed in Cavalli-Sforza et al.'s maps resemble sinusoidal mathematical artifacts that arise generally when PCA is applied to spatial data, implying that the patterns do not necessarily reflect specific migration events."
Novembre, J. & Stephens, M. Interpreting principal component analyses of spatial population genetic variation. Nature Genetics 40, 646-649 (2008).

So, are the results real or mathematical artifacts? As we will see, Novembre and Stephens neglected to take some critical points into account and to re-analyze the original data

Our main goal: critically re-evaluate the results of the original paper and the claims of Novembre & Stephens

Other recent high-profile work on PCA in genetics
Since that 1978 breakthrough, PCA has played a significant role in various areas of genetics. Examples of recent high-impact work:
Novembre, J. et al. Genes mirror geography within Europe. Nature 456, 98-101 (6 November 2008). Similar goals to the 78 paper, except the data are BIGGER: ~340K SNPs on thousands of Europeans.
Price, A. L. et al. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics 38, 904-909 (2006). This work introduced EIGENSTRAT, a central tool in the search for functional associations in the genome, used to "whiten" the stratification signal.

Reminder: PCA and eigen-analysis
Assume we have an $N \times K$ matrix $X$ whose rows are the $x_i$'s.
$Z = X^T X$ is a symmetric $K \times K$ matrix, and its eigenvectors $v_1, v_2, \ldots, v_K \in \mathbb{R}^K$ are an orthogonal basis of $\mathbb{R}^K$ such that:
$Z v_k = \lambda_k v_k$, where $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_K \geq 0$ are the eigenvalues.
PC1 (maximal spread) is in fact defined by the direction $v_1$, and the portion of data spread it explains is $\lambda_1 / \sum_k \lambda_k$.
Similarly, $v_2$ defines PC2, and so on.

In the 78 paper, $\lambda_1 / \sum_k \lambda_k > 0.3$ and $\lambda_1 / \lambda_2 \geq 2$ (this will be important later!)
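The two summary quantities used throughout the talk (the share of spread explained by PC1 and the ratio of the top two eigenvalues) can be computed directly from the eigen-decomposition. The following is a minimal sketch on synthetic data, not the original analysis: the toy matrix, the added gradient, and the function name `pca_spectrum` are illustrative assumptions, and the columns are centered before forming $X^T X$, as is usual for PCA.

```python
import numpy as np

def pca_spectrum(X):
    """Eigenvalues (descending) and eigenvectors of Z = X^T X for column-centered X."""
    Xc = X - X.mean(axis=0)                  # center each column (marker); an assumption
    Z = Xc.T @ Xc                            # symmetric K x K matrix
    eigvals, eigvecs = np.linalg.eigh(Z)     # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]        # reorder so lambda_1 >= lambda_2 >= ...
    return eigvals[order], eigvecs[:, order]

# Toy stand-in for a 67-population x 38-marker matrix (synthetic, not the real data).
rng = np.random.default_rng(0)
X = rng.normal(size=(67, 38))
X += 2.0 * np.linspace(-1.0, 1.0, 67)[:, None] * rng.normal(size=(1, 38))  # add one gradient

lam, V = pca_spectrum(X)
frac_pc1 = lam[0] / lam.sum()                # portion of spread explained by PC1
ratio_12 = lam[0] / lam[1]                   # lambda_1 / lambda_2
print(f"PC1 share: {frac_pc1:.2f}, lambda_1/lambda_2: {ratio_12:.2f}")
```

The columns of `V` are the directions $v_1, v_2, \ldots$; projecting the centered rows onto `V[:, 0]` gives the PC1 scores that would be plotted on a map.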

Local migrations and PCA patterns
Assume we have a collection of populations with geographic organization, for example on a square or rectangular grid.
Each population has a low migration rate to/from its immediate neighbors.
Novembre & Stephens' main point: in this situation, if the grid is square or rectangular, the top PCs of genetic variation data will tend to be axis oriented, as in the Cavalli-Sforza et al. works.

A schematic of the local migration model
[Figure: 16 populations, Pop1 to Pop16, arranged on a regular grid]
Populations reside on a regular grid.
In every generation, a fraction of the individuals in each population (one minus the migration rate) stays, while the remaining fraction migrates to the neighbors and is replaced by migrants from there.

A schematic of the local migration model (ctd.)
[Figure: the same model on a smaller regular grid of nine populations, Pop1 to Pop9]
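One generation of the local migration model can be written as a matrix acting on a vector of allele frequencies. This is a minimal sketch under stated assumptions: the migration rate `m`, the even split of migrants among the available neighbors, and the function name `grid_migration_matrix` are illustrative choices, not details given in the talk, and genetic drift is left out.

```python
import numpy as np

def grid_migration_matrix(rows, cols, m):
    """M[i, j] = share of population i's next generation that comes from population j."""
    n = rows * cols
    M = np.zeros((n, n))
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c                 # row-major index of the population at (r, c)
            neighbors = [(r + dr, c + dc)
                         for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1))
                         if 0 <= r + dr < rows and 0 <= c + dc < cols]
            for rr, cc in neighbors:
                M[i, rr * cols + cc] = m / len(neighbors)   # incoming migrants
            M[i, i] = 1.0 - m                # fraction that stays
    return M

M = grid_migration_matrix(3, 3, m=0.1)
freqs = np.linspace(0.1, 0.9, 9)             # hypothetical allele frequencies, Pop1..Pop9
next_freqs = M @ freqs                       # expected frequencies after one generation
print(next_freqs.round(3))
```

Repeated application of `M` pulls neighboring populations toward each other, which is the mechanism behind the distance-dependent similarity discussed next.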

Resulting similarity (or covariance) between populations
Naturally, close-by populations will be more genetically similar in this setup, with similarity decaying quickly (typically exponentially) with distance.
This gives rise to a population covariance matrix $E(Z) = E(XX^T)$ (here $Z = XX^T$ is the population-by-population matrix) that has special structure and a wonderful name: $E(Z)$ will be a block Toeplitz with Toeplitz blocks matrix.
[Figure: the 9 x 9 covariance matrix for the 3 x 3 grid, with rows and columns ordered Pop1 to Pop9 and grouped into blocks by grid row (Row 1, Row 2, Row 3)]
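The block structure is easy to see on the 3 x 3 grid. The sketch below assumes a covariance that decays as `rho` raised to the grid (Manhattan) distance, which makes the matrix a Kronecker product of two small Toeplitz matrices; the decay form, the value of `rho`, and the function names are assumptions made for illustration only.

```python
import numpy as np

def toeplitz_decay(n, rho):
    """n x n Toeplitz matrix with entries rho ** |i - j|."""
    idx = np.arange(n)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def grid_covariance(rows, cols, rho):
    """Covariance rho ** (|row difference| + |column difference|), populations in row-major order."""
    return np.kron(toeplitz_decay(rows, rho), toeplitz_decay(cols, rho))

H = grid_covariance(3, 3, rho=0.5)

# blocks[a, b] is the 3 x 3 block relating grid row a to grid row b.
blocks = H.reshape(3, 3, 3, 3).swapaxes(1, 2)

# Block Toeplitz: a block depends only on the difference between the grid rows.
print(np.allclose(blocks[0, 1], blocks[1, 2]))
# Toeplitz blocks: inside a block, entries depend only on the column difference.
print(np.isclose(blocks[0, 1][0, 1], blocks[0, 1][1, 2]))
```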

Eigenvectors of Toeplitz matrices
Novembre and Stephens discuss the known structure of the top eigenvectors of such matrices.
Eigenvectors are given by the 2-dimensional discrete cosine transform.
The top population PCs are sinusoidal patterns with geographical structure, typically axis oriented.
Figure 1 of their paper demonstrates that this theory fits well with the results in the Cavalli-Sforza et al. works. Hence their conclusion.
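The sinusoidal shape of the eigenvectors can be checked numerically. For brevity, the sketch below uses a one-dimensional chain of populations rather than a grid, with an assumed covariance `rho ** |i - j|`; for a separable grid covariance the two-dimensional eigenvectors are Kronecker products of such one-dimensional ones. The chain length, `rho`, and the helper `dct_vector` are illustrative assumptions.

```python
import numpy as np

n, rho = 16, 0.9
idx = np.arange(n)
T = rho ** np.abs(idx[:, None] - idx[None, :])   # Toeplitz covariance of a 1-D chain

eigvals, eigvecs = np.linalg.eigh(T)             # ascending order
eigvecs = eigvecs[:, ::-1]                       # reorder to descending eigenvalues

def dct_vector(k, n):
    """k-th basis vector of the discrete cosine transform (DCT-II)."""
    return np.cos(np.pi * k * (np.arange(n) + 0.5) / n)

# Absolute correlation between eigenvectors 2..4 and the matching cosine vectors.
correlations = [abs(np.corrcoef(eigvecs[:, k], dct_vector(k, n))[0, 1])
                for k in range(1, 4)]
print([round(c, 3) for c in correlations])
```

The leading eigenvector (index 0) is a same-sign, slowly varying pattern and is skipped here because its cosine counterpart is constant.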

OK, so we are done, right? Not really!
Because N&S neglect to discuss two critical, related aspects:
Eigenvalues: what is the ratio between the % of variance explained by the top PC(s) and by the other PCs?
Variance: do we really expect the top PCs in real data to conform exactly to theory?

Once these two aspects are brought into the equation, it is no longer true that the results of C-S et al. are explained by the Toeplitz phenomenon!

Asymptotics of Toeplitz eigenvalues
Basic formula (Böttcher and Silbermann 99):
[Formula not preserved in this transcript]

$T$ is a block-Toeplitz matrix of $r \times r$ blocks, each of size $k \times k$.
$A$ is the symbol of the matrix $T$.
$|||A(e^{i\theta})|||$ is the operator norm.
Lots of technical conditions and limitations (e.g., the expression on the right has to be finite for the result to be meaningful).

In English: asymptotically, the top eigenvalues are close to each other.
For all results in the Cavalli-Sforza et al. works, we have $\lambda_1 / \lambda_2 \geq 2$.

Toeplitz eigenvalues (ctd.)

We can also make simple, non-asymptotic statements, such as: if the environment is a square and migration is only local, then by symmetry the North-South and East-West eigenvalues should be similar.
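The symmetry argument can be illustrated on a model covariance. This sketch assumes a separable covariance that decays as `rho` to the power of the grid distance (an assumption, not the talk's simulation) and works with the uncentered matrix, so the leading eigenvector is a same-sign "mean" pattern and the two eigenvalues after it belong to the two axis gradients.

```python
import numpy as np

def toeplitz_decay(n, rho):
    idx = np.arange(n)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def grid_covariance(rows, cols, rho):
    # Separable decay along the two grid axes (an illustrative assumption).
    return np.kron(toeplitz_decay(rows, rho), toeplitz_decay(cols, rho))

def top_eigenvalues(H, k=3):
    return np.linalg.eigvalsh(H)[::-1][:k]       # largest k eigenvalues, descending

square = top_eigenvalues(grid_covariance(8, 8, rho=0.9))
rect = top_eigenvalues(grid_covariance(6, 12, rho=0.9))

# On the square grid the second and third eigenvalues (the two gradients) coincide.
print("square:", square.round(4))
print("rectangle:", rect.round(4))
```

On the rectangular grid the symmetry between the two axes is gone, so there is no reason for the two gradient eigenvalues to coincide.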

It seems we should have two regimes:
For small data, variance could prevent us from recovering the population eigenvectors.
For large data, the asymptotics should prevent us from seeing extreme ratios for $\lambda_1 / \lambda_2$.

Understanding which scenario we are in: simulation
N&S simulate populations with local migration on a 15 x 15 grid, with 500 genetic markers.

Their Figure S1 shows a nice match of the data eigenvectors to the Toeplitz theory.

But is this representative of the Cavalli-Sforza et al. data, with only 67 populations and 38 markers? Not really.

What about the ratio $\lambda_1 / \lambda_2$ in their simulations?

Simulation results: the two scenarios
The big simulations are consistent with eigenvector directions but absolutely not with eigenvalue ratios: eigenvalue asymptotics at work!
The small simulations are barely consistent with eigenvalue ratios but not at all with eigenvector directions: variance at work.

Summary so far, next steps
The effects in Cavalli-Sforza et al. are too big to be explained by local correlations as suggested by Novembre and Stephens.
On similar simulations with no signal, the magnitude of $\lambda_1 / \lambda_2$ was smaller but comparable.
So are the geographic effects real and reproducible?
Key next step: critical reanalysis of the original data.
Does correct analysis still produce the same results?
Do modern approaches for inference on PCA (hypothesis testing, confidence intervals) confirm validity?

Data and analysis of the 78 paper
The original paper used the frequency of 21 Human Leukocyte Antigen (HLA) alleles in 67 West Eurasian populations, and 17 other markers in a different set of populations.
This required a complex interpolation scheme to get 21+17=38 variables on the same populations.
Many assumptions of PCA are violated by interpolations.
Realizing the difficulties, the authors also performed the analysis on the 67 x 21 matrix of HLA data.
Results were very similar.
Some basic assumptions of PCA are still violated.
Our task: reconstruct this HLA matrix and critically re-analyze it.

Chasing the HLA data from the original paper
HLA data was taken from 1970s books and articles.
Problem 1: locating them.
Problem 2: figuring out HLA nomenclature.
Finally managed to recover a 54 x 21 matrix (13 other populations found, but with inconsistent nomenclature).
Statistical challenges:
Different number of individuals from each population: rows do not have similar variance.
Different prevalence for each HLA marker: columns do not have mean 0 and the same variance.
HLA markers have complex dependence between them: columns are not independent.
But, thankfully, this dependence is weak (weaker than multinomial).

Idealized statistical model
Denote the among-populations $N \times N$ covariance matrix by $H_{pop}$.
Idealized PCA model: $X = H_{pop}^{1/2} Y$, where
$X_{N \times K}$ is the observed data,
$Y_{N \times K}$ is random noise (say all entries are i.i.d. Gaussian).
In other words, the matrix $XX^T = H_{pop}^{1/2} YY^T H_{pop}^{1/2}$ is a scaled noisy version of the matrix $H_{pop}$.
Characteristics of $X$ in this model:
i.i.d. columns with mean 0.
Rows have mean 0.
If $H_{pop}$ is regular (e.g., Toeplitz), then all entries are identically distributed.
Unless $H_{pop}$ is trivial (e.g., identity), rows are not independent.
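A small simulation makes the "scaled noisy version" statement concrete. This is a sketch under stated assumptions: the choice of $H_{pop}$ (an exponentially decaying grid covariance), the grid size, the number of markers, and the function names are all illustrative and not taken from the talk.

```python
import numpy as np

def toeplitz_decay(n, rho):
    idx = np.arange(n)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def sqrt_psd(H):
    """Symmetric square root of a positive semi-definite matrix."""
    w, U = np.linalg.eigh(H)
    return (U * np.sqrt(np.clip(w, 0.0, None))) @ U.T

def sample_data(H, n_markers, rng):
    """Draw X = H^{1/2} Y with i.i.d. standard normal entries in Y (N x K)."""
    Y = rng.standard_normal((H.shape[0], n_markers))
    return sqrt_psd(H) @ Y

rng = np.random.default_rng(1)
H_pop = np.kron(toeplitz_decay(8, 0.9), toeplitz_decay(8, 0.9))   # assumed 8 x 8 grid
X = sample_data(H_pop, n_markers=2000, rng=rng)

H_hat = X @ X.T / X.shape[1]                 # XX^T rescaled by the number of markers
rel_err = np.linalg.norm(H_hat - H_pop) / np.linalg.norm(H_pop)
print(f"relative error of XX^T / K as an estimate of H_pop: {rel_err:.3f}")
```

With fewer markers (say 38 instead of 2000) the same estimate becomes much noisier, which is the small-data regime discussed earlier.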

Matching data and model
Entry $X_{ij}$ in our data matrix has a $\mathrm{Bin}(n_i, p_{ij})$ distribution.
Assume $X_{ij}, X_{ik}$ (= columns) are independent (though this is not exactly true).

We can think of the $p_{ij}$'s as representing our idealized $X$ matrix.
Independent stochastic evolution of each marker based on (finite) population relationships.
Still have to worry about centralizing and standardizing them.

In this situation, the binomial noise in the actual $X_{ij}$'s is a nuisance that does not follow our model.

Matching data and model (ctd.)
First step: variance stabilizing transformation.
Fact: $\arcsin(\sqrt{X_{ij}/n_i})$ has variance approximately $1/(4n_i)$, independently of $p_{ij}$.
Second step: centralize columns.
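The variance-stabilizing property can be checked by simulation. The standard form of this transformation takes the arcsine of the square root of the observed proportion, and that is what the sketch below uses; the sample size `n`, the probabilities, and the function name are illustrative assumptions.

```python
import numpy as np

def variances(n, p, rng, n_draws=200_000):
    """Monte Carlo variance of the raw proportion and of its arcsine square-root transform."""
    x = rng.binomial(n, p, size=n_draws)
    raw = np.var(x / n)                              # depends on p: p(1-p)/n
    stabilized = np.var(np.arcsin(np.sqrt(x / n)))   # approximately 1/(4n) for any p
    return raw, stabilized

rng = np.random.default_rng(0)
n = 100
target = 1.0 / (4 * n)
results = {p: variances(n, p, rng) for p in (0.1, 0.3, 0.5)}
for p, (raw, stab) in results.items():
    print(f"p={p}: raw variance {raw:.5f}, stabilized {stab:.5f}, target {target:.5f}")
```

The approximation is a large-sample one; it degrades when `n` is small or when `p` is very close to 0 or 1.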

Result: approximately i.i.d. columns and approximately mean-0 rows, but the rows still have different variances.

Interpretation: our standardized data has the form $X = H_{pop}^{1/2} Y + Z$, with $Z_{ij} \sim N(0, 1/(4n_i))$ independent (standard deviation $1/(2\sqrt{n_i})$).
Geometrically: perturbed versions of our desired data, but with small noise.

Bad third steps: standardize columns, or multiply each row $i$ by $2\sqrt{n_i}$ to standardize variance.
Problem: this scales our signal in the $p_{ij}$'s.

Comparison: first PC in 78 vs ours
Main messages remain the same:
Exceptionally high variance in first PC.
Clear correlation with North/South axis or distance from Middle East.
Conclusion: the main findings of Cavalli-Sforza et al. withstand a re-analysis.

Data        % var in PC1    Cor. dist from ME*    Cor. N/S axis       Cor. E/W axis
Original    39% [32-44]     0.82 [0.66-0.9]       0.88 [0.75-0.94]    0.32
Steps 1-2   40% [33-45]     0.82 [0.67-0.9]       0.87 [0.74-0.93]    0.33

* dist from ME: distance from Middle East (Baghdad)
In brackets: bootstrap 95% confidence intervals
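A percentile bootstrap is one way to obtain intervals like those in the table. The talk does not say which units were resampled or which bootstrap variant was used, so the sketch below is only an assumption-laden illustration: it resamples markers (columns) with replacement on a synthetic 54 x 21 matrix, and the function names are hypothetical.

```python
import numpy as np

def pc1_fraction(X):
    """Share of total variance carried by the first PC of column-centered X."""
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)      # singular values, descending
    return s[0] ** 2 / np.sum(s ** 2)

def bootstrap_ci(X, stat, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap interval, resampling columns (markers) with replacement."""
    rng = np.random.default_rng(seed)
    K = X.shape[1]
    reps = np.array([stat(X[:, rng.integers(0, K, size=K)]) for _ in range(n_boot)])
    lo, hi = np.quantile(reps, [(1 - level) / 2, (1 + level) / 2])
    return lo, hi

# Synthetic stand-in: one geographic gradient plus noise (not the HLA data).
rng = np.random.default_rng(42)
gradient = np.linspace(-1.0, 1.0, 54)[:, None]
X = gradient * rng.normal(1.0, 0.3, size=(1, 21)) + rng.normal(0.0, 0.5, size=(54, 21))

estimate = pc1_fraction(X)
lo, hi = bootstrap_ci(X, pc1_fraction)
print(f"% var in PC1: {100 * estimate:.0f}% [{100 * lo:.0f}-{100 * hi:.0f}]")
```

The same `bootstrap_ci` helper could be applied to other statistics, such as the correlation between PC1 scores and a geographic coordinate, by passing a different `stat` function.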

Conclusions
PCA was, is (and will be?) an important tool in genetics.
The works of Cavalli-Sforza et al. show geographic patterns of genetic variation that cannot be discarded as either artifacts or coincidence due to variance.
We have also reanalyzed parts of their data to verify that their PCA results hold up to renewed examination.
Bootstrap study also confirms validity of main findings (not shown here).
Can we differentiate local migration that consistently prefers North-South to East-West from a real expansion?
If yes, is this really the Neolithic expansion?
Novembre & Stephens do point to an intriguing connection between local migration and global effects.

[Figure: PC1, PC2 and PC3 for the N&S big simulation and for the small simulation]
