Population Stratification
Qunyuan Zhang
Division of Statistical Genomics
GEMS Course M21-621 Computational Statistical Genetics
Mar. 24, 2011
https://dsgweb.wustl.edu/qunyuan/presentations/PopStrat2011.pptx
1
What is Population Stratification (PS)?
In the narrow sense, PS is the presence of a systematic difference in allele frequencies between subpopulations in a population, possibly due to different ancestry or origins, especially in the context of genetic association studies. Population stratification is also referred to as population structure.
In the broad sense, PS can be regarded as the presence of a difference in relatedness between individuals in a population, due to different subpopulations, family/pedigree structure and/or cryptic relatedness.
2
PS & False Positives
False Positives (inflation)
An association can be due to the underlying structure of the population, even when there is no disease-locus association.
3
An Example of a PS-caused False Positive

Sub-population 1
         case   control   total   risk
A          72         8      80    9/1
a          18         2      20    9/1
total      90        10     100    9/1

Sub-population 2
         case   control   total   risk
A           3        27      30    1/9
a           7        63      70    1/9
total      10        90     100    1/9

Mixed population
         case   control   total   risk
A          75        35     110   2.14
a          25        65      90   0.38
total     100       100     200   1.00
• No disease-locus association.
• Risk difference between sub-populations.
• Allele Frequency difference between sub-populations.
• False disease-locus association in the mixed population (any allele with a higher frequency in the higher-risk sub-population appears to be a risk allele).
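The arithmetic behind the example can be checked with a short sketch. It uses the counts from the tables above and compares the allele A versus allele a odds ratio within each sub-population to the one obtained after pooling. The variable and function names are illustrative only.

```python
# Counts (cases, controls) taken from the example tables.
strata = {
    "Sub-population 1": {"A": (72, 8), "a": (18, 2)},
    "Sub-population 2": {"A": (3, 27), "a": (7, 63)},
}

def odds_ratio(table):
    """Odds ratio of allele A versus allele a for one 2x2 table."""
    (a_case, a_ctrl), (b_case, b_ctrl) = table["A"], table["a"]
    return (a_case * b_ctrl) / (a_ctrl * b_case)

# Pool the two sub-populations by adding counts cell by cell.
mixed = {
    allele: tuple(sum(s[allele][k] for s in strata.values()) for k in (0, 1))
    for allele in ("A", "a")
}

per_stratum_or = {name: odds_ratio(t) for name, t in strata.items()}
mixed_or = odds_ratio(mixed)

for name, value in per_stratum_or.items():
    print(f"{name}: OR = {value:.2f}")   # 1.00 in both sub-populations
print(f"Mixed population: OR = {mixed_or:.2f}")  # 5.57
```

The odds ratio is exactly 1 inside each sub-population, yet pooling produces an apparent odds ratio of about 5.57 for allele A.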
4
Mantel-Haenszel Test for Stratification
Adjusted RR
Standard error
Chi-square test
An Example
[Equations (1)-(3): not recoverable from the extracted slide]
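As a rough illustration of a stratified test, the sketch below applies the textbook Mantel-Haenszel pooled odds ratio and the Cochran-Mantel-Haenszel chi-square (without continuity correction) to the two sub-population tables from the earlier example. This is an assumption about the intended method: the slide names an adjusted RR, and the odds-ratio form is used here instead. The function name is hypothetical.

```python
# Each stratum is (a, b, c, d):
#   a = cases with allele A,  b = controls with allele A,
#   c = cases with allele a,  d = controls with allele a.
strata = [(72, 8, 18, 2), (3, 27, 7, 63)]

def mantel_haenszel(tables):
    """Return (pooled odds ratio, CMH chi-square without continuity correction)."""
    num = den = 0.0
    observed = expected = variance = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        num += a * d / n
        den += b * c / n
        row1, row2 = a + b, c + d
        col1, col2 = a + c, b + d
        observed += a
        expected += row1 * col1 / n
        variance += row1 * row2 * col1 * col2 / (n * n * (n - 1))
    return num / den, (observed - expected) ** 2 / variance

# Stratified analysis: adjusts for sub-population.
or_mh, chi2_mh = mantel_haenszel(strata)

# Crude analysis: the pooled table treated as a single stratum.
crude_or, crude_chi2 = mantel_haenszel([(75, 35, 25, 65)])
```

With stratification the pooled odds ratio is 1 and the chi-square is 0, while the crude table gives an odds ratio of about 5.57 and a large chi-square.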
5
Linear Model
Marker data
Q is variously called:
• Population structure variable
• Genetic background variable
• Membership variable
• Subgroup/sub-population variable
• Ancestry/admixture proportion variable
Usually Q is unknown and needs to be estimated.
6
Estimating Q by Eigen-analysis
References: Patterson et al. 2006, Price et al. 2006 (software EIGENSTRAT)

Genotype matrix X (rows: SNPs, columns: individuals)
        idv1   idv2   idv3
snp1       0      2      1
snp2       1      2      2
snp3       0      0      1
snp4       0      1      0
snp5       2      0      0

X = U S V^T

U
   -0.55    0.33    0.34
   -0.78   -0.10   -0.27
   -0.16    0.04   -0.71
   -0.20    0.14    0.52
   -0.15   -0.93    0.20

S (singular values)
    3.81    0.00    0.00
    0.00    2.05    0.00
    0.00    0.00    1.13

V (columns are the eigenvectors of COV(X))
      Q1      Q2      Q3
   -0.28   -0.95    0.11
   -0.75    0.29    0.59
   -0.60    0.08   -0.80

S^2 (eigenvalues)
   14.51    0.00    0.00
    0.00    4.21    0.00
    0.00    0.00    1.28
Or SAS Proc PRINCOMP; R svd() and eigen()
7
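The leading eigenvalue and eigenvector on this slide can be reproduced without a linear-algebra library. The sketch below forms the uncentered cross-product X^T X from the slide's 5-SNP by 3-individual genotype matrix and runs power iteration on it. Power iteration is a stand-in chosen for brevity, not the method used by EIGENSTRAT, and the function names are illustrative.

```python
# Genotype matrix from the slide: rows are SNPs, columns are individuals.
X = [
    [0, 2, 1],
    [1, 2, 2],
    [0, 0, 1],
    [0, 1, 0],
    [2, 0, 0],
]

def cross_product(X):
    """Return X^T X, the (uncentered) individual-by-individual matrix."""
    n = len(X[0])
    return [[sum(row[i] * row[j] for row in X) for j in range(n)] for i in range(n)]

def leading_eigenpair(M, iterations=200):
    """Power iteration: largest eigenvalue of M and its unit eigenvector."""
    v = [1.0] * len(M)
    for _ in range(iterations):
        w = [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]
        norm = sum(w_i * w_i for w_i in w) ** 0.5
        v = [w_i / norm for w_i in w]
    Mv = [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]
    eigenvalue = sum(a * b for a, b in zip(v, Mv))  # Rayleigh quotient
    return eigenvalue, v

eigenvalue, q1 = leading_eigenpair(cross_product(X))
singular_value = eigenvalue ** 0.5  # first singular value of X
```

The result agrees with the first eigenvalue (about 14.51), the first singular value (about 3.81) and the Q1 column, up to sign: an eigenvector is only defined up to a factor of -1, so the values come out positive here while the slide lists them as negative.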
Eigen-analysis of HapMap Populations
[Figure: HapMap individuals plotted on axes Q1 and Q2]
8
Estimating Q by MLE (for admixed population)
G: Observed genotypes of admixed [and parental populations]
Q: Allelic frequencies in parental populations
P: Individual membership to be estimated
Goal: obtain P that maximizes Pr(G|P,Q)
1. Assign starting values for Q (randomly or estimated from parental population genotype data) & P (randomly)
2. Compute P(i) by solving
   $$\frac{\partial \Pr(G \mid P, Q)}{\partial P} = 0$$
3. Compute Q(i) by solving
   $$\frac{\partial \Pr(G \mid P, Q)}{\partial Q} = 0$$
4. Iterate Steps 2 and 3 until convergence.
Tang et al. Genetic Epidemiology, 2005(28): 289–301
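The update of P with Q held fixed can be illustrated for a single individual and two parental populations. The sketch below is a simplification under stated assumptions, not the algorithm of Tang et al.: alleles are treated as independent binomial draws, the allele frequencies are hypothetical, and a grid search replaces solving the derivative equation.

```python
import math

def log_likelihood(p, genotypes, freq_pop1, freq_pop2):
    """Log Pr(G | P, Q) for one individual, up to an additive constant.

    p is the individual's membership in parental population 1, and each
    genotype counts copies (0, 1, 2) of the reference allele at a locus.
    """
    total = 0.0
    for g, q1, q2 in zip(genotypes, freq_pop1, freq_pop2):
        f = p * q1 + (1.0 - p) * q2          # expected allele frequency for this individual
        f = min(max(f, 1e-12), 1.0 - 1e-12)  # keep log() finite
        total += g * math.log(f) + (2 - g) * math.log(1.0 - f)
    return total

def estimate_membership(genotypes, freq_pop1, freq_pop2, grid_size=1000):
    """Grid-search MLE of P on [0, 1] with Q held fixed."""
    grid = [i / grid_size for i in range(grid_size + 1)]
    return max(grid, key=lambda p: log_likelihood(p, genotypes, freq_pop1, freq_pop2))

# Hypothetical allele frequencies (Q) at ten loci in two parental populations.
freq_pop1 = [0.9] * 10
freq_pop2 = [0.1] * 10

p_pop1_like = estimate_membership([2] * 10, freq_pop1, freq_pop2)
p_pop2_like = estimate_membership([0] * 10, freq_pop1, freq_pop2)
p_admixed = estimate_membership([2] * 5 + [0] * 5, freq_pop1, freq_pop2)
```

A full implementation would alternate this step with an update of Q across all individuals and repeat until the likelihood stops improving.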
9
Estimating Q by MCMC (for admixed population)
Observed G: genotypes of admixed [and parental populations]
Unknown Z: admixed individuals’ membership from ancestral populations
Problem: How to estimate Z?
Bayesian and Markov Chain Monte Carlo (MCMC) methods
1. Assume ancestral population number K (see next slide)
2. Define prior distribution Pr(Z) under K
3. Use MCMC to sample from posterior distribution Pr(Z|G) ∝ Pr(Z) ∙ Pr(G|Z)
4. Average over a large number of MCMC samples to obtain an estimate of Z
Falush et al. Genetics, 2003(164):1567–1587
Software: STRUCTURE
10
Infer Population Number (K)
11
Linear Model (an example including m Q-variables)
$$y = ax + b + b_1 Q_1 + b_2 Q_2 + \dots + b_m Q_m + e$$
$$y = ax + b + \sum_{i=1}^{m} b_i Q_i + e$$
SAS Proc REG, Proc GENMOD; R lm(), glm()
(Proc GENMOD and glm() are generalized and can fit binary/categorical y)
12
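To see why adding a Q-variable matters, the sketch below fits a trait on a marker with and without a sub-population indicator. The data are contrived so that the trait depends only on sub-population, and the small least-squares solver is written out by hand to keep the example self-contained. In practice one would use lm() or glm() as listed above; every name and number here is hypothetical.

```python
def ols(columns, y):
    """Least-squares coefficients for y ~ 1 + columns, via the normal equations."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(rows[0])
    # Build the augmented system [X'X | X'y].
    A = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        pivot = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[pivot] = A[pivot], A[c]
        A[c] = [v / A[c][c] for v in A[c]]
        for r in range(k):
            if r != c:
                factor = A[r][c]
                A[r] = [vr - factor * vc for vr, vc in zip(A[r], A[c])]
    return [A[i][k] for i in range(k)]

# Contrived data: two sub-populations (Q = 0 or 1) of six individuals each.
Q = [0] * 6 + [1] * 6
x = [0, 0, 0, 1, 0, 1, 2, 1, 2, 2, 1, 2]   # allele counts, more common when Q = 1
noise = [0.1, -0.1, 0.0, 0.1, 0.0, -0.1, 0.1, 0.1, -0.1, 0.0, -0.1, 0.0]
y = [1.0 + 2.0 * q + e for q, e in zip(Q, noise)]   # trait depends on Q only

naive = ols([x], y)        # [intercept, a]
adjusted = ols([x, Q], y)  # [intercept, a, b_1]
```

On these data the naive marker effect is 1.0, while the effect adjusted for Q is 0 and the Q coefficient recovers the true value of 2.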
Unified Mixed Model (more general)
Model terms:
• Covariate(s)
• SNP(s)
• Inferred population membership
• ID matrix
V = Z G Z' + R
Modeling the resemblance among individuals
13
Multi-Variate Normal Distribution (MVN) & Likelihood of Mixed Model
Based on MVN, the likelihood of trait (y) in a matrix form is:
$$L = (2\pi)^{-n/2}\,|V|^{-1/2}\exp\left[-\tfrac{1}{2}(y-\mu)'V^{-1}(y-\mu)\right]$$
n: no. of individuals (in a pedigree)
V: n×n variance-covariance matrix
y: phenotype vector
μ: mean phenotype vector
V = Z G Z' + R
$$V = 2\Phi\sigma_a^2 + I\sigma_e^2$$
Φ: Kinship (IBD) matrix (n×n)
14
Kinship
Inbreeding Coefficient
The inbreeding coefficient of an individual is the probability that the pair of alleles carried by the gametes that produced it are Identical By Descent (IBD).
Identical By Descent (IBD)
Two alleles are IBD if they are copies of the same ancestral allele.
Kinship/Coancestry
The inbreeding coefficient of an individual is equal to the coancestry between its parents. For example, if parents X and Y have a child Z, then
inbreeding coefficient of Z = coancestry between X and Y
Software: SAS (PROC INBREED), MERLIN, SPAGeDi, R (kinship, emma), etc. (need pedigree and/or marker data)
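The relationship between coancestry and inbreeding can be made concrete with the standard recursive rule for kinship on a pedigree. The sketch below assumes a small hypothetical pedigree in which X and Y are full siblings and Z is their child. It is not the algorithm of any package listed above, and it relies on parents being listed before their children.

```python
from functools import lru_cache

# Hypothetical pedigree: id -> (father, mother); founders have (None, None).
# Individuals are listed so that parents always appear before their children.
PEDIGREE = {
    "X_father": (None, None),
    "X_mother": (None, None),
    "X": ("X_father", "X_mother"),
    "Y": ("X_father", "X_mother"),   # X and Y are full siblings
    "Z": ("X", "Y"),                 # Z is the child of X and Y
}
ORDER = {name: i for i, name in enumerate(PEDIGREE)}

@lru_cache(maxsize=None)
def kinship(i, j):
    """Probability that random alleles drawn from i and j are IBD."""
    if i is None or j is None:
        return 0.0                     # unknown parents are treated as unrelated
    if ORDER[i] < ORDER[j]:
        i, j = j, i                    # recurse through the later-listed individual
    father, mother = PEDIGREE[i]
    if i == j:
        return 0.5 * (1.0 + kinship(father, mother))
    return 0.5 * (kinship(father, j) + kinship(mother, j))

def inbreeding(i):
    """Inbreeding coefficient = coancestry between the parents."""
    return kinship(*PEDIGREE[i])

print(kinship("X", "Y"))   # 0.25 for full siblings
print(inbreeding("Z"))     # 0.25, equal to the coancestry of X and Y
```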
15
Kinship Matrix (expected probability of allele sharing among relatives)
16
Resources for Mixed Model with Kinship Matrix

Software     Kinship          Mixed Model      Data
SAS          Proc INBREED     Proc MIXED       Quantitative trait; pedigree data
SAS          Proc INBREED     Proc GLIMMIX     Quantitative/qualitative trait; pedigree data
R: kinship   makekinship()    lmekin()         Quantitative trait; pedigree data
R: emma      emma.kinship()   emma.REML.t()    Quantitative trait; uses marker data to calculate kinship
EMMAX        emmax-kin        emmax
17
Diagnosis of Inflation of False Positives
• Inflation: more false positives than expected under the null
• In GWAS, usually due to PS
• Can be caused by inappropriate statistical methods even with no PS
• May (not necessarily) indicate PS
18
Theoretical Basis of Diagnosis
Uniform distribution [0,1] of p-values under the null
[Figures: histogram of p-values and Q-Q plot of -log10(p), each shown for "inflation" and "no inflation"]
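The coordinates of a Q-Q plot follow directly from the uniform null distribution. The sketch below pairs each observed -log10(p) with its expected value, using (i - 0.5)/n as the expected i-th smallest p-value. That plotting position is one common convention and is an assumption here; the p-values are synthetic.

```python
import math

def qq_points(p_values):
    """Pair each observed -log10(p) with its expectation under Uniform[0, 1]."""
    n = len(p_values)
    observed = sorted((-math.log10(p) for p in p_values), reverse=True)
    # The i-th smallest of n uniform p-values is expected near (i - 0.5) / n.
    expected = [-math.log10((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(expected, observed))

# Synthetic p-values that are exactly uniform: every point lies on y = x.
uniform_p = [(i - 0.5) / 100 for i in range(1, 101)]
points = qq_points(uniform_p)

# Synthetic inflated p-values (too small): points rise above the diagonal.
inflated_points = qq_points([p * p for p in uniform_p])
```

Plotting observed against expected for real data and comparing with the diagonal gives the visual check for inflation.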
19
Inflation Rate (IR)
For Binary Trait
For Continuous Trait
Amin, Duijn, Aulchenko, 2007
Devlin et al. 2004
20
Genomic Control (by IR)
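For Binary Trait
$$Y_i^2 = \chi_i^2$$
For Continuous Trait
$$Y_i^2 = (t_i)^2$$
Or based on p-value
$$Y_i^2 = \chi^2_{(1-p_i,\ df=1)}$$
$$\tilde{Y}_i^2 = \frac{Y_i^2}{\hat{\lambda}} \sim \chi^2_{df=1}$$
$$\tilde{p}_i = \mathrm{Prob}\left(\chi^2_{df=1} > \tilde{Y}_i^2\right)$$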
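Genomic control can be sketched in a few lines. The version below assumes the widely used median-based estimate of the inflation rate (median of the observed 1-df statistics divided by the median of a 1-df chi-square distribution, about 0.455). That choice is an assumption, since the slide's own IR formulas were not recovered. Function names and input values are hypothetical.

```python
import math
import statistics

CHI2_DF1_MEDIAN = 0.4549364  # median of a chi-square distribution with 1 df

def chi2_df1_upper_tail(x):
    """Pr(chi-square with 1 df > x)."""
    return math.erfc(math.sqrt(x / 2.0))

def genomic_control(y2_values):
    """Return (inflation rate, adjusted statistics, adjusted p-values)."""
    inflation = statistics.median(y2_values) / CHI2_DF1_MEDIAN
    adjusted = [y2 / inflation for y2 in y2_values]
    return inflation, adjusted, [chi2_df1_upper_tail(y2) for y2 in adjusted]

# Hypothetical 1-df test statistics, one per SNP.
y2_values = [0.2, 0.91, 3.0]
inflation, adjusted, adjusted_p = genomic_control(y2_values)
```

For a continuous trait the inputs would be squared t statistics, and when only p-values are available they can first be converted back to 1-df chi-square quantiles.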
21
Practice
• Download and unzip the data from dsgweb.wustl.edu/qunyuan/data/popstra2011hw.zip
• Ignore pedigree.csv; test each SNP in snp.csv for association (with the trait in trait.csv);
• Investigate the p-values to see if there is any inflation;
• Try to explain why;
• List some possible methods to reduce or control the inflation;
• Choose one method and apply it to the data;
• Does it work?
• Try to explain why.
• Clearly document each step of your analysis.
There is no standard answer; feel free to try anything you like!
Report back to [email protected] and [email protected] in one week. Thanks!
22