
Confounding from Cryptic Relatedness in Association Studies Benjamin F. Voight (work jointly with JK Pritchard)


Importance

Case/control association tests are becoming increasingly popular for identifying genes that contribute to human disease.

These tests can be susceptible to false positives if the underlying statistical assumptions are violated, i.e. independence among all sampled alleles used in the test for association.

It is well appreciated that population structure results in false positives (Knowler et al., 1988; Lander and Schork, 1994).

Methods exist that correct for this effect (Devlin and Roeder, 1999; Pritchard and Rosenberg, 1999; Pritchard et al., 2000).

Your (favorite) Population

Obtain a sample of affected cases from the population.

Cases are not independent draws from the population allele frequencies.

Problem: the relatedness is cryptic, so the investigator does not know about the relationships in advance.

Importance

Devlin and Roeder (1999) have argued that if one is doing a genetic association study, then surely one must believe that the trait of interest has a genetic basis that is at least partially shared among affected individuals. Given that cases share a set of risk factors by descent, they are presumably more related to one another than to random controls.

These authors presented numerical examples suggesting that this effect may be an important factor in practice.

However, these examples were artificially constructed and not modeled on any population-based process.

Few empirical data indicate whether cryptic relatedness negatively impacts association studies. In a founder population, non-independence resulting from relatedness does matter (Newman et al., 2001).

Goals

Determine whether, or when, cryptic relatedness is likely to be a problem for general applications.

Develop a formal model for cryptic relatedness in a population genetics framework.

In a founder population, estimate the inflation factor due to (cryptic) relatedness, and compare to analytical results.

Avoid staring at “x” in front of a chalkboard.

Modeling Definitions

$m$ affected individuals and $m$ random controls, sampled in the current generation.

Pairs of chromosomes coalesce in a previous generation $t = 1, 2, \ldots, \tilde{t}$ with the usual probabilities (here $N$ is the population size):

$$P[t = \tilde{t}] = \frac{1}{2N}\left(1 - \frac{1}{2N}\right)^{\tilde{t}-1}$$

All samples are typed at a single bi-allelic locus, unlinked to disease, with alleles B and b, at frequencies $p$ and $(1-p)$ in the population.

Definitions

Define:

$K_p$: population prevalence of disease.

$K_t$: probability that a relative of type $t$ (or $\tilde{t}$) of an affected proband is also affected.

$\lambda_t$: recurrence risk ratio, $K_t / K_p$ (Risch, 1990).

$G_i^{(a)}$: indicator (0 or 1) for the B allele on homologous chromosome $a$ for the $i$-th case (with $a \in \{1, 2\}$ for diploid individuals).

$H_j^{(a)}$: as above, but for the $j$-th random control.

Define a test statistic that measures the difference in allele counts between cases and controls (slightly modified from Devlin and Roeder, 1999):

$$T = \sum_{i=1}^{m} G_i^{(1)} + \sum_{i=1}^{m} G_i^{(2)} - \sum_{j=1}^{m} H_j^{(1)} - \sum_{j=1}^{m} H_j^{(2)}.$$

Under the null hypothesis of no association between the marker and phenotype, an allele is of type B with probability $p$, independently for all alleles in the sample. If so,

$$\mathrm{Var}[T] = 4mp(1-p).$$

If cryptic relatedness exists in the sample, then the variance of the test statistic, call this $\mathrm{Var}^*[T]$, may exceed the variance under the null. We measure the deviation from the null variance using the “inflation factor” $\lambda$:

$$\lambda = \frac{\mathrm{Var}^*[T]}{4mp(1-p)}.$$
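The null variance $4mp(1-p)$ follows from treating all $4m$ sampled alleles as independent Bernoulli($p$) draws. A minimal Monte Carlo sketch of that claim is given here; the values of `m`, `p`, the seed, and the number of replicates are illustrative assumptions, not settings from the talk.

```python
import random
from statistics import variance

def simulate_T(m, p, rng):
    """Draw one value of T under the null: every allele is B with
    probability p, independently of all other alleles."""
    case_count = sum(rng.random() < p for _ in range(2 * m))     # 2m case alleles
    control_count = sum(rng.random() < p for _ in range(2 * m))  # 2m control alleles
    return case_count - control_count

def null_variance(m, p):
    """Analytical variance of T under independence: 4 m p (1 - p)."""
    return 4 * m * p * (1 - p)

# Illustrative (assumed) parameter values.
rng = random.Random(42)
m, p = 50, 0.3
draws = [simulate_T(m, p, rng) for _ in range(5000)]
empirical = variance(draws)
print(f"empirical Var[T] = {empirical:.2f}, analytical = {null_variance(m, p):.2f}")
```

With independent alleles the two numbers should agree up to sampling noise; relatedness among cases would push the empirical variance above the analytical value, which is what the inflation factor measures.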

[Figure: chi-squared probability density (x-axis: Chi Squared Value, 0 to 20; y-axis: Probability Density, 0 to 0.25) for $\lambda$ = 1.0 (no inflation factor), $\lambda$ = 1.5, and $\lambda$ = 2.0, with the 1% error rate threshold marked.]

$\lambda$    Type-I nominal ($\alpha$)    Fold-Error Rate
1.0    .05                      1.00
1.0    .01                      1.00
1.5    .05                      ~2.19
1.5    .01                      ~3.55
2.0    .05                      ~3.32
2.0    .01                      ~6.88
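The fold-error rates can be approximated with a short calculation. This sketch assumes the test statistic is a 1-degree-of-freedom chi-squared variable multiplied by $\lambda$ (an assumption about how the slide's numbers were derived); the function name `fold_error_rate` is hypothetical.

```python
from math import erfc, sqrt
from statistics import NormalDist

def fold_error_rate(lam, alpha):
    """Actual type-I error rate divided by the nominal rate alpha, when a
    chi-squared(1) statistic is inflated by the factor lam."""
    # The chi-squared(1) critical value at level alpha equals z**2, where z
    # is the two-sided standard normal quantile.
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    # Assumption: the inflated statistic is lam * chi2(1), so it exceeds z**2
    # whenever the underlying chi2(1) exceeds z**2 / lam.
    # For chi2(1), P[X > x] = erfc(sqrt(x / 2)).
    actual = erfc(z / sqrt(2.0 * lam))
    return actual / alpha

for lam in (1.0, 1.5, 2.0):
    for alpha in (0.05, 0.01):
        print(f"lambda={lam:.1f}  alpha={alpha:.2f}  fold={fold_error_rate(lam, alpha):.2f}")
```

Under this assumption the output should land close to the tabulated values; small differences in the last digit are possible depending on rounding.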

Recall that we want the variance of our test statistic, T, under a model of cryptic relatedness:

$$\mathrm{Var}^*[T] = \mathrm{Var}\left[\sum_{i=1}^{m} G_i^{(1)} + \sum_{i=1}^{m} G_i^{(2)} - \sum_{j=1}^{m} H_j^{(1)} - \sum_{j=1}^{m} H_j^{(2)}\right].$$

Use the following non-dodgy assumptions:

1. Draws of alleles from the population are simple Bernoulli trials. (Variance terms)

2. Controls are a random sample from the population. (Covariance terms with $H_j$'s are 0)

3. Allow the possibility that cases and controls depart from Hardy-Weinberg proportions by some factor, call this F. (Covariance terms for alleles in the same individual)

4. For the mutational model,
a. Suppose the mutation process is the same for cases and random controls.
b. Conditional on a case and random chromosome having a very recent coalescent time (on the order of 1-10 generations), assume that the chance that the alleles are in different states is 0.

Then after …

JKP attempts desperately to keep me honest.

Me, after many hours of intensive thought processing.

Smoke from my brain.

$\mathrm{Var}^*[T]$ can be simplified to:

$$\mathrm{Var}^*[T] = 4mp(1-p)(1+F) + 4m(m-1)\,\mathrm{Cov}\left[G_i^{(a)}, G_{i'}^{(a')}\right],$$

where $i \neq i'$.

And now, we evaluate the covariance term under a model of cryptic relatedness. This covariance term is fairly complicated, but it is related to the following probability:

$$P\left[t_{(i,a),(i',a')} = \tilde{t} \mid i, i' \text{ aff}\right],$$

which denotes the probability that allele copies $a$ and $a'$ from individuals $i$ and $i'$ coalesce in time $\tilde{t}$, conditional on the proposition that individuals $i$ and $i'$ are both affected (with $i \neq i'$). For short, write this as $P[t_{ii'} = \tilde{t} \mid i, i' \text{ aff}]$. So what's this probability?

Apply some Bayesian Trickery:

$$P\left[t_{ii'} = \tilde{t} \mid i, i' \text{ aff}\right] = \frac{P[t_{ii'} = \tilde{t}]\; P[i, i' \text{ aff} \mid t_{ii'} = \tilde{t}]}{P[i, i' \text{ aff}]}$$

$$= \frac{P[t_{ii'} = \tilde{t}]\; P[i \text{ aff} \mid i' \text{ aff},\, t_{ii'} = \tilde{t}]\; P[i' \text{ aff}]}{P[i \text{ aff}]\; P[i' \text{ aff}]}$$

$$= P[t_{ii'} = \tilde{t}]\,\frac{K_{\tilde{t}}\, K_p}{K_p^2} = P[t_{ii'} = \tilde{t}]\,\lambda_{\tilde{t}}$$

The factor $P[t_{ii'} = \tilde{t}]$ depends on the population model (not on phenotype); the factor $\lambda_{\tilde{t}}$ depends on the genetic model.

… and after some plug and play we finally get:

$$\lambda = 1 + F + (m-1) \sum_{\tilde{t} \geq 1} P[t_{ii'} = \tilde{t}]\,(\lambda_{\tilde{t}} - 1)$$

Under an additive model

Handy relationship between any $\lambda_r$ and the sibling recurrence risk ratio $\lambda_s$, a single parameter under an additive model (Risch, 1990):

$$(\lambda_r - 1) = 4\varphi_r(\lambda_s - 1),$$

where $\varphi_r$ is the kinship coefficient for type-$r$ relatives, which is ¼ for $r = 1$ and decays by ½ for each increment to $r$. Using this relationship we can simplify:

$$\lambda = 1 + F + (m-1)(\lambda_s - 1) \sum_{\tilde{t} \geq 1} \frac{1}{2^{\tilde{t}-1}} \cdot \frac{1}{2N}\left(1 - \frac{1}{2N}\right)^{\tilde{t}-1}$$
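To get a feel for the additive-model expression, the sketch below evaluates it numerically. It assumes the form $\lambda = 1 + F + (m-1)(\lambda_s - 1)\sum_{\tilde{t} \geq 1} 2^{-(\tilde{t}-1)} \frac{1}{2N}(1 - \frac{1}{2N})^{\tilde{t}-1}$; the function names, the truncation point `max_t`, and all parameter values are hypothetical and are not results from the talk.

```python
def inflation_factor(N, m, lambda_s, F=0.0, max_t=200):
    """Evaluate the additive-model inflation factor by summing over
    coalescence times t = 1 .. max_t (terms beyond that are negligible
    because each term shrinks by at least a factor of 2)."""
    q = 1.0 / (2.0 * N)  # per-generation coalescence probability for a pair
    total = 0.0
    for t in range(1, max_t + 1):
        p_coalesce = q * (1.0 - q) ** (t - 1)   # P[pair coalesces at generation t]
        total += p_coalesce / 2.0 ** (t - 1)    # weight 4 * kinship = 2^-(t-1)
    return 1.0 + F + (m - 1) * (lambda_s - 1.0) * total

def inflation_factor_closed_form(N, m, lambda_s, F=0.0):
    """Same quantity using the geometric-series sum 2 / (2N + 1)."""
    return 1.0 + F + (m - 1) * (lambda_s - 1.0) * 2.0 / (2.0 * N + 1.0)

# Hypothetical parameter values, chosen only to show the trend with N.
for N in (1_000, 100_000, 10_000_000):
    lam = inflation_factor(N=N, m=500, lambda_s=2.0)
    print(f"N={N:>10,}  lambda={lam:.5f}")
```

Because the sum collapses to $2/(2N+1)$, the excess over $1 + F$ is roughly $(m-1)(\lambda_s-1)/N$: it shrinks as the population grows relative to the sample.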

Simulations

Use Wright-Fisher forward simulation to assess analytical results:

Simulate 1,000 bi-allelic unlinked loci forward in time for 4N generations, with mutation parameter $\theta = 4N\mu = 1$. (†)

Choose a single locus with the desired disease allele frequency, and assign phenotypes to all members of the population under an additive genetic model.

Select $m$ cases and $m$ random controls, and use all non-disease loci to infer the inflation factor based on the mean of all tests.

(†) Because WF simulations are notoriously slow, we use a speed-up: simulate a smaller population with a proportionally higher mutation rate, and then rescale the population size and mutation rate to the desired levels.
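The first simulation step can be sketched as a toy Wright-Fisher forward simulation. This is not the authors' code: it assumes symmetric two-state mutation, a population that starts fixed for allele b, and selfing permitted, and it omits the rescaling speed-up, phenotype assignment, and case/control sampling. The name `wright_fisher` and all parameter values are hypothetical.

```python
import random

def wright_fisher(N, n_loci, mu, generations, seed=0):
    """Simulate N diploid individuals at n_loci unlinked bi-allelic loci.

    pop[i][a][l] is the allele (0 = b, 1 = B) at locus l on chromosome a
    of individual i. Returns the population after `generations` rounds.
    """
    rng = random.Random(seed)
    # Assumption: start with every chromosome carrying allele b everywhere.
    pop = [[[0] * n_loci for _ in range(2)] for _ in range(N)]
    for _ in range(generations):
        new_pop = []
        for _ in range(N):
            child = []
            # Each child draws two parents uniformly at random (with replacement).
            for parent in (rng.randrange(N), rng.randrange(N)):
                gamete = []
                for locus in range(n_loci):
                    # Unlinked loci: choose the parental chromosome
                    # independently at every locus.
                    allele = pop[parent][rng.randrange(2)][locus]
                    # Assumed symmetric mutation: flip the state with probability mu.
                    if rng.random() < mu:
                        allele = 1 - allele
                    gamete.append(allele)
                child.append(gamete)
            new_pop.append(child)
        pop = new_pop
    return pop

# Small illustrative run: theta = 4 * N * mu = 1, simulated for 4N generations.
N = 50
pop = wright_fisher(N=N, n_loci=10, mu=1.0 / (4 * N), generations=4 * N, seed=1)
freqs = [sum(ind[a][l] for ind in pop for a in range(2)) / (2 * N) for l in range(10)]
print("B allele frequency per locus:", [round(f, 2) for f in freqs])
```

Tracking individuals rather than only allele frequencies is what lets relatedness among sampled cases arise from the genealogy itself, which is the effect the analytical results are meant to capture.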

Simulation Results

95% central interval about the mean was at least .001 in each case.

“Tautological” Hutterite Analysis

Quick note on the Hutterites: a 13,000-member pedigree where the genealogy is known, with ~800 members phenotyped/genotyped at many markers across the genome.

Target (for each phenotype):
a. Estimate coalescent probabilities for cases and random controls based on the genealogy (“allele-walking” simulations).
b. Calculate the inflation factor ($\lambda$) for each phenotype, and compare to the analytic prediction.

Note increased probabilities in cases over random controls for recent coalescent times.


Empirical $\lambda$'s in a Founder Population

The inbreeding coefficient (F) was estimated at .048 and was included in the calculation.

Summary

We modeled cryptic relatedness using population-based processes. Surprisingly, these expressions are functions of directly observable parameters (population size, sample size, and the genetic model parameterized by $\lambda_r$).

Our analytical results indicate that increased false positives due to cryptic relatedness will usually be negligible for outbred populations.

We applied our technique to a founder population as an example. For six different phenotypes we found evidence for inflation, which matched analytic predictions.

Acknowledgements

JK Pritchard and NJ Cox (thesis advisors)

Carole Ober (access to the empirical data)

$/£: NIH, NIH/NIGMS Genetics Training Grant

In the bar at the conference during the week

Fine, name that tune: from memory, recite

of the first 1677 words of Kingman’s 1982 paper and I’ll get the next round.