Statistical Issues in Genetic Association Studies Eleanor Feingold, Ph.D. University of Pittsburgh March, 2011

Page 1: Statistical Issues in Genetic Association Studies

Statistical Issues in Genetic Association Studies

Eleanor Feingold, Ph.D.

University of Pittsburgh

March, 2011

Page 2: Statistical Issues in Genetic Association Studies

Underlying Principle of Genetic Mapping

People who have similar traits (phenotypes) should have greater than expected sharing of genetic material near the genes that influence those traits.

Page 3: Statistical Issues in Genetic Association Studies

Basic study designs for gene mapping

families

unrelated individuals

Page 4: Statistical Issues in Genetic Association Studies

Basic study designs for gene mapping

families: linkage analysis (or association)

unrelated individuals: association analysis

Page 5: Statistical Issues in Genetic Association Studies

Basic study designs for gene mapping

families: linkage analysis (or association)

semi-related individuals in an inbred population: ?

unrelated individuals: association analysis

Page 6: Statistical Issues in Genetic Association Studies

Basic study designs for gene mapping

families: linkage analysis (or family-based association)

semi-related individuals in an inbred population: ?

unrelated individuals: association analysis

Page 7: Statistical Issues in Genetic Association Studies

Association analysis (circa 2000)

1) Collect cases and controls.

2) Genotype everyone at a marker.

[Figure: individuals labeled with their genotypes (AA, Aa, or aa).]

3) Test genotype/phenotype association.

            AA    Aa    aa
cases       65   133   202
controls    16    81   316

4) Call it a day and go out for a beer with your co-investigators.
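Step 3 can be sketched in a few lines. The block below runs a Pearson chi-squared test of independence on the 2 x 3 table from this slide. The function name is made up for illustration, and the p-value uses the closed form exp(-x/2), which is valid only for 2 degrees of freedom.

```python
import math

def chi_squared_2x3(cases, controls):
    """Pearson chi-squared test of independence for a 2 x 3 genotype table.

    cases and controls are (AA, Aa, aa) counts. Hypothetical helper for
    illustration only.
    """
    rows = [list(cases), list(controls)]
    row_totals = [sum(row) for row in rows]
    col_totals = [sum(col) for col in zip(*rows)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(rows):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    # With 2 degrees of freedom the chi-squared tail probability is exp(-x/2).
    p_value = math.exp(-stat / 2)
    return stat, p_value

# Counts taken from the table on this slide.
stat, p_value = chi_squared_2x3(cases=(65, 133, 202), controls=(16, 81, 316))
print(f"chi-squared = {stat:.2f}, p = {p_value:.2e}")
```

With small cell counts, Fisher's exact test would be the safer choice.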

Page 8: Statistical Issues in Genetic Association Studies

GWAS Study circa 2010

1) Collect cases and controls.

2) Genotype everyone at a marker.

[Figure: individuals labeled with their genotypes (AA, Aa, or aa).]

3) Test genotype/phenotype association.

            AA    Aa    aa
cases       65   133   202
controls    16    81   316

4) Call it a day and go out for a beer with your co-investigators.

Repeat 1,000,000 times!

Page 9: Statistical Issues in Genetic Association Studies

So what’s the BIG DEAL?

Well, not much, until you get into

1) the complexities of array data, and

2) the real science of genetics.

Page 10: Statistical Issues in Genetic Association Studies

One important genetic subtlety

Even in a GWAS, we can’t test every variant in the genome. So

1) at the design phase, we have to pick markers (SNPs) that we hope will “cover” as well as possible, and

2) at the testing phase, we do not expect that the marker we are testing is actually the “causal variant” - we are usually hoping (at best) that it is correlated with the true causal genetic variable.

Page 11: Statistical Issues in Genetic Association Studies
Page 12: Statistical Issues in Genetic Association Studies

Gene in here somewhere

Page 13: Statistical Issues in Genetic Association Studies

Gene in here somewhere

Page 14: Statistical Issues in Genetic Association Studies

After many generations ...

Page 15: Statistical Issues in Genetic Association Studies

Within a population, genotypes at nearby SNPs are correlated due to population history.

This correlation is called linkage disequilibrium.
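One common way to put a number on this correlation is r², the squared correlation between genotypes at two SNPs. The sketch below assumes genotypes are coded as 0/1/2 copies of one allele and computes r² directly from genotype vectors. That approximates the haplotype-based r² usually reported by LD software. The function name and toy data are invented for illustration.

```python
def ld_r_squared(snp1, snp2):
    """Squared Pearson correlation between two genotype vectors coded 0/1/2.

    Assumes both SNPs are polymorphic in the sample (nonzero variance).
    """
    n = len(snp1)
    mean1 = sum(snp1) / n
    mean2 = sum(snp2) / n
    cov = sum((a - mean1) * (b - mean2) for a, b in zip(snp1, snp2))
    var1 = sum((a - mean1) ** 2 for a in snp1)
    var2 = sum((b - mean2) ** 2 for b in snp2)
    return cov ** 2 / (var1 * var2)

# Toy genotypes for eight people (made up, not real data).
snp_a = [0, 1, 2, 1, 0, 2, 1, 0]
snp_b = [0, 1, 2, 1, 0, 2, 0, 0]   # nearly identical to snp_a: high LD
snp_c = [1, 0, 1, 2, 0, 1, 1, 2]   # unrelated pattern: low LD

print(ld_r_squared(snp_a, snp_b))
print(ld_r_squared(snp_a, snp_c))
```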

Page 16: Statistical Issues in Genetic Association Studies

“Tag” SNPs

Find a set of SNPs that captures most information at least cost.

How? Find clusters of SNPs that are highly correlated and then choose one representative from each cluster to genotype.

Easily available, relatively idiot-proof software exists (e.g. Tagger).

Caveat 1: You need a database that knows lots of SNPs in your gene and has genotyped them in a fair number of people in the population you are studying (HapMap, SeattleSNPs).

Caveat 2: Beware of overly-aggressive “tagging.”
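The cluster-and-pick-a-representative idea can be sketched as a greedy loop. This is a simplified illustration, not Tagger's actual algorithm. The function name, the toy r² matrix, and the 0.8 threshold are all assumptions.

```python
def greedy_tag_snps(r2, threshold=0.8):
    """Greedily choose tag SNPs from a symmetric dict-of-dicts of pairwise r2.

    A SNP counts as "captured" by a tag if their r2 is at least threshold.
    Simplification: only still-untagged SNPs are considered as new tags.
    """
    def captured_by(snp, remaining):
        return {s for s in remaining if s == snp or r2[snp][s] >= threshold}

    untagged = set(r2)
    tags = []
    while untagged:
        # Pick the SNP that captures the most still-untagged SNPs.
        best = max(sorted(untagged), key=lambda s: len(captured_by(s, untagged)))
        tags.append(best)
        untagged -= captured_by(best, untagged)
    return tags

# Toy pairwise r2 values (made up): snp1-snp3 form one cluster, snp4 stands alone.
toy_r2 = {
    "snp1": {"snp1": 1.0, "snp2": 0.90, "snp3": 0.85, "snp4": 0.10},
    "snp2": {"snp1": 0.90, "snp2": 1.0, "snp3": 0.95, "snp4": 0.20},
    "snp3": {"snp1": 0.85, "snp2": 0.95, "snp3": 1.0, "snp4": 0.10},
    "snp4": {"snp1": 0.10, "snp2": 0.20, "snp3": 0.10, "snp4": 1.0},
}

print(greedy_tag_snps(toy_r2))                  # one tag per cluster
print(greedy_tag_snps(toy_r2, threshold=0.1))   # overly aggressive tagging
```

Dropping the threshold far enough lets a single SNP "tag" everything, which is the overly-aggressive tagging that Caveat 2 warns about.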

Page 17: Statistical Issues in Genetic Association Studies

Conventional association vs. candidate gene sequencing

GWAS (tag SNP) study

1) Cheaper - more genes and more people, so higher power.

2) Find only common variation.

3) Probably do not find functional variants.

Candidate gene sequencing study

1) Expensive - fewer genes and fewer people, so lower power overall.

2) Find both common and rare variation.

3) Find functional variants.

Page 18: Statistical Issues in Genetic Association Studies

GWAS Analysis

Genotype calling

Data cleaning

Single-SNP analysis

Other analyses

CNVs

Page 19: Statistical Issues in Genetic Association Studies

Genotype “calling”

[Figure: intensity cluster plot with genotype clusters labeled AA, AB, and BB.]

Generally done before you see the data.

But plenty of open questions about how to do it:

- best clustering methods?
- salvage data from messy clusters?

Page 20: Statistical Issues in Genetic Association Studies
Page 21: Statistical Issues in Genetic Association Studies

Data cleaning

Somewhat dependent on which chip you are using.

Throw out “bad” SNPs and “bad” samples. (% of genotypes “called” for each person and each SNP)

Hardy-Weinberg testing

Relationship testing

Find major chromosomal anomalies

Look for population stratification

Look for signs of systematic problems (e.g. allele frequencies differ by sample processing date).
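The call-rate filter is the easiest of these steps to sketch. The block below computes the fraction of genotypes called per sample and per SNP, then keeps whatever passes a cutoff. The function names, the toy matrix, and the 0.7 cutoff are illustrative assumptions. Real cutoffs depend on the chip and are usually much stricter.

```python
def call_rates(genotypes):
    """Return (per-sample, per-SNP) call rates.

    genotypes is a list of rows (one per sample), each a list of calls
    (one per SNP); None marks a genotype that could not be called.
    """
    n_samples = len(genotypes)
    n_snps = len(genotypes[0])
    per_sample = [sum(g is not None for g in row) / n_snps for row in genotypes]
    per_snp = [
        sum(row[j] is not None for row in genotypes) / n_samples
        for j in range(n_snps)
    ]
    return per_sample, per_snp

def passing(rates, min_call_rate):
    """Indices whose call rate meets the cutoff."""
    return [i for i, rate in enumerate(rates) if rate >= min_call_rate]

# Toy data: 4 samples x 4 SNPs (made up).
toy_genotypes = [
    [0, 1, 2, None],
    [1, 1, None, None],
    [2, 0, 1, None],
    [0, None, 1, 2],
]

per_sample, per_snp = call_rates(toy_genotypes)
print("samples kept:", passing(per_sample, 0.7))
print("SNPs kept:", passing(per_snp, 0.7))
```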

Page 22: Statistical Issues in Genetic Association Studies

Data cleaning examples

Page 23: Statistical Issues in Genetic Association Studies

Plate effect on missing call rate per sample

ANOVA p-value = 6e-48

But no significant association between plate and case status (p=0.20)

Page 24: Statistical Issues in Genetic Association Studies

Gender Check

Page 25: Statistical Issues in Genetic Association Studies

chromosomal anomalies

Page 26: Statistical Issues in Genetic Association Studies

Testing Hardy-Weinberg

Hardy-Weinberg Equilibrium (HWE) means that your three genotype groups occur in the expected p², 2pq, q² proportions.

Departure from HWE most often indicates genotyping problems.

But it can also indicate an actual genetic effect.

(Check for case-control differences).

Do your HWE tests by ethnicity, but don’t expect admixed groups (Hispanics, African-Americans) to be in HWE.
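A minimal sketch of the HWE check is a one-degree-of-freedom chi-squared goodness-of-fit test. The function name is invented, the example counts are the control counts from the earlier case-control table, and the asymptotic test is an assumption of convenience. Exact tests are often preferred when some genotype counts are small.

```python
import math

def hwe_chi_squared(n_AA, n_Aa, n_aa):
    """Chi-squared goodness-of-fit test for Hardy-Weinberg proportions.

    Assumes both alleles are present (otherwise an expected count is zero).
    """
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)   # estimated frequency of allele A
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_AA, n_Aa, n_aa)
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Tail probability of a chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Control genotype counts from the earlier 2 x 3 table.
stat, p_value = hwe_chi_squared(16, 81, 316)
print(f"HWE chi-squared = {stat:.2f}, p = {p_value:.4f}")
```

Running the same function separately on cases and on controls is one way to do the case-control comparison suggested above.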

Page 27: Statistical Issues in Genetic Association Studies

HWE 10⁻⁴ < p < 0.5

Page 28: Statistical Issues in Genetic Association Studies

HWE p < 10⁻⁴

Page 29: Statistical Issues in Genetic Association Studies

population stratification via principal components

Page 30: Statistical Issues in Genetic Association Studies

Analysis

Page 31: Statistical Issues in Genetic Association Studies

Simple association test at every SNP

Case-control association test by allele ...

            A     a
case
control

2 x 2 table (Fisher’s exact test or chi-squared test)

And by genotype ...

            AA    Aa    aa
case
control

2 x 3 table (Fisher’s exact test or chi-squared test or Armitage trend test)
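Both one-degree-of-freedom tests fit in a short sketch. The allelic test collapses genotypes to a 2 x 2 allele-count table. The Armitage trend test scores genotypes by allele count (weights 2, 1, 0 here, which is an assumption) and is computed as N times the squared correlation between case status and genotype score. Function names are invented. The example counts come from the case-control table earlier in the talk.

```python
import math

def chi2_1df_p(stat):
    """Tail probability of a chi-squared distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2))

def allelic_test(cases, controls):
    """Chi-squared test on the 2 x 2 allele-count table.

    cases and controls are (AA, Aa, aa) genotype counts.
    """
    a = 2 * cases[0] + cases[1]          # A alleles in cases
    b = 2 * cases[2] + cases[1]          # a alleles in cases
    c = 2 * controls[0] + controls[1]    # A alleles in controls
    d = 2 * controls[2] + controls[1]    # a alleles in controls
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return stat, chi2_1df_p(stat)

def armitage_trend_test(cases, controls, weights=(2, 1, 0)):
    """Cochran-Armitage trend test, computed as N * r^2."""
    totals = [r + s for r, s in zip(cases, controls)]
    n_cases, n_controls = sum(cases), sum(controls)
    n = n_cases + n_controls
    sum_wr = sum(w * r for w, r in zip(weights, cases))
    sum_wn = sum(w * t for w, t in zip(weights, totals))
    sum_w2n = sum(w * w * t for w, t in zip(weights, totals))
    numerator = n * (n * sum_wr - n_cases * sum_wn) ** 2
    denominator = n_cases * n_controls * (n * sum_w2n - sum_wn ** 2)
    stat = numerator / denominator
    return stat, chi2_1df_p(stat)

cases, controls = (65, 133, 202), (16, 81, 316)
print("allelic:", allelic_test(cases, controls))
print("trend:  ", armitage_trend_test(cases, controls))
```

The allelic test treats the two alleles within a person as independent, which leans on HWE. The trend test does not need that assumption.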

Page 32: Statistical Issues in Genetic Association Studies

Or use logistic regression

Lets you incorporate other predictors (age, sex, diet, whatever).

G + E (genotype + environment model)

G + E + GxE (interaction model)

Page 33: Statistical Issues in Genetic Association Studies

GWAS results

Manhattan plot and

qq plot

Page 34: Statistical Issues in Genetic Association Studies

What’s the best single-SNP association test?

Not as “solved” a problem as you’d think.

If you knew the true model for the gene effect, you’d just fit that model. But you don’t.

So which tests are robust over lots of models?

Page 35: Statistical Issues in Genetic Association Studies

Chia-Ling Kuo’s work

[Figure: panels labeled MIN 2P, MIN 3P, and MIN 4P.]

Page 36: Statistical Issues in Genetic Association Studies

Scan with Covariates

• Which logistic regression model is best for testing GENETIC EFFECT?

– G: LR(G, NULL) ~ χ²(1)

– G+E: LR(G+E, E) ~ χ²(1)

– G+E+GE: LR(G+E+GE, E) ~ χ²(2)
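For reference, the models and the LR notation can be written out as follows. This assumes G is a single additive genotype score (0, 1, or 2 copies of an allele) and E is a single covariate, which is what makes the degrees of freedom come out to 1, 1, and 2. The symbols β and ℓ are introduced here for illustration.

```latex
\begin{align*}
\text{G:}      \quad & \operatorname{logit} P(\text{case}) = \beta_0 + \beta_G G \\
\text{G+E:}    \quad & \operatorname{logit} P(\text{case}) = \beta_0 + \beta_G G + \beta_E E \\
\text{G+E+GE:} \quad & \operatorname{logit} P(\text{case}) = \beta_0 + \beta_G G + \beta_E E + \beta_{GE}\, G E \\[6pt]
\mathrm{LR}(M_1, M_0) &= 2\left[\ell(M_1) - \ell(M_0)\right]
  \;\sim\; \chi^2(k) \text{ approximately, under the null,}
\end{align*}
```

Here ℓ(M) is the maximized log-likelihood of model M, and k is the number of extra parameters in M₁ relative to the nested model M₀: β_G alone for the first two tests, β_G and β_GE together for the third.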

Page 37: Statistical Issues in Genetic Association Studies

Results

1) Combination statistics (best of several statistics) are most robust, even after correction for multiple comparisons, but linear trend test is also a good choice.

2) To test for genetic effect, the G + E model is almost never advantageous. Just test G, or fit G + E + GxE if you’re pretty sure there’s an interaction. BIG CAVEAT: This assumes G and E are independent. If you are worried about confounding, you DO need to control for E when testing G.

Page 38: Statistical Issues in Genetic Association Studies

More generally, should you use the same statistics you used for a small-scale study?

Maybe not.

Problem

Need to worry about the statistical properties of the extreme values of the test statistics.

What do I mean?

• Statisticians develop tests that behave sensibly on average.

• But in genomic problems, we do 10,000 or 500,000 of the same test and then follow up the top 100 results.

• So we need test statistics for which the extreme values are well-behaved, not so much the averages.

Page 39: Statistical Issues in Genetic Association Studies

Example from expression arrays: “10,000 t-tests” analysis

• Compute t-statistic for each gene.

• Rank by absolute value of t-statistic.

Page 40: Statistical Issues in Genetic Association Studies

Problem

Ranked list is dominated by small-variance genes.

With a small sample size, the SE estimates are very poor.

If you estimate an SE poorly 10,000 times, some of the estimates will come out very small.

2 ways to get a large t-statistic

1) large difference between the means

2) small SE

Page 41: Statistical Issues in Genetic Association Studies

Solution

Shrinkage estimator!

(Add a fudge factor to the denominator of the t-statistic.)
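A toy sketch of the fudge-factor idea: the same two-sample t-statistic, with an optional constant added to the standard error. The function name, the data, and the value 0.5 are all made up for illustration. Published methods differ in how they choose the constant.

```python
import math
import statistics

def t_statistic(group1, group2, fudge=0.0):
    """Pooled-variance two-sample t-statistic with an optional fudge factor
    added to the standard error in the denominator."""
    n1, n2 = len(group1), len(group2)
    mean_diff = statistics.mean(group1) - statistics.mean(group2)
    pooled_var = (
        (n1 - 1) * statistics.variance(group1)
        + (n2 - 1) * statistics.variance(group2)
    ) / (n1 + n2 - 2)
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return mean_diff / (se + fudge)

# Gene with a tiny difference but a (probably underestimated) tiny SE.
tiny_var_gene = ([5.00, 5.01, 5.02], [4.97, 4.98, 4.99])
# Gene with a large difference and an ordinary SE.
big_effect_gene = ([8.0, 9.5, 11.0], [5.0, 6.5, 8.0])

for label, fudge in (("plain t", 0.0), ("shrunken t", 0.5)):
    t_tiny = t_statistic(*tiny_var_gene, fudge=fudge)
    t_big = t_statistic(*big_effect_gene, fudge=fudge)
    print(f"{label}: tiny-variance gene {t_tiny:.2f}, big-effect gene {t_big:.2f}")
```

Without the fudge factor the tiny-variance gene ranks first. With it, the gene with the large mean difference moves to the top.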

Page 42: Statistical Issues in Genetic Association Studies

Back to association studies ...

Whatever statistic you are using (1,000,000 times), you need to know the statistical behavior of the 1st - 50th highest order statistics, not the statistical behavior on average.

This issue has not really been dealt with in the association study literature.

Page 43: Statistical Issues in Genetic Association Studies

A few other open statistical issues

Page 44: Statistical Issues in Genetic Association Studies

Multiple testing

The problem

If you do 1,000,000 tests, you will produce a lot of false positives.

The solution

There isn’t one!

• Be realistic about hypothesis generating vs. hypothesis testing.

• False discovery rate - controls the expected proportion of results on your list that are false positives.

• Permutation testing - controls for lots of correlated tests.
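Two of the standard corrections are short enough to sketch. The first is the Bonferroni threshold for 1,000,000 tests. The second is the Benjamini-Hochberg step-up procedure, one common way to control the false discovery rate (it assumes independent or positively dependent tests). Function names and the toy p-values are invented for illustration.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold that keeps the family-wise error at alpha."""
    return alpha / n_tests

def benjamini_hochberg(p_values, fdr=0.05):
    """Indices of the tests rejected by the Benjamini-Hochberg procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    n_reject = 0
    for rank, idx in enumerate(order, start=1):
        # Step-up rule: find the largest rank whose p-value is under its cutoff.
        if p_values[idx] <= rank * fdr / m:
            n_reject = rank
    return sorted(order[:n_reject])

print(bonferroni_threshold(0.05, 1_000_000))

# Toy p-values (made up).
toy_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(toy_p, fdr=0.05))
```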

Page 45: Statistical Issues in Genetic Association Studies

“Imputation” at untyped SNPs

The idea

Use the HapMap database to impute genotypes for your samples at all the SNPs in-between the ones you genotyped.

Do a test at each of those SNPs in addition to the typed ones.

Should increase overall study power even if multiple comparisons are correctly controlled for.

[Figure: haplotypes with a typed SNP and an untyped SNP marked.]

“blue” at typed SNP => “blue” at untyped one as well

Page 46: Statistical Issues in Genetic Association Studies

“Imputation” at untyped SNPs

The best thing

Allows joint analyses of datasets that were genotyped with different chips!

Limitations

Only helpful if correlation structure in HapMap is valid for your population.

Only helpful for SNPs in the database (contrast to haplotype analysis).

Open questions

• Best imputation methods in theory and practice?

• What populations should you base the imputation on?

• Imputed SNPs have different statistical properties (e.g. slightly higher variance). How do we account for that?

Page 47: Statistical Issues in Genetic Association Studies

Meta-analysis

Typical GWAS papers now combine results from many studies.

What are the best meta-analysis methods for doing this?

- What if same SNPs not typed in all studies?
- What if phenotype not measured the same way?
- What if some SNPs are imputed?
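As a baseline for these questions, here is a sketch of inverse-variance-weighted fixed-effect meta-analysis for one SNP. It is one common approach, not necessarily the best answer to any of the questions above. The function name and the per-study estimates are invented, and it assumes every study reports an effect estimate (e.g. a log odds ratio) and its standard error on the same scale.

```python
import math

def fixed_effect_meta(betas, standard_errors):
    """Inverse-variance-weighted fixed-effect combination of per-study estimates.

    Returns the pooled estimate, its standard error, and a two-sided
    normal-approximation p-value.
    """
    weights = [1 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    se = math.sqrt(1 / total_weight)
    z = beta / se
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return beta, se, p_value

# Toy per-study log odds ratios and standard errors (made up).
toy_betas = [0.20, 0.15, 0.30]
toy_ses = [0.10, 0.08, 0.15]

beta, se, p_value = fixed_effect_meta(toy_betas, toy_ses)
print(f"pooled beta = {beta:.3f}, SE = {se:.3f}, p = {p_value:.2e}")
```

A study that did not type the SNP simply drops out of the sum. A noisier imputed estimate is downweighted only to the extent that its standard error reflects the imputation uncertainty.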

Page 48: Statistical Issues in Genetic Association Studies

Software for genetic association studies

PLINK is the primary tool. Bioinformatics is incorporated.

There are some useful R packages as well.

Need R for fancier analyses – typically integrate it with PLINK.

Lots of new stuff constantly under development for large-scale data management and viewing – WGAViewer, LocusZoom

Lots of specialty packages for:

- HWE
- haplotype analysis
- family association
- other stuff