1 Design and Analysis of Genome-Wide Association Studies Tasha E. Fingerlin Departments of Epidemiology and Biostatistics & Informatics Colorado School of Public Health University of Colorado Denver Workshop on Statistical Genetics and Genomics February 12, 2009

Genetic Association Studies


Page 1: Genetic Association Studies

1

Design and Analysis of Genome-Wide Association Studies

Tasha E. Fingerlin

Departments of Epidemiology and Biostatistics & Informatics

Colorado School of Public Health

University of Colorado Denver

Workshop on Statistical Genetics and Genomics

February 12, 2009

Page 2: Genetic Association Studies

2

Today

• Goal of a genetic association study

• Rationale for genome-wide association studies

• Design and analysis considerations for GWAs

• Application to two clinically similar granulomatous lung diseases

Page 3: Genetic Association Studies

3

Complex Traits - Multifactorial Inheritance

• Examples

– Some cancers

– Type 1 diabetes

– Type 2 diabetes

– Alzheimer disease

– Inflammatory bowel disease

– Schizophrenia

– Cleft lip/palate

– Hypertension

– Rheumatoid arthritis

– Asthma

[Diagram: Genetic Variants and Non-genetic factors both lead to Trait]

Page 4: Genetic Association Studies

4

Genetic Association Studies

• Short-term Goal: Identify genetic variants that explain differences in phenotype among individuals in a study population

– Qualitative: disease status, presence/absence of congenital defect

– Quantitative: blood glucose levels, % body fat

• If association found, then further study can follow to

– Understand mechanism of action and disease etiology in individuals

– Characterize relevance and/or impact in more general population

• Long-term goal: to inform process of identifying and delivering better prevention and treatment strategies

Page 5: Genetic Association Studies

5

DNA Variation

• >99.9% of the sequence is identical between any two chromosomes.

- Compare maternal and paternal chromosome 1 in single person

- Compare Y chromosomes between two unrelated males

• Even though most of the sequence is identical between two chromosomes, since the genome sequence is so long (~3 billion base pairs), there are still many variations.

• Some DNA variations are responsible for biological changes, others have no known function.

• Alleles are the alternative forms of a DNA segment at a given genetic location.

• Genetic polymorphism: DNA segment with at least 2 common alleles.

Page 6: Genetic Association Studies

6

Single Nucleotide Polymorphisms: SNPs

• SNPs – DNA sequence variations that occur when a single nucleotide is altered

A T G A C A G G C

A T G A C A T G C

• Alleles at this SNP are “G” and “T”

• SNPs are the most common form of variation in the human genome

• SNPs are catalogued in several databases

Page 7: Genetic Association Studies

7

Genotypes and Haplotypes

• Genotype: pair of alleles (one paternal, one maternal) at a locus

Maternal: A T G A C A G G C

Paternal: A T G A C A T G C

Genotype for this individual is GT

• Haplotype: sequence of alleles along a single chromosome

Maternal: A T G C C A T G C

Paternal: A T G A C A T G C

Genotypes for this individual (vertical): CA and TT

Haplotypes (horizontal): CT and AT
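The genotype/haplotype distinction on this slide can be sketched in a few lines of Python. This is only an illustration: the two sequences are the ones shown on the slide, the assignment of the first sequence to the maternal chromosome is an assumption, and the function name and 0-based SNP positions are hypothetical.

```python
# Sequences from the slide; which one is maternal is assumed, not stated.
maternal = "ATGCCATGC"
paternal = "ATGACATGC"

# Hypothetical 0-based positions of the two SNPs (4th and 7th bases).
snp_positions = [3, 6]


def genotypes_and_haplotypes(mat, pat, positions):
    """Return per-SNP genotypes (read vertically) and per-chromosome haplotypes (read horizontally)."""
    genotypes = [mat[i] + pat[i] for i in positions]
    haplotypes = (
        "".join(mat[i] for i in positions),
        "".join(pat[i] for i in positions),
    )
    return genotypes, haplotypes


genotypes, haplotypes = genotypes_and_haplotypes(maternal, paternal, snp_positions)
print(genotypes)   # ['CA', 'TT']
print(haplotypes)  # ('CT', 'AT')
```

Note that unphased genotype data only give the vertical reading; recovering the horizontal haplotypes generally requires family data or statistical phasing.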

Page 8: Genetic Association Studies

8

Scope of a Genetic Association Study

• Candidate gene

– Known functional variants

– Variants with unknown function in exons, introns, regulatory regions

• Linkage candidate region

– Functional variants, or those with unknown function in candidate genes

– More general coverage of region using many markers

• Genome-wide

– Test for association with hundreds of thousands (millions) of SNPs spread across the entire genome.

– Many design strategies possible for distributing markers

* Sabeti PC et al. (2002). Nature 419: 832-837

Page 9: Genetic Association Studies

9

Genome-Wide Association Studies

Rationale:

• Linkage analysis using families takes unbiased look at whole genome, but is underpowered for the size of genetic effects we expect to see for many complex genetic traits.

• Candidate gene association studies have greater power to identify smaller genetic effects, but rely on a priori knowledge about disease etiology.

• Genome-wide association studies combine the genomic coverage of linkage analysis with the power of association to have much better chance of finding complex trait susceptibility variants.

Page 10: Genetic Association Studies

10

Why are They Possible Now?

Genotyping Technology:

• Now have ability to type hundreds of thousands (or millions) of SNPs in one reaction on a “SNP chip.” The cost can be as low as $200-$300 per person.

• Two primary platforms: Affymetrix and Illumina.

Design and analysis:

• Availability of SNP databases, HapMap, and other resources to identify the SNPs and design SNP chips.

• Faster computers to carry out the millions of calculations make implementation possible.

Page 11: Genetic Association Studies

11

Design and Analysis Strategies: Moving Target

• A genetic factor is like any other potential risk factor and the same study design and analysis principles hold – in addition to those specific to GWAs.

• Standard case-control (matched or unmatched), cohort-based quantitative trait and longitudinal designs are common.

• In what follows, I will talk about current ideas and methods, with a focus on assumptions and quality control.

• Focus today is on case-control design, but many of the principles apply to other designs.

Page 12: Genetic Association Studies

12

SNP Chips: Number and Placement of SNPs

• A “typical” SNP chip has at least 317,000 SNPs distributed across the genome. Newest: ~1 million.

• The newest chips can also measure (directly or indirectly) some types of copy number variation.

• We do not directly measure genotypes at all genetic polymorphisms, but rely on association between the polymorphisms we do assay and those which we do not assay.

• SNP-SNP association, or linkage disequilibrium, is fundamental to our ability to sample the whole genome with relatively few SNPs.

Page 13: Genetic Association Studies

13

Linkage Disequilibrium (LD)

• Linkage disequilibrium: the non-random association of alleles at linked loci.

• A measure of the tendency of some alleles to be inherited together on haplotypes descended from ancestral chromosomes.

• If these were the only two haplotypes in the population, then alleles G and A (and likewise C and T) would be in perfect linkage disequilibrium.

• If we genotype the first SNP, we know what the alleles are at the second SNP.

A T G A C A A G C

A T C A C A T G C
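As a rough illustration of how LD between two SNPs is usually quantified, the sketch below computes D, D' and r² from counts of the four possible two-SNP haplotypes. The slide does not name these measures; they are standard ones, and the function name, argument names, and counts are hypothetical.

```python
def ld_measures(n_AB, n_Ab, n_aB, n_ab):
    """Compute D, D' and r^2 from counts of the four two-SNP haplotypes.

    A/a are the alleles at the first SNP, B/b the alleles at the second.
    """
    total = n_AB + n_Ab + n_aB + n_ab
    p_AB = n_AB / total
    p_A = (n_AB + n_Ab) / total
    p_B = (n_AB + n_aB) / total

    # D: how far the haplotype frequency departs from independence.
    d = p_AB - p_A * p_B

    # r^2: squared correlation between alleles at the two SNPs.
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    r2 = d * d / denom if denom > 0 else float("nan")

    # D': D scaled by its maximum possible magnitude given the allele frequencies.
    if d >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    d_prime = abs(d) / d_max if d_max > 0 else float("nan")
    return d, d_prime, r2


# Only G-A and C-T haplotypes observed, as in the slide: perfect LD (D' and r^2 both equal 1, up to rounding).
print(ld_measures(n_AB=60, n_Ab=0, n_aB=0, n_ab=40))
```

When r² is 1, genotyping the first SNP tells you the allele at the second, which is the property SNP chips rely on.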

Page 14: Genetic Association Studies

14

• In general, LD between two SNPs decreases with physical distance

• Extent of LD varies greatly depending on region of genome

• If LD strong, need fewer SNPs to capture variation in a region

Page 15: Genetic Association Studies

15

www.hapmap.org

Page 16: Genetic Association Studies

16

HapMap

• Multi-country effort to identify and catalog common human genetic variants.

• Developed to better understand and catalogue LD patterns across the genome in several populations.

• Genotyped ~4 million SNPs on samples of African, East Asian, and European ancestry.

• All genotype data in a publicly available database.

• Can download the genotype data

– Able to examine LD patterns across genome

– Can estimate approximate coverage of a given SNP chip

• Can represent 80-90% of common SNPs with

– ~300,000 tag SNPs for European or Asian samples

– ~500,000 tag SNPs for African samples
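One way to see how a modest number of tag SNPs can represent many more is a greedy selection over pairwise r² values. The sketch below is a simplified illustration, not the procedure used by HapMap or by any chip manufacturer; the function name, the r² threshold of 0.8, and the toy matrix are assumptions.

```python
def greedy_tag_snps(r2, threshold=0.8):
    """Greedily pick tag SNPs so every SNP has r^2 >= threshold with some tag.

    r2 is a symmetric matrix (list of lists) of pairwise r^2 values with 1.0 on the diagonal.
    """
    untagged = set(range(len(r2)))
    tags = []
    while untagged:
        # Choose the untagged SNP that captures the most still-untagged SNPs.
        best = max(
            sorted(untagged),
            key=lambda i: sum(1 for j in untagged if r2[i][j] >= threshold),
        )
        tags.append(best)
        untagged -= {j for j in untagged if r2[best][j] >= threshold}
    return tags


# Toy example: SNPs 0-2 are in strong LD with each other, SNP 3 stands alone.
toy_r2 = [
    [1.00, 0.90, 0.85, 0.10],
    [0.90, 1.00, 0.90, 0.10],
    [0.85, 0.90, 1.00, 0.05],
    [0.10, 0.10, 0.05, 1.00],
]
print(greedy_tag_snps(toy_r2))  # [0, 3]
```

In regions of weaker LD, fewer SNPs are captured per tag, which is consistent with more tag SNPs being needed for African samples.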

Page 17: Genetic Association Studies

17

Case and Control Selection

• Case and control samples may be population-based

• Cases and controls may be chosen to increase magnitude of contrast

Case sample may be selected to be enriched for predisposing variant(s)

- Family history

- Early age of onset

- Increased severity of disease

Control sample may be selected to be “very healthy” or “super controls”

- E.g. for type 2 diabetes, may select individuals who have normal response to glucose at age 70

- Control selection is just as important (and tricky) as for any case-control study.

Page 18: Genetic Association Studies

18

Testing for Genetic Association with Disease

• Question of interest: Are the alleles or genotypes at a genetic marker associated with disease status?

• Use the usual statistical machinery to get estimates of measures of association and to test for association for each of the SNPs.

• One typical approach: Test for association between having 0, 1 or 2 copies of rare allele at a SNP using Cochran-Armitage test for trend.

Pearson, T. A. et al. JAMA 2008;299:1335-1344.
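A minimal sketch of the Cochran-Armitage trend test for a single SNP, using only the Python standard library. It assumes the usual additive weights (0, 1, 2) for the number of copies of the allele and a two-sided p-value from the normal approximation; the function name and the counts in the example are hypothetical.

```python
import math


def cochran_armitage_trend(cases, controls, weights=(0, 1, 2)):
    """Trend test on a 2 x 3 table of genotype counts.

    cases[i] and controls[i] are the numbers of cases and controls carrying
    i copies of the allele. Returns (chi-square statistic with 1 df, p-value).
    """
    totals = [r + s for r, s in zip(cases, controls)]
    n = sum(totals)
    n_cases = sum(cases)
    n_controls = sum(controls)

    sum_w_cases = sum(w * r for w, r in zip(weights, cases))
    sum_w_totals = sum(w * c for w, c in zip(weights, totals))
    sum_w2_totals = sum(w * w * c for w, c in zip(weights, totals))

    # Trend statistic and its variance under the null hypothesis of no association.
    t_stat = n * sum_w_cases - n_cases * sum_w_totals
    variance = (n_cases * n_controls / n) * (n * sum_w2_totals - sum_w_totals ** 2)
    if variance == 0:
        return 0.0, 1.0

    z = t_stat / math.sqrt(variance)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z * z, p_value


# Hypothetical counts for 0, 1, 2 copies of the allele.
chi2, p = cochran_armitage_trend(cases=[10, 20, 30], controls=[30, 20, 10])
print(chi2, p)
```

In a GWA study this test would be repeated for every SNP on the chip, which is what creates the multiple-testing problem discussed later.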

Page 19: Genetic Association Studies

19

Interpreting the Statistical Results

• Testing for association at each of hundreds of thousands of markers dictates that traditional statistical significance thresholds (e.g. α = 0.05) are not appropriate.

• That aside (more in a few minutes), if you identify a SNP that is significantly associated with disease, there are three possibilities:

– There is a causal relationship between SNP and disease

– The marker is in linkage disequilibrium with a causal locus

– False positive

• Many potential sources of systematic errors that might lead to false positive results.

• Genotyping quality control issues particularly important.

Page 20: Genetic Association Studies

20

Confounding by Ancestry (a.k.a. Population Stratification)

• Control selection critical as always

• Confounding by ancestry: Distortion of the relationship between the genetic risk factor and the outcome of interest due to ancestry that is related to both the frequency of the putative genetic risk factor and whether or not subject is a case or a control.

[Diagram: Ancestry is related to both Genetic Risk Factor and Case/Control Status]

Page 21: Genetic Association Studies

21

Population Stratification

• Distribution of genotypes differs between cases and controls

• Might conclude that allele A (or genotype AA) related to disease

[Figure: distribution of genotypes (TT, AT, AA) in cases and in controls]

Page 22: Genetic Association Studies

22

Population Stratification

• If cases and controls not well-matched ancestrally

– Unequal distribution of non-disease-related alleles between cases and controls

– Any allele more common in population with increased risk of disease may appear to be associated with disease

[Figure: distribution of genotypes (TT, AT, AA) in cases and in controls, each made up of Pop 1 and Pop 2]
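A small worked example can make the confounding concrete. All allele counts below are invented for illustration: within each population the A allele has the same frequency in cases and controls, but Pop 1 has both a higher A frequency and a higher share of cases, so pooling the two produces an apparent association.

```python
def odds_ratio(case_a, case_t, control_a, control_t):
    """Allelic odds ratio for allele A versus allele T."""
    return (case_a * control_t) / (case_t * control_a)


# Hypothetical allele counts. A frequency is 0.8 in Pop 1 and 0.2 in Pop 2,
# identical in cases and controls within each population.
pop1 = dict(case_a=160, case_t=40, control_a=80, control_t=20)
pop2 = dict(case_a=20, case_t=80, control_a=80, control_t=320)

# Ignoring ancestry: simply add the two populations together.
pooled = {key: pop1[key] + pop2[key] for key in pop1}

print(odds_ratio(**pop1))    # 1.0
print(odds_ratio(**pop2))    # 1.0
print(odds_ratio(**pooled))  # 3.1875
```

The pooled odds ratio above 3 arises entirely from the mix of ancestries in cases versus controls, not from any effect of the allele on disease.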

Page 23: Genetic Association Studies

23

Population Stratification

• Unequal distribution of alleles may result from

– Sample made up of more than one distinct population

– Sample made up of individuals with differing levels of admixture

Parra et al. AJHG 63:1839, 1998

Page 24: Genetic Association Studies

24

Using the GWA Data to Avoid Population Stratification

• Several options exist to allow controlling for ancestry using markers across the genome

• All based on idea that stratification should exist across the genome, and that we can use the information on the genome-wide markers to

- estimate ancestry groups, remove extreme outliers, control for other variation

- estimate inflation of test statistic and adjust all test statistics

• In each case, assumes constant effect of ancestry, which may or may not be appropriate

• Bottom line is that with a genome's worth of data, we can do a very good job of understanding the potential for, and minimizing the impact of, population substructure.
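The "estimate inflation of test statistic and adjust all test statistics" option is commonly implemented as genomic control. The sketch below assumes 1-df chi-square statistics, estimates the inflation factor from their median, and divides every statistic by it. The function name is hypothetical, and never deflating (forcing the factor to be at least 1) is a common convention assumed here, not something stated in the slides.

```python
from statistics import median

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_control(chi2_stats):
    """Return (inflation factor lambda, adjusted statistics).

    Assumes most SNPs are not associated with disease, so the median
    statistic reflects inflation rather than true signal.
    """
    lam = median(chi2_stats) / CHI2_1DF_MEDIAN
    lam = max(lam, 1.0)  # assumed convention: do not inflate statistics when lambda < 1
    return lam, [stat / lam for stat in chi2_stats]


# Toy input; a real analysis would use the statistics from all SNPs genome-wide.
lam, adjusted = genomic_control([0.1, 0.9098728, 5.0])
print(lam, adjusted)
```

Dividing every statistic by a single factor is exactly the "constant effect of ancestry" assumption noted above.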

Page 25: Genetic Association Studies

25

Potential Solutions to Multiple Testing Issue

• Bonferroni correction

– Assume all tests performed are independent

– Estimate number of independent polymorphisms in genome

– Threshold often considered appropriate: 5 × 10^-8.

• Other less conservative allocation of experiment-wide α over the genome

– Perhaps spend more on linkage regions or for SNPs in coding regions of genes

• Permutation

– Implementation for case-control study: permute case and control status, perform all tests, and record the most significant p-value among those tests; then re-permute case-control status and test again. Repeat many times.

– P-value for most significant test is the proportion of permutations that had a “best” p-value as small or smaller than the one you observe with the observed data (the data with the right case and control labels).
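The permutation procedure can be sketched as follows. As a simplification, each SNP is scored with N·r² between allele count and case status (a 1-df statistic equivalent to the trend test), so the largest statistic corresponds to the smallest p-value. The function names, the seed, and the toy data are hypothetical.

```python
import random


def trend_stat(genotypes, status):
    """N * r^2 between allele counts (0/1/2) and case status (1 = case, 0 = control)."""
    n = len(status)
    mean_g = sum(genotypes) / n
    mean_s = sum(status) / n
    var_g = sum((g - mean_g) ** 2 for g in genotypes)
    var_s = sum((s - mean_s) ** 2 for s in status)
    if var_g == 0 or var_s == 0:
        return 0.0
    cov = sum((g - mean_g) * (s - mean_s) for g, s in zip(genotypes, status))
    return n * cov * cov / (var_g * var_s)


def best_stat(snps, status):
    """Most significant (largest) statistic across all SNPs."""
    return max(trend_stat(genotypes, status) for genotypes in snps)


def permutation_pvalue(snps, status, n_perm=1000, seed=0):
    """Proportion of permutations whose best statistic is at least as extreme as the observed one."""
    rng = random.Random(seed)
    observed = best_stat(snps, status)
    labels = list(status)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)  # permute case/control status, keep genotypes fixed
        if best_stat(snps, labels) >= observed:
            hits += 1
    return hits / n_perm


# Toy data: 10 cases then 10 controls, two SNPs.
status = [1] * 10 + [0] * 10
snps = [[2] * 10 + [0] * 10, [0, 1] * 10]
print(permutation_pvalue(snps, status, n_perm=200, seed=1))
```

Because genotypes are kept intact and only the labels are shuffled, the LD between SNPs is preserved in every permutation, which is why this approach is less conservative than Bonferroni when tests are correlated.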

Page 26: Genetic Association Studies

26

Q-Q Plot

• If points deviate (significantly?) from the line of equality, this indicates that the two distributions are different.

• Some will take point at which the observed p-values differ from the expected as the point to declare statistical significance.

Important points:

• Can have deviation from line that is indicative of violated assumptions (e.g. existence of population stratification)

• In tails of distribution, have less information, and so might require large divergence from expected

Figure from: G. Abecasis
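The coordinates behind such a plot are straightforward to compute. The sketch below pairs each sorted observed p-value with the value expected under the null hypothesis of uniformly distributed p-values, both on the -log10 scale. The function name and the plotting position (rank - 0.5)/n are assumptions, and all p-values are assumed to be strictly positive.

```python
import math


def qq_points(p_values):
    """Return (expected, observed) pairs of -log10 p-values, most significant first."""
    n = len(p_values)
    points = []
    for rank, p in enumerate(sorted(p_values), start=1):
        expected = (rank - 0.5) / n  # expected p-value for this rank under the null
        points.append((-math.log10(expected), -math.log10(p)))
    return points


# Toy input; a real plot would use the p-values from every SNP tested.
for expected, observed in qq_points([0.5, 0.05, 0.005]):
    print(round(expected, 3), round(observed, 3))
```

Points lying above the line observed = expected across the whole range suggest systematic inflation, whereas departure only in the extreme tail is what a small number of true associations would look like.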

Page 27: Genetic Association Studies

27

Using Multiple Samples

Rationale: Given the very large number of tests performed, use multiple samples as a way to reduce the expected number of false positive results at the end of the study.

• Split-sample

Approach: Rather than testing entire sample on entire genome, test for association with some proportion of your samples and then test some proportion of those markers in the rest of your samples*.

• Independent samples

Approach: Rather than split your own sample, use another independent sample.

• In either case, can dramatically reduce number of false positive results while maintaining power.
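A back-of-the-envelope calculation shows why a second sample cuts false positives so sharply. The sketch assumes every marker is truly unassociated and that the two stages are independent; the marker count and both significance levels are hypothetical numbers chosen only for illustration.

```python
def expected_false_positives(n_markers, alpha_stage1, alpha_stage2=None):
    """Expected number of null markers declared significant.

    With one stage, n_markers * alpha_stage1 pass. With a second,
    independent sample, only those markers are retested at alpha_stage2.
    """
    carried_forward = n_markers * alpha_stage1
    if alpha_stage2 is None:
        return carried_forward
    return carried_forward * alpha_stage2


# Hypothetical design: 500,000 SNPs, stage 1 at 0.01, stage 2 at 0.001.
print(expected_false_positives(500_000, 0.01))         # about 5000 markers carried into stage 2
print(expected_false_positives(500_000, 0.01, 0.001))  # about 5 expected false positives overall
```

A truly associated SNP with reasonable power in both stages is far more likely to survive both tests than a null SNP, which is how power can be largely maintained.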

Page 28: Genetic Association Studies

28

Granulomatous Lung Diseases

• Chronic Beryllium Disease (CBD)

Exposure to beryllium results in formation of granulomas in lung among some individuals

• Sarcoidosis

Unknown exposure(s) result in granuloma formation and inflammation in lung, but other organs often involved

Page 29: Genetic Association Studies

29

Hypothesis

Sarcoidosis and CBD share genetic factors important in their similar granulomatous inflammatory pathways

CBD

Sarcoidosis

Disease Severity

Disease Risk

Page 30: Genetic Association Studies

30

GWA : Preliminary data

Page 31: Genetic Association Studies

31

Top Region for CBD: p = 10^-11

Page 32: Genetic Association Studies

32

Region Shared by both CBD and Sarcoidosis on 8p23.2: p = 10^-2 to 10^-4

Page 33: Genetic Association Studies

33

Other Important Topics of Present and Future

• Imputation

• Careful consideration of non-genetic factors

• Investigation of interactions: gene-environment and gene-gene

• Sequencing data

Page 34: Genetic Association Studies

34

Acknowledgements

Wake Forest University

Carl D. Langefeld, PhD

University of Michigan

Michael Boehnke, PhD

Goncalo R. Abecasis, PhD

National Jewish Health

Lisa Maier, MPH, MD

Lori Silveira, MS