Statistical Genetics Workshop Tasha Fingerlin


  • 7/31/2019 Statistical Genetics Workshop Tasha Fingerlin

    Design and Analysis of Genome-Wide Association Studies

    Tasha E. Fingerlin

    Departments of Epidemiology and Biostatistics & Informatics

    Colorado School of Public Health

    University of Colorado Denver

    Workshop on Statistical Genetics and Genomics

    February 12, 2009

    Today

    Goal of a genetic association study

    Rationale for genome-wide association studies

    Design and analysis considerations for GWAs

    Application to two clinically similar granulomatous lung diseases

    Complex Traits - Multifactorial Inheritance

    Examples

    Some cancers - Schizophrenia

    Type 1 diabetes - Cleft lip/palate

    Type 2 diabetes - Hypertension

    Alzheimer disease - Rheumatoid arthritis

    Inflammatory bowel disease - Asthma

    [Diagram: genetic variants and non-genetic factors both contribute to the trait]

    Genetic Association Studies

    Short-term goal: Identify genetic variants that explain differences in phenotype among individuals in a study population

    - Qualitative: disease status, presence/absence of congenital defect

    - Quantitative: blood glucose levels, % body fat

    If an association is found, then further study can follow to

    - Understand mechanism of action and disease etiology in individuals

    - Characterize relevance and/or impact in a more general population

    Long-term goal: Inform the process of identifying and delivering better prevention and treatment strategies

    DNA Variation

    More than 99.9% of the sequence is identical between any two chromosomes.

    - Compare maternal and paternal chromosome 1 in a single person

    - Compare Y chromosomes between two unrelated males

    Even though most of the sequence is identical between two chromosomes, the genome sequence is so long (~3 billion base pairs) that there are still many variations.

    Some DNA variations are responsible for biological changes; others have no known function.

    Alleles are the alternative forms of a DNA segment at a given genetic location.

    Genetic polymorphism: DNA segment with at least 2 common alleles.
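
As a rough worked figure for the "still many variations" point, taking the 99.9% identity and the ~3 billion base pair length at face value (an illustrative upper bound, not a number from the slides):

```latex
3 \times 10^{9}\ \text{bp} \times (1 - 0.999) = 3 \times 10^{6}\ \text{bp}
```

So two copies of the genome can still differ at up to roughly 3 million positions.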

    Single Nucleotide Polymorphisms: SNPs

    SNPs: DNA sequence variations that occur when a single nucleotide is altered

    A T G A C A G G C

    A T G A C A T G C

    Alleles at this SNP are G and T

    SNPs are the most common form of variation in the human genome

    SNPs are catalogued in several databases

    Genotypes and Haplotypes

    Genotype: pair of alleles (one paternal, one maternal) at a locus

    Maternal: A T G A C A G G C

    Paternal: A T G A C A T G C

    Genotype for this individual is GT

    Haplotype: sequence of alleles along a single chromosome

    Maternal: A T G C C A T G C

    Paternal: A T G A C A T G C

    Genotypes for this individual (vertical): CA and TT

    Haplotypes (horizontal): CT and AT
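
The vertical/horizontal reading in the second example can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the SNP positions (4th and 7th base) are inferred from where the slide's sequences vary.

```python
# Two chromosomes from the second example on the slide (spaces removed).
maternal = "ATGCCATGC"
paternal = "ATGACATGC"

# Assumption: the two SNPs of interest sit at 0-based positions 3 and 6.
snp_positions = [3, 6]


def genotypes(mat, pat, positions):
    """Genotype at each SNP: maternal allele followed by paternal allele."""
    return [mat[i] + pat[i] for i in positions]


def haplotypes(mat, pat, positions):
    """Haplotype per chromosome: its alleles read across the SNP positions."""
    return ("".join(mat[i] for i in positions),
            "".join(pat[i] for i in positions))


print(genotypes(maternal, paternal, snp_positions))   # ['CA', 'TT']
print(haplotypes(maternal, paternal, snp_positions))  # ('CT', 'AT')
```

Genotypes read "down" the two chromosomes at one position; haplotypes read "along" one chromosome across positions.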

    Scope of a Genetic Association Study

    Candidate gene

    - Known functional variants

    - Variants with unknown function in exons, introns, regulatory regions

    Linkage candidate region

    - Functional variants, or those with unknown function in candidate genes

    - More general coverage of region using many markers

    Genome-wide

    - Test for association with hundreds of thousands (millions) of SNPs spread across the entire genome.

    - Many design strategies possible for distributing markers

    * Sabeti PC et al. (2002). Nature 419: 832-837

    Genome-Wide Association Studies

    Rationale:

    - Linkage analysis using families takes an unbiased look at the whole genome, but is underpowered for the size of genetic effects we expect to see for many complex genetic traits.

    - Candidate gene association studies have greater power to identify smaller genetic effects, but rely on a priori knowledge about disease etiology.

    - Genome-wide association studies combine the genomic coverage of linkage analysis with the power of association to have a much better chance of finding complex trait susceptibility variants.

    Why are They Possible Now?

    Genotyping technology:

    - Now have ability to type hundreds of thousands (or millions) of SNPs in one reaction on a SNP chip. The cost can be as low as $200-$300 per person.

    - Two primary platforms: Affymetrix and Illumina.

    Design and analysis:

    - Availability of SNP databases, HapMap, and other resources to identify the SNPs and design SNP chips.

    - Faster computers to carry out the millions of calculations make implementation possible.

    Design and Analysis Strategies: Moving Target

    A genetic factor is like any other potential risk factor, and the same study design and analysis principles hold in addition to those specific to GWAs.

    Standard case-control (matched or unmatched), cohort-based quantitative trait and longitudinal designs are common.

    In what follows, I will talk about current ideas and methods, with a focus on assumptions and quality control.

    Focus today is on case-control design, but many of the principles apply to other designs.

    SNP Chips: Number and Placement of SNPs

    A typical SNP chip has at least 317,000 SNPs distributed across the genome. Newest: ~1 million.

    The newest chips can also measure (directly or indirectly) some types of copy number variation.

    We do not directly measure genotypes at all genetic polymorphisms, but rely on association between the polymorphisms we do assay and those which we do not assay.

    SNP-SNP association, or linkage disequilibrium, is fundamental to our ability to sample the whole genome with relatively few SNPs.

    Linkage Disequilibrium (LD)

    Linkage disequilibrium: the non-random association of alleles at linked loci.

    A measure of the tendency of some alleles to be inherited together on haplotypes descended from ancestral chromosomes.

    A T G A C A A G C

    A T C A C A T G C

    If these were the only two haplotypes in the population, then alleles G and A (and likewise C and T) would be in perfect linkage disequilibrium.

    If we genotype the first SNP, we know what the alleles are at the second SNP.
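
To make "perfect linkage disequilibrium" concrete, here is a minimal sketch of the standard pairwise LD measures D, D' and r² computed from haplotype counts. The slides do not name a specific measure, so the choice of these three is an assumption, as are the function name and the haplotype counts (60/40 and 25 each), which are made up for illustration.

```python
from collections import Counter


def ld_stats(haplotypes, allele1, allele2):
    """Return (D, D', r^2) for allele1 at SNP 1 and allele2 at SNP 2.

    haplotypes: list of two-character strings; "GA" means allele G at the
    first SNP and allele A at the second SNP on the same chromosome.
    """
    n = len(haplotypes)
    counts = Counter(haplotypes)
    p1 = sum(c for h, c in counts.items() if h[0] == allele1) / n
    p2 = sum(c for h, c in counts.items() if h[1] == allele2) / n
    p12 = counts[allele1 + allele2] / n

    # D: observed haplotype frequency minus what independence would predict.
    d = p12 - p1 * p2
    if d >= 0:
        d_max = min(p1 * (1 - p2), (1 - p1) * p2)
    else:
        d_max = min(p1 * p2, (1 - p1) * (1 - p2))
    d_prime = d / d_max if d_max > 0 else 0.0  # signed D'
    denom = p1 * (1 - p1) * p2 * (1 - p2)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2


# Hypothetical population with only the two haplotypes from the slide.
perfect = ["GA"] * 60 + ["CT"] * 40
# Hypothetical population where all four haplotypes are equally common.
independent = ["GA", "GT", "CA", "CT"] * 25

print(ld_stats(perfect, "G", "A"))
print(ld_stats(independent, "G", "A"))
```

With only the GA and CT haplotypes present, D' and r² both come out at 1 (up to floating-point rounding); with all four haplotypes equally common, D and r² are 0.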

    In general, LD between two SNPs decreases with physical distance

    Extent of LD varies greatly depending on region of genome

    If LD strong, need fewer SNPs to capture variation in a region

    www.hapmap.org

    HapMap

    Multi-country effort to identify and catalog common human genetic variants.

    Developed to better understand and catalogue LD patterns across the genome in several populations.

    Genotyped ~4 million SNPs on samples of African, east Asian, and European ancestry.

    All genotype data are in a publicly available database.

    - Can download the genotype data

    - Able to examine LD patterns across genome

    - Can estimate approximate coverage of a given SNP chip

    Can represent 80-90% of common SNPs with

    - ~300,000 tag SNPs for European or Asian samples

    - ~500,000 tag SNPs for African samples
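
One way to read "approximate coverage of a given SNP chip" is the fraction of common SNPs that are either on the chip or strongly correlated with a SNP on the chip. The sketch below assumes that reading; the r² cutoff of 0.8, the function name, and the SNP labels are illustrative assumptions, not values taken from the slides.

```python
def chip_coverage(all_snps, chip_snps, r2, threshold=0.8):
    """Fraction of SNPs that are on the chip or tagged by a chip SNP.

    r2: dict mapping frozenset({snp_a, snp_b}) -> pairwise r^2 estimated
    from a reference panel; missing pairs are treated as r^2 = 0.
    """
    chip = set(chip_snps)
    covered = 0
    for snp in all_snps:
        tagged = any(r2.get(frozenset((snp, tag)), 0.0) >= threshold
                     for tag in chip)
        if snp in chip or tagged:
            covered += 1
    return covered / len(all_snps)


# Hypothetical toy region with four common SNPs and one SNP on the chip.
region = ["snpA", "snpB", "snpC", "snpD"]
pairwise_r2 = {
    frozenset(("snpA", "snpB")): 0.95,  # snpB is well tagged by snpA
    frozenset(("snpA", "snpC")): 0.40,  # snpC is only weakly correlated
}
print(chip_coverage(region, ["snpA"], pairwise_r2))  # 0.5
```

Here snpA covers itself and snpB, so two of the four SNPs are captured.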

    Case and Control Selection

    Case and control samples may be population-based

    Cases and controls may be chosen to increase magnitude of contrast

    Case sample may be selected to be enriched for predisposing variant(s)

    - Family history

    - Early age of onset

    - Increased severity of disease

    Control sample may be selected to be very healthy ("super controls")

    - E.g. for type 2 diabetes, may select individuals who have normal response to glucose at age 70

    - Control selection is just as important (and tricky) as for any case-control study.

    Testing for Genetic Association with Disease

    Question of interest: Are the alleles or genotypes at a genetic marker associated with disease status?

    Use the usual statistical machinery to get estimates of measures of association and to test for association for each of the SNPs.

    One typical approach: Test for association between disease status and having 0, 1 or 2 copies of the rare allele at a SNP using the Cochran-Armitage test for trend.

    Pearson, T. A. et al. JAMA 2008;299:1335-1344.
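
A minimal sketch of the Cochran-Armitage trend test for a single SNP, using the textbook formula with additive weights 0, 1, 2 and only the Python standard library. The genotype counts are invented for illustration, and the function name is hypothetical.

```python
import math


def cochran_armitage_trend(cases, controls, weights=(0, 1, 2)):
    """Trend test on a 2 x 3 table of genotype counts.

    cases, controls: counts of individuals carrying 0, 1 and 2 copies of
    the rare allele. Returns (chi-square statistic with 1 df, p-value).
    """
    r_total = sum(cases)
    s_total = sum(controls)
    n_total = r_total + s_total
    cols = [r + s for r, s in zip(cases, controls)]

    t_stat = sum(w * (r * s_total - s * r_total)
                 for w, r, s in zip(weights, cases, controls))
    k = len(weights)
    var = sum(weights[i] ** 2 * cols[i] * (n_total - cols[i])
              for i in range(k))
    var -= 2 * sum(weights[i] * weights[j] * cols[i] * cols[j]
                   for i in range(k) for j in range(i + 1, k))
    var *= r_total * s_total / n_total
    if var <= 0:
        raise ValueError("trend test undefined: no variation in the table")

    chi2 = t_stat ** 2 / var
    p_value = math.erfc(math.sqrt(chi2 / 2))  # upper tail, chi-square 1 df
    return chi2, p_value


# Hypothetical counts for 0, 1, 2 copies of the rare allele.
chi2, p = cochran_armitage_trend(cases=[30, 50, 20], controls=[50, 40, 10])
print(round(chi2, 3), p)
```

In a genome-wide scan the same calculation would simply be repeated for every SNP, which is what makes the multiple-testing question pressing.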

    Interpreting the Statistical Results

    Testing for association at each of hundreds of thousands of markers means that traditional statistical significance thresholds (e.g. α = 0.05) are not appropriate.

    That aside (more in a few minutes), if you identify a SNP that is significantly associated with disease, there are three possibilities:

    - There is a causal relationship between SNP and disease

    - The marker is in linkage disequilibrium with a causal locus

    - False positive

    Many potential sources of systematic error might lead to false positive results.

    Genotyping quality control issues are particularly important.

    Confounding by Ancestry

    (a.k.a. Population Stratification)

    Control selection critical as always

    Confounding by ancestry: Distortion of the relationship between the genetic risk factor and the outcome of interest due to ancestry that is related to both the frequency of the putative genetic risk factor and whether or not the subject is a case or a control.

    [Diagram: Ancestry is related to both Genetic Risk Factor and Case/Control Status]

    Population Stratification

    Distribution of genotypes differs between cases and controls

    Might conclude that allele A (or genotype AA) is related to disease

    [Figure: genotype distribution (TT, AT, AA) in cases and in controls]

    Population Stratification

    If cases and controls are not well-matched ancestrally:

    - Unequal distribution of non-disease-related alleles between cases and controls

    - Any allele more common in a population with increased risk of disease may appear to be associated with disease

    [Figure: genotype distribution (TT, AT, AA) in cases and controls, shown separately for Pop 1 and Pop 2]
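
A small numeric sketch of how this happens, using invented allele counts. In each hypothetical population the A allele has the same frequency in cases and controls (no association), but Pop 1 has both a higher A frequency and contributes most of the cases.

```python
def odds_ratio(case_a, case_t, ctrl_a, ctrl_t):
    """Allelic odds ratio for allele A versus allele T."""
    return (case_a * ctrl_t) / (case_t * ctrl_a)


# Hypothetical allele counts: (case A, case T, control A, control T).
pop1 = (800, 200, 160, 40)   # A frequency 0.8 in cases and in controls
pop2 = (40, 160, 200, 800)   # A frequency 0.2 in cases and in controls

# Ignoring ancestry means simply adding the two tables together.
pooled = tuple(x + y for x, y in zip(pop1, pop2))

print(odds_ratio(*pop1))              # 1.0
print(odds_ratio(*pop2))              # 1.0
print(round(odds_ratio(*pooled), 2))  # 5.44
```

The pooled odds ratio is far from 1 even though the allele is unrelated to disease within either population, which is the spurious signal the slide warns about.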

    Population Stratification

    Unequal distribution of alleles may result from

    - Sample made up of more than one distinct population

    - Sample made up of individuals with differing levels of admixture

    Parra et al. AJHG 63:1839, 1998

    Using the GWA Data to Avoid Population Stratification

    Several options exist to allow controlling for ancestry using markers across the genome

    All are based on the idea that stratification should exist across the genome, and that we can use the information on the genome-wide markers to

    - estimate ancestry groups, remove extreme outliers, control for other variation

    - estimate inflation of test statistic and adjust all test statistics

    In each case, assumes a constant effect of ancestry, which may or may not be appropriate

    Bottom line is that with genome-wide data, can do a very good job of understanding the potential for and minimizing the impact of population substructure.
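
The second option (estimate inflation, then adjust all test statistics) is commonly implemented as genomic control. The slides do not name the method, so treating it as genomic control is an assumption; the sketch below uses the usual median-based inflation factor, and the input statistics are invented.

```python
import statistics

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_control(chi2_stats):
    """Estimate the inflation factor lambda and rescale all statistics.

    chi2_stats: 1-df chi-square association statistics, one per SNP.
    Returns (lambda, adjusted statistics).
    """
    lam = statistics.median(chi2_stats) / CHI2_1DF_MEDIAN
    lam = max(lam, 1.0)  # by convention, never inflate the statistics
    return lam, [x / lam for x in chi2_stats]


# Hypothetical statistics whose median is twice the expected null median.
lam, adjusted = genomic_control([0.2, 0.9098728, 6.0])
print(lam, adjusted)
```

Dividing every statistic by the same lambda is what encodes the "constant effect of ancestry" assumption mentioned on the slide.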

    Potential Solutions to Multiple Testing Issue

    Bonferroni correction

    - Assume all tests performed are independent

    - Estimate number of independent polymorphisms in genome

    - Threshold often considered appropriate: 5 x 10^-8.

    Other, less conservative allocation of experiment-wide α over the genome

    - Perhaps spend more on linkage regions or for SNPs in coding regions of genes

    Permutation

    - Implementation for case-control study: permute case and control status, perform all tests, record the most significant p-value among those tests, and then re-permute case-control status and test again. Repeat many times.

    - P-value for most significant test is the proportion of permutations that had a best p-value as small or smaller than the one you observe with the observed data (the data with the right case and control labels).

    Q-Q Plot

    Points that deviate (significantly?) from the line of equality indicate that the two distributions are different.

    Some will take the point at which the observed p-values differ from the expected as the point to declare statistical significance.

    Important points:

    - Can have deviation from the line that is indicative of violated assumptions (e.g. existence of population stratification)

    - In the tails of the distribution, have less information, and so might require large divergence from expected

    Figure from: G. Abecasis
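
The coordinates of such a plot can be computed as below. This is a sketch under assumptions: it uses the -log10 scale and the (i - 0.5)/n plotting positions for the expected uniform p-values, neither of which is specified on the slide, and the function name is hypothetical.

```python
import math


def qq_points(p_values):
    """Return (expected, observed) pairs of -log10 p-values, paired by rank.

    Expected values come from a uniform distribution on (0, 1), which is
    the null distribution of a p-value. All p-values must be > 0.
    """
    n = len(p_values)
    observed = sorted(-math.log10(p) for p in p_values)
    expected = sorted(-math.log10((i - 0.5) / n) for i in range(1, n + 1))
    return list(zip(expected, observed))


# p-values that match the null exactly fall on the line of equality.
null_like = [(i - 0.5) / 10 for i in range(1, 11)]
for exp, obs in qq_points(null_like):
    print(round(exp, 3), round(obs, 3))
```

Plotting observed against expected and comparing with the line y = x gives the picture described above; a shift across the whole range, rather than only in the upper tail, is the pattern that suggests violated assumptions.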

    Using Multiple Samples

    Rationale: Given the very large number of tests performed, use multiple samples as a way to reduce the expected number of false positive results at the end of the study.

    Split-sample

    - Approach: Rather than testing entire sample on entire genome, test for association with some proportion of your samples and then test some proportion of those markers in the rest of your samples*.

    Independent samples

    - Approach: Rather than split your own sample, use another independent sample.

    In either case, can dramatically reduce number of false positive results while maintaining power.
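
A back-of-the-envelope sketch of why a second sample cuts false positives. It assumes every marker is null and that the two stages are independent; the marker count and the two significance levels are invented for illustration, and the calculation says nothing about power.

```python
def expected_false_positives(n_markers, alpha_stage1, alpha_stage2):
    """Expected null markers passing stage 1, and passing both stages."""
    carried_forward = n_markers * alpha_stage1
    final = carried_forward * alpha_stage2
    return carried_forward, final


# Hypothetical design: 500,000 markers, 0.01 at stage 1, 0.001 at stage 2.
carried, final = expected_false_positives(500_000, 0.01, 0.001)
print(carried, final)  # about 5000 carried forward, about 5 at the end
```

Under these assumptions a single-stage analysis at 0.01 would leave about 5,000 false positives, while requiring the second-stage result as well leaves about 5.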

    Granulomatous Lung Diseases

    Chronic Beryllium Disease (CBD)

    - Exposure to beryllium results in formation of granulomas in lung among some individuals

    Sarcoidosis

    - Unknown exposure(s) result in granuloma formation and inflammation in lung, but other organs often involved

    Hypothesis

    Sarcoidosis and CBD share genetic factors important in their similar granulomatous inflammatory pathways

    [Diagram labels: CBD, Sarcoidosis, Disease Severity, Disease Risk]

    GWA : Preliminary data

    Top Region for CBD (p = 10^-11)

    Region Shared by both CBD and Sarcoidosis on 8p23.2 (p = 10^-2 to 10^-4)

    Other Important Topics of Present and Future

    Imputation

    Careful consideration of non-genetic factors

    Investigation of interactions: gene-environment and gene-gene

    Sequencing data

    Acknowledgements

    Wake Forest University: Carl D. Langefeld, PhD

    University of Michigan: Michael Boehnke, PhD; Goncalo R. Abecasis, PhD

    National Jewish Health: Lisa Maier, MPH, MD; Lori Silveira, MS