Lecture 3: Introduction to Association Analysis · sssykim/teaching/f11/slides/Lecture3.pdf · 2011. 9. 6.

Lecture 3: Introduction to Association Analysis

02-715 Advanced Topics in Computational Genomics

Genome Polymorphisms

Type of Polymorphisms

•  Each variant is called an “allele”

•  Almost always bi-allelic

•  Accounts for most of the genetic diversity among different (normal) individuals, e.g. drug response, disease susceptibility

TCGAGGTATTAAC The ancestral chromosome

A Human Genealogy

TCGAGGTATTAAC TCTAGGTATTAAC TCGAGGCATTAAC TCTAGGTGTTAAC TCGAGGTATTAGC TCTAGGTATCAAC

* ** * *

From SNPs …

… To Haplotypes

A disease mutation

Population-Based Association Study

•  Case/control data are collected from unrelated individuals

    –  All individuals are related if we go back far enough in the ancestry

Balding, Nature Reviews Genetics, 2006

Advantages of SNPs in Genetic Analysis of Complex Traits

•  Abundance: high frequency on the genome

•  Position: throughout the genome

    –  coding region, intron region, promoter site

•  Ease of genotyping

•  Less mutable than other forms of polymorphisms

•  SNPs account for around 90% of human genomic variation

•  About 10 million SNPs exist in human populations

•  Most SNPs are outside of the protein coding regions

•  1 SNP every 600 base pairs

•  More than 5 million common SNPs, each with frequency 10-50%, account for the bulk of human DNA sequence difference

•  It is estimated that ~60,000 SNPs occur within exons; 85% of exons are within 5 kb of the nearest SNP

Causal Mutations and Genetic Markers

Figure labels: SNP Marker, Causal Mutation

Linkage Disequilibrium

•  Fine mapping required

Linkage Analysis vs. Association Analysis

Strachan & Read, Human Molecular Genetics, 2001

Overview

•  Single SNP association test

•  Discrete-valued phenotype: case/control study

•  Continuous-valued phenotype: quantitative traits

•  Correcting for multiple testing

•  Leveraging linkage disequilibrium

•  Multimarker association test

•  Genotype imputation method

Single SNP Association Analysis: Case/Control Study

•  For each marker locus, form the 3x2 contingency table containing the counts of the three genotypes

•  χ² test with 2 df, or Fisher’s exact test, under the null hypothesis of no association

Genotype    Case          Control
AA          N_case,AA     N_control,AA
Aa          N_case,Aa     N_control,Aa
aa          N_case,aa     N_control,aa
Total       N_case        N_control

Genotype score = the number of minor alleles
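The 2 df test on the genotype table can be sketched in a few lines. This is a minimal illustration, not the lecture's own code: the function name and the example counts are made up, and it assumes every genotype row has a nonzero total. It uses the fact that the chi-square upper-tail probability with 2 df is exactly exp(-x/2), so no statistics library is needed.

```python
import math

def chi2_genotype_test(case_counts, control_counts):
    """Pearson chi-square test on a 3x2 genotype table (2 df).

    case_counts and control_counts are (AA, Aa, aa) counts.
    Assumes each genotype is observed at least once overall.
    """
    n_case = sum(case_counts)
    n_control = sum(control_counts)
    n_total = n_case + n_control
    stat = 0.0
    for obs_case, obs_control in zip(case_counts, control_counts):
        row_total = obs_case + obs_control
        # Expected counts under the null hypothesis of no association.
        exp_case = row_total * n_case / n_total
        exp_control = row_total * n_control / n_total
        stat += (obs_case - exp_case) ** 2 / exp_case
        stat += (obs_control - exp_control) ** 2 / exp_control
    # With 2 degrees of freedom, P(chi2 > x) = exp(-x / 2).
    p_value = math.exp(-stat / 2.0)
    return stat, p_value

# Made-up counts for illustration only.
stat, p_value = chi2_genotype_test((50, 30, 20), (70, 25, 5))
print(f"chi2 = {stat:.3f}, p = {p_value:.3g}")
```

For small expected counts, Fisher's exact test mentioned above is the safer choice.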

Single SNP Association Analysis: Case/Control Study

•  Alternatively, assume an additive model, where the heterozygote risk is approximately intermediate between those of the two homozygotes

•  Form a 2x2 contingency table of allele counts. Each individual contributes two alleles, one from each of the two chromosomes.

•  χ² test with 1 df

Allele      Case          Control
A           G_case,A      G_control,A
a           G_case,a      G_control,a
Total       2 x N_case    2 x N_control
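A sketch of the allele-count version follows, again with hypothetical function names and made-up counts. Genotype counts are first converted to allele counts (each individual contributes two alleles), then a 1 df chi-square statistic is computed. The p-value uses the identity P(χ² > x) = erfc(sqrt(x/2)) for 1 df.

```python
import math

def allele_counts(genotype_counts):
    """Convert (AA, Aa, aa) genotype counts to (A, a) allele counts."""
    n_AA, n_Aa, n_aa = genotype_counts
    return 2 * n_AA + n_Aa, n_Aa + 2 * n_aa

def chi2_allelic_test(case_genotypes, control_genotypes):
    """Pearson chi-square test on the 2x2 allele table (1 df)."""
    case_A, case_a = allele_counts(case_genotypes)
    ctrl_A, ctrl_a = allele_counts(control_genotypes)
    table = [[case_A, ctrl_A], [case_a, ctrl_a]]
    # Column totals are 2 x N_case and 2 x N_control.
    col_totals = [case_A + case_a, ctrl_A + ctrl_a]
    total = sum(col_totals)
    stat = 0.0
    for row in table:
        row_total = sum(row)
        for observed, col_total in zip(row, col_totals):
            expected = row_total * col_total / total
            stat += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Made-up counts for illustration only.
stat, p_value = chi2_allelic_test((50, 30, 20), (70, 25, 5))
print(f"chi2 = {stat:.3f}, p = {p_value:.3g}")
```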

Single SNP Association Analysis: Continuous-valued Traits

•  Continuous-valued traits

    –  Also called quantitative traits

    –  Cholesterol level, blood pressure, etc.

•  For each locus, fit a linear regression using the number of minor alleles at the given locus of the individual as the covariate
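The per-locus regression can be sketched with ordinary least squares written out by hand. The function name and the toy data are assumptions for illustration. The code assumes the genotype scores are not all identical and that the fit is not perfect, so the standard error is nonzero. It returns the slope, the intercept, and the t-statistic of the slope; a p-value would come from a t distribution with n - 2 degrees of freedom.

```python
import math

def snp_linear_regression(genotype_scores, trait_values):
    """Regress a quantitative trait on the minor-allele count (0, 1, 2)."""
    n = len(genotype_scores)
    mean_x = sum(genotype_scores) / n
    mean_y = sum(trait_values) / n
    sxx = sum((x - mean_x) ** 2 for x in genotype_scores)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(genotype_scores, trait_values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residual_ss = sum((y - intercept - slope * x) ** 2
                      for x, y in zip(genotype_scores, trait_values))
    # Standard error of the slope, using n - 2 residual degrees of freedom.
    std_err = math.sqrt(residual_ss / (n - 2) / sxx)
    t_stat = slope / std_err
    return slope, intercept, t_stat

# Toy data: trait roughly equal to 1 + 2 * (minor allele count).
scores = [0, 1, 2, 0, 1, 2]
trait = [1.0, 3.1, 4.9, 0.9, 2.9, 5.1]
slope, intercept, t_stat = snp_linear_regression(scores, trait)
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}, t = {t_stat:.1f}")
```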

Genetic Model for Association

•  Additive effect

    –  Major allele homozygote: 0

    –  Heterozygote: a + a x k

    –  Minor allele homozygote: 2a

•  k=1: dominant effect of the minor allele

•  k=0: no dominance

•  k=-1: dominant effect of the major allele
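The genotype effects above can be written as a small function to check the three values of k. The function name and the value a = 0.5 are illustrative assumptions. With k = 1 the heterozygote matches the minor-allele homozygote (2a), with k = 0 it sits halfway (a), and with k = -1 it matches the major-allele homozygote (0).

```python
def genotype_effect(minor_allele_count, a, k):
    """Effect of a genotype under the model 0, a + a*k, 2a."""
    if minor_allele_count == 0:
        return 0.0            # major allele homozygote
    if minor_allele_count == 1:
        return a + a * k      # heterozygote
    if minor_allele_count == 2:
        return 2.0 * a        # minor allele homozygote
    raise ValueError("minor_allele_count must be 0, 1, or 2")

for k in (1, 0, -1):
    effects = [genotype_effect(g, a=0.5, k=k) for g in (0, 1, 2)]
    print(f"k = {k:>2}: effects for 0, 1, 2 minor alleles = {effects}")
```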

Penetrance

•  Proportion of individuals carrying a particular allele who possess the associated trait

•  Alleles with high penetrance are easier to detect in association analysis

Correcting for Multiple Testing

•  What happens when we scan the genome of 1 million markers for association with α = 0.05?

    –  50,000 (= 1 million × 0.05) SNPs are expected to be found significant just by chance

    –  We need to be more conservative when we decide a given marker is significantly associated with the trait.

•  Correction methods

    –  Bonferroni correction

    –  Permutation test

Bonferroni Correction

•  If N markers are tested, we correct the significance level as α’ = α/N

    –  Treats the N tests as if they were independent, although markers in linkage disequilibrium are correlated

    –  Overly conservative for tightly linked markers
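The arithmetic behind the correction is short enough to show directly. The function names and the example p-values are illustrative assumptions. For 1 million markers at α = 0.05, about 50,000 false positives are expected without correction, and the Bonferroni threshold becomes 5e-8.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level alpha' = alpha / N."""
    return alpha / n_tests

def bonferroni_significant(p_values, alpha=0.05):
    """Flag each p-value that passes the Bonferroni-corrected threshold."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [p <= threshold for p in p_values]

n_markers = 1_000_000
alpha = 0.05
# Expected number of markers called significant by chance without correction.
expected_false_positives = n_markers * alpha
threshold = bonferroni_threshold(alpha, n_markers)
print(f"expected false positives: {expected_false_positives:.0f}")
print(f"corrected threshold: {threshold:.1e}")
```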

Permutation Procedure

•  Step 1: Compute the test statistic T using the original dataset

•  Step 2: Set Nsig = 0

•  Step 3: Repeat for i = 1, ..., Nperm

    –  Step 3a: Randomly permute the individuals in the phenotype data to generate a dataset with no association (retain the original genotypes)

    –  Step 3b: Compute the test statistic Tperm for the SNP using the permuted dataset

    –  Step 3c: If T > Tperm, set Nsig = Nsig + 1

•  Step 4: Compute the p-value as (1 - Nsig/Nperm)

This approach is computationally demanding because often a large Nperm is required.
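The steps above can be sketched for a single SNP as follows. The slides do not fix a test statistic, so the absolute difference in mean genotype score between cases and controls is used here purely as an assumed stand-in; the function names, the seed, and the toy data are also illustrative. Phenotypes are coded 1 for case and 0 for control.

```python
import random

def mean_difference(genotype_scores, phenotypes):
    """Assumed test statistic: |mean score in cases - mean score in controls|."""
    cases = [g for g, y in zip(genotype_scores, phenotypes) if y == 1]
    controls = [g for g, y in zip(genotype_scores, phenotypes) if y == 0]
    return abs(sum(cases) / len(cases) - sum(controls) / len(controls))

def permutation_p_value(genotype_scores, phenotypes, n_perm=1000, seed=0):
    rng = random.Random(seed)
    t_obs = mean_difference(genotype_scores, phenotypes)       # Step 1
    n_sig = 0                                                   # Step 2
    permuted = list(phenotypes)
    for _ in range(n_perm):                                     # Step 3
        rng.shuffle(permuted)                                   # Step 3a: genotypes stay fixed
        t_perm = mean_difference(genotype_scores, permuted)     # Step 3b
        if t_obs > t_perm:                                      # Step 3c
            n_sig += 1
    return 1.0 - n_sig / n_perm                                 # Step 4

# Toy data: cases carry more minor alleles than controls.
genotypes = [2, 2, 1, 2, 1, 2, 2, 1, 2, 2, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1]
phenotypes = [1] * 10 + [0] * 10
print(permutation_p_value(genotypes, phenotypes))
```

Shuffling only the phenotype labels keeps the number of cases and controls unchanged, which is what makes each permuted dataset a draw from the no-association null.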

Multi-marker Association Test

•  Idea: a haplotype of multiple SNPs is a better proxy for a true causal SNP than a single SNP

    –  Exploit the linkage disequilibrium structure in the genome

•  Form a new allele by combining multiple SNPs for a haplotype

•  Test the haplotype allele for association

SNP A   SNP B   Auxiliary Markers for Haplotypes
0       0       1 0 0 0
0       1       0 1 0 0
1       0       0 0 1 0
1       1       0 0 0 1
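The mapping from SNP alleles to auxiliary haplotype markers is an indicator (one-hot) encoding. A minimal sketch, with hypothetical function names, that reproduces the two-SNP table and extends to more SNPs:

```python
from itertools import product

def haplotype_index(n_snps):
    """Enumerate all 0/1 haplotypes of n_snps SNPs and give each a column index."""
    haplotypes = list(product((0, 1), repeat=n_snps))
    return {haplotype: i for i, haplotype in enumerate(haplotypes)}

def encode_haplotype(alleles, index):
    """Return the auxiliary marker vector with a single 1 for this haplotype."""
    markers = [0] * len(index)
    markers[index[tuple(alleles)]] = 1
    return markers

index = haplotype_index(2)
for snp_a, snp_b in index:
    print(snp_a, snp_b, encode_haplotype((snp_a, snp_b), index))
```

Each auxiliary marker can then be tested for association like a single bi-allelic marker.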

Multi-marker Association Test

•  Multi-marker approach can capture dependencies across multiple markers

    –  SNPs in LD form a haplotype that can be tested as a single allele

    –  Can achieve the same power with data collected for fewer samples

•  Challenge as the size of haplotype increases

    –  A haplotype of K SNPs results in 2^K different haplotypes, but the number of samples corresponding to each haplotype decreases quickly as we increase K

    –  Large K requires a large sample size
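A quick calculation shows how fast the data thin out as K grows. The sketch below assumes, purely for illustration, that samples are spread evenly over all 2^K haplotypes; in real data linkage disequilibrium makes some haplotypes common and others absent, so this is only a rough guide. The function name and the sample size of 1000 are made up.

```python
def average_samples_per_haplotype(n_samples, k):
    """Average samples per haplotype if all 2**k haplotypes were equally common."""
    return n_samples / 2 ** k

for k in (1, 2, 5, 10):
    avg = average_samples_per_haplotype(1000, k)
    print(f"K = {k:>2}: {2 ** k:>5} haplotypes, about {avg:.2f} samples each")
```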

Imputation-Based Methods (Servin & Stephens, 2007)

Figure labels: Tag SNP, Non-tag SNP

Yeast Genomic Datasets

•  Yeast genomic datasets

    -  Genotypes from 112 segregants from a yeast cross between BY and RM strains

    -  Microarray gene-expression data

    -  Transcription factor binding site data

    -  Protein-protein interaction data

Analysis Procedure (Zhu et al.)

•  Gene expression data analysis to infer gene coexpression network

•  eQTL analysis

•  Learning a predictive model for the yeast gene network

    –  Integrate multiple genomic data to infer the gene network

        •  gene expression/eQTL/TFBS/PPI data

Gene Coexpression Network

•  Hierarchical clustering of genes

•  Identified gene modules

•  How to validate the gene modules?

    –  GO enrichment analysis as a proxy

Gene Set Enrichment Analysis

•  Given a set of K genes, we would like to test whether these genes share a common function.

    –  KEGG pathway, GO can serve as a proxy for a common function

    –  Is a particular KEGG pathway or GO term enriched in our set of K genes of interest?

•  Analogy to urn model

    –  Given an urn of N balls with M black balls and (N-M) white balls, we are drawing n balls without replacement. What is the probability of drawing k black balls?

Gene Set Enrichment Test

•  The universe of genes: N genes

•  In this universe, genes labeled as GO term A: M genes

•  Suppose we have a set of n genes for which we would like to test enrichment for GO term A

    –  The probability of at most k genes being labeled as GO term A, under the urn (hypergeometric) model:

$$P(X \le k) = \sum_{i=0}^{k} \frac{\binom{M}{i}\binom{N-M}{n-i}}{\binom{N}{n}}$$
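The urn model corresponds to the hypergeometric distribution, which can be evaluated with exact binomial coefficients. The sketch below uses hypothetical function names and a toy universe of 10 genes. It computes the "at most k" probability stated above and also the upper tail P(X ≥ k), which is the quantity usually reported as an enrichment p-value; which tail the original slide used is not recoverable from the text, so both are shown.

```python
from math import comb

def hypergeom_pmf(k, N, M, n):
    """P(exactly k of the n drawn genes carry the term), M of N genes labeled."""
    return comb(M, k) * comb(N - M, n - k) / comb(N, n)

def prob_at_most(k, N, M, n):
    """P(X <= k): probability that at most k drawn genes carry the term."""
    upper = min(k, M, n)
    return sum(hypergeom_pmf(i, N, M, n) for i in range(0, upper + 1))

def prob_at_least(k, N, M, n):
    """P(X >= k): upper tail, commonly used as the enrichment p-value."""
    return sum(hypergeom_pmf(i, N, M, n) for i in range(k, min(M, n) + 1))

# Toy example: 10 genes in the universe, 4 labeled with the term, 3 drawn.
N, M, n = 10, 4, 3
print(prob_at_most(1, N, M, n))   # P(X <= 1)
print(prob_at_least(2, N, M, n))  # P(X >= 2)
```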

Network Modules, GO Enrichment, eQTL Hotspots

eQTL Hotspots

•  eQTL hotspots: pleiotropic control of multiple genes by a common genomic locus

•  cis eQTL: affected genes are physically located in cis to the genomic locus

•  trans eQTL: affected genes are located distantly from the eQTL

Network Modules, GO Enrichment, eQTL Hotspots

eQTL Hotspots

•  No ground truth for eQTLs. How to validate the results?

    –  Use results from knockout experiments, TFBS experiments as a proxy

–  Again, gene set enrichment analysis

TFBS Target Enrichment, Knock-Out Signature Enrichment

Learning Bayesian Networks: Integrating Different Genomic Data

•  Incorporating more genomic data into network learning can increase the predictive power for regulators

    –  Bayesian network I (BNraw)

        •  Derived from gene expression data

    –  Bayesian network II (BNqtl)

        •  Derived from gene expression, eQTL data

    –  Bayesian network III (BNfull)

        •  Derived from gene expression, eQTL, TFBS (ChIP-chip experiments), PPI data

Incorporating eQTLs in Network Learning

•  A two-step analysis:

    –  First perform eQTL analysis

    –  Incorporate the identified eQTLs in the network learning process

•  For a given eQTL, genes with cis eQTLs can be parents of genes with trans eQTLs

•  For a given eQTL, genes with trans eQTLs are not allowed to be parents of genes with cis eQTLs.
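The two rules above act as a constraint on which directed edges a structure search may consider. A minimal sketch for a single eQTL follows; the function names and gene identifiers are hypothetical, and how Zhu et al. actually encode the constraint (for example as a hard filter or as a prior) is not stated in these slides.

```python
def edge_allowed(parent, child, cis_genes, trans_genes):
    """Return False only for the direction the rules forbid: trans gene -> cis gene."""
    return not (parent in trans_genes and child in cis_genes)

def candidate_parents(child, genes, cis_genes, trans_genes):
    """List the genes that may be considered as parents of child."""
    return [g for g in genes
            if g != child and edge_allowed(g, child, cis_genes, trans_genes)]

# Hypothetical genes linked to one eQTL: g1 in cis, g2 and g3 in trans.
genes = ["g1", "g2", "g3", "g4"]
cis_genes = {"g1"}
trans_genes = {"g2", "g3"}
print(candidate_parents("g2", genes, cis_genes, trans_genes))  # g1 may be a parent of g2
print(candidate_parents("g1", genes, cis_genes, trans_genes))  # g2 and g3 are excluded
```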

Computationally Identified Causal Regulators
