Lecture 3: Introduction to Association Analysis
02-715 Advanced Topics in Computational Genomics
Genome Polymorphisms
Type of Polymorphisms
• Each variant is called an “allele”
• Almost always bi-allelic
• Account for most of the genetic diversity among different (normal) individuals, e.g. drug response, disease susceptibility
A Human Genealogy
The ancestral chromosome: TCGAGGTATTAAC
Descendant chromosomes (* marks the variable sites):
TCGAGGTATTAAC
TCTAGGTATTAAC
TCGAGGCATTAAC
TCTAGGTGTTAAC
TCGAGGTATTAGC
TCTAGGTATCAAC
  *   ** * *
From SNPs … to haplotypes
A disease mutation
Population-Based Association Study
• Case/control data are collected from unrelated individuals
– All individuals are related if we go back far enough in the ancestry
Balding, Nature Reviews Genetics, 2006
Advantages of SNPs in Genetic Analysis of Complex Traits
• Abundance: high frequency on the genome
• Position: throughout the genome – coding region, intron region, promoter site
• Ease of genotyping
• Less mutable than other forms of polymorphisms
• SNPs account for around 90% of human genomic variation
• About 10 million SNPs exist in human populations
• Most SNPs are outside of the protein coding regions
• 1 SNP every 600 base pairs
• More than 5 million common SNPs, each with frequency 10-50%, account for the bulk of human DNA sequence difference
• It is estimated that ~60,000 SNPs occur within exons; 85% of exons are within 5 kb of the nearest SNP
Causal Mutations and Genetic Markers
X X X
SNP Marker Causal Mutation
Linkage Disequilibrium
• Fine mapping required
Linkage Analysis vs. Association Analysis
Strachan & Read, Human Molecular Genetics, 2001
Overview
• Single SNP association test
• Discrete-valued phenotype: case/control study
• Continuous-valued phenotype: quantitative traits
• Correcting for multiple testing
• Leveraging linkage disequilibrium
• Multimarker association test
• Genotype imputation method
Single SNP Association Analysis: Case/Control Study
• For each marker locus, find the 3x2 contingency table containing the counts of the three genotypes
• χ² test with 2 df, or Fisher’s exact test, under the null hypothesis of no association
Genotype   Case        Control
AA         Ncase,AA    Ncontrol,AA
Aa         Ncase,Aa    Ncontrol,Aa
aa         Ncase,aa    Ncontrol,aa
Total      Ncase       Ncontrol
Genotype score = the number of minor alleles
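The 2 df test on the 3x2 genotype table can be sketched as follows. This is a minimal illustration, not part of the original slides: the function name and the example counts are made up, and it assumes every genotype is observed at least once so that no expected count is zero. For 2 degrees of freedom the chi-square tail probability has the closed form exp(-x/2), so no statistics library is needed.

```python
import math

def chi2_genotype_test(case_counts, control_counts):
    """Pearson chi-square test on a 3x2 genotype table (2 df).

    case_counts, control_counts: counts for genotypes (AA, Aa, aa).
    Assumes each genotype row has a non-zero total.
    """
    n_case, n_control = sum(case_counts), sum(control_counts)
    n_total = n_case + n_control
    stat = 0.0
    for obs_case, obs_control in zip(case_counts, control_counts):
        row_total = obs_case + obs_control
        for observed, col_total in ((obs_case, n_case), (obs_control, n_control)):
            # Expected count under the null hypothesis of no association
            expected = row_total * col_total / n_total
            stat += (observed - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 2 df
    p_value = math.exp(-stat / 2.0)
    return stat, p_value

# Hypothetical counts: (AA, Aa, aa) in cases and in controls
stat, p_value = chi2_genotype_test((30, 50, 20), (50, 40, 10))
print(f"chi2 = {stat:.3f}, p = {p_value:.4f}")
```

When some cells are small, Fisher’s exact test (the alternative named on the slide) is usually the safer choice.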
Single SNP Association Analysis: Case/Control Study
• Alternatively, assume an additive model, where the heterozygote risk lies approximately midway between the risks of the two homozygotes
• Form a 2x2 contingency table of allele counts. Each individual contributes two alleles, one from each of the two chromosomes.
• χ² test with 1 df
Allele   Case       Control
A        Gcase,A    Gcontrol,A
a        Gcase,a    Gcontrol,a
Total    2xNcase    2xNcontrol
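A sketch of the allele-based 1 df test, under the assumption that genotype counts are available and are converted to allele counts (each individual contributes two alleles). The function name and example counts are illustrative only. With 1 degree of freedom, the chi-square tail probability equals erfc(sqrt(x/2)).

```python
import math

def allelic_chi2_test(case_genotypes, control_genotypes):
    """Chi-square test (1 df) on the 2x2 table of allele counts.

    case_genotypes, control_genotypes: counts for genotypes (AA, Aa, aa).
    Assumes both alleles are observed at least once overall.
    """
    def allele_counts(genotype_counts):
        n_AA, n_Aa, n_aa = genotype_counts
        # Each homozygote carries two copies, each heterozygote one of each
        return 2 * n_AA + n_Aa, 2 * n_aa + n_Aa

    case_alleles = allele_counts(case_genotypes)        # (G_case,A, G_case,a)
    control_alleles = allele_counts(control_genotypes)  # (G_control,A, G_control,a)
    n_case, n_control = sum(case_alleles), sum(control_alleles)
    n_total = n_case + n_control

    stat = 0.0
    for obs_case, obs_control in zip(case_alleles, control_alleles):
        row_total = obs_case + obs_control
        for observed, col_total in ((obs_case, n_case), (obs_control, n_control)):
            expected = row_total * col_total / n_total
            stat += (observed - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 df
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

stat, p_value = allelic_chi2_test((30, 50, 20), (50, 40, 10))
print(f"chi2 = {stat:.3f}, p = {p_value:.4f}")
```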
Single SNP Association Analysis: Continuous-valued Traits
• Continuous-valued traits
– Also called quantitative traits
– Cholesterol level, blood pressure, etc.
• For each locus, fit a linear regression using the number of minor alleles at the given locus of the individual as the covariate
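The per-locus regression can be sketched with ordinary least squares on a single covariate. The function name and the toy data below are hypothetical. The fitted slope estimates the change in trait value per copy of the minor allele; testing whether that slope differs from zero is the association test.

```python
def fit_additive_regression(minor_allele_counts, trait_values):
    """Least-squares fit of trait = intercept + slope * (number of minor alleles).

    Assumes the genotype scores are not all identical (otherwise the
    slope is undefined).
    """
    n = len(minor_allele_counts)
    mean_x = sum(minor_allele_counts) / n
    mean_y = sum(trait_values) / n
    sxx = sum((x - mean_x) ** 2 for x in minor_allele_counts)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(minor_allele_counts, trait_values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Toy data: genotype scores (0, 1, or 2 minor alleles) and a quantitative trait
scores = [0, 1, 2, 0, 1, 2]
traits = [1.0, 1.5, 2.0, 1.0, 1.5, 2.0]
intercept, slope = fit_additive_regression(scores, traits)
print(intercept, slope)
```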
Genetic Model for Association
• Additive effect
– Major allele homozygote: 0
– Heterozygote: a + a x k
– Minor allele homozygote: 2a
• k=1: dominant effect of the minor allele
• k=0: no dominance
• k=-1: dominant effect of the major allele (the minor allele is recessive)
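The three genotype values can be written as one small function, which makes the role of k easy to check. The function name is illustrative. With k=1 the heterozygote equals the minor allele homozygote (2a), with k=0 it sits exactly halfway (a), and with k=-1 it equals the major allele homozygote (0).

```python
def genotypic_value(minor_allele_count, a, k):
    """Genotype effect under the model 0 / (a + a*k) / 2a."""
    if minor_allele_count == 0:
        return 0.0            # major allele homozygote
    if minor_allele_count == 1:
        return a + a * k      # heterozygote
    if minor_allele_count == 2:
        return 2.0 * a        # minor allele homozygote
    raise ValueError("minor_allele_count must be 0, 1, or 2")

for k in (1, 0, -1):
    print(k, [genotypic_value(g, 1.0, k) for g in (0, 1, 2)])
```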
Penetrance
• Proportion of individuals carrying a particular allele that possess an associated trait
• Alleles with high penetrance are easier to detect in association analysis
Correcting for Multiple Testing
• What happens when we scan the genome of 1 million markers for association with α = 0.05?
– 50,000 (= 1 million x 0.05) SNPs are expected to be found significant just by chance
– We need to be more conservative when we decide a given marker is significantly associated with the trait.
• Correction methods
– Bonferroni correction
– Permutation test
Bonferroni Correction
• If N markers are tested, we correct the significance level as α’ = α/N
– Treats the N tests as if they were independent, although they are not because of linkage disequilibrium.
– Overly conservative for tightly linked markers
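The arithmetic behind the correction, using the 1 million marker scan with α = 0.05 as the example. The helper name is illustrative.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level alpha' = alpha / N."""
    return alpha / n_tests

n_markers = 1_000_000
alpha = 0.05

# Without correction: expected number of markers significant by chance alone
expected_false_positives = n_markers * alpha

# With correction: each marker must reach this much stricter level
threshold = bonferroni_threshold(alpha, n_markers)

print(expected_false_positives)  # about 50,000
print(threshold)                 # about 5e-08
```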
Permutation Procedure
• Step 1: Compute the test statistic T using the original dataset
• Step 2: Set Nsig = 0
• Step 3: Repeat for 1:Nperm
– Step 3a: Randomly permute the individuals in the phenotype data to generate a dataset with no association (retain the original genotypes)
– Step 3b: Find the test statistic Tperm of the SNPs using the permuted dataset
– Step 3c: If T > Tperm, Nsig = Nsig+1
• Step 4: Compute the p-value as (1 - Nsig/Nperm)
This approach is computationally demanding because often a large Nperm is required.
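The steps above can be sketched for a single SNP as follows. The test statistic used here (absolute difference in mean genotype score between cases and controls) is an assumption chosen for brevity; any association statistic could be passed in instead. The function names, the seed parameter, and the toy data are all illustrative.

```python
import random

def mean_difference(genotypes, phenotypes):
    """Illustrative statistic: |mean genotype score in cases - in controls|."""
    cases = [g for g, y in zip(genotypes, phenotypes) if y == 1]
    controls = [g for g, y in zip(genotypes, phenotypes) if y == 0]
    return abs(sum(cases) / len(cases) - sum(controls) / len(controls))

def permutation_p_value(genotypes, phenotypes, statistic, n_perm=1000, seed=0):
    rng = random.Random(seed)
    t_obs = statistic(genotypes, phenotypes)     # Step 1
    n_sig = 0                                    # Step 2
    shuffled = list(phenotypes)
    for _ in range(n_perm):                      # Step 3
        rng.shuffle(shuffled)                    # Step 3a: permute phenotypes only
        t_perm = statistic(genotypes, shuffled)  # Step 3b
        if t_obs > t_perm:                       # Step 3c
            n_sig += 1
    return 1.0 - n_sig / n_perm                  # Step 4

# Toy data: genotype scores perfectly separated by case (1) / control (0) status
genotypes = [2] * 10 + [0] * 10
phenotypes = [1] * 10 + [0] * 10
print(permutation_p_value(genotypes, phenotypes, mean_difference))
```

In a genome-wide scan the same permuted phenotype vector would typically be reused across all SNPs in each round, which is where the computational cost comes from.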
Multi-marker Association Test
• Idea: a haplotype of multiple SNPs is a better proxy for a true causal SNP than a single SNP
– Exploit the linkage disequilibrium structure in the genome
• Form a new allele by combining multiple SNPs for a haplotype
• Test the haplotype allele for association
SNP A   SNP B   Auxiliary Markers for Haplotypes
0       0       1 0 0 0
0       1       0 1 0 0
1       0       0 0 1 0
1       1       0 0 0 1
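The auxiliary-marker encoding amounts to a one-hot code over the possible haplotypes. A minimal sketch, assuming bi-allelic SNPs coded 0/1 and a hypothetical function name: the alleles are read as a binary number that selects which of the 2^K indicator positions is set.

```python
def haplotype_indicators(snp_alleles):
    """Map a haplotype (sequence of 0/1 alleles) to 2**K auxiliary markers."""
    index = 0
    for allele in snp_alleles:
        # Read the alleles as binary digits, first SNP most significant
        index = index * 2 + allele
    indicators = [0] * (2 ** len(snp_alleles))
    indicators[index] = 1
    return indicators

for haplotype in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(haplotype, haplotype_indicators(haplotype))
```

Each auxiliary marker can then be tested for association like a single bi-allelic marker.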
Multi-marker Association Test
• Multi-marker approach can capture dependencies across multiple markers
– SNPs in LD form a haplotype that can be tested as a single allele
– Can achieve the same power with data collected for fewer samples
• Challenge as the size of haplotype increases
– A haplotype of K SNPs results in 2^K different haplotypes, but the number of samples corresponding to each haplotype decreases quickly as we increase K
– Large K requires a large sample size
Imputation-Based Methods (Servin & Stephens, 2007)
Tag SNP Non-tag SNP
Yeast Genomic Datasets
• Yeast genomic datasets
– Genotypes from 112 segregants from a yeast cross between BY and RM strains
– Microarray gene-expression data
– Transcription factor binding site data
– Protein-protein interaction data
Analysis Procedure (Zhu et al.)
• Gene expression data analysis to infer gene coexpression network
• eQTL analysis
• Learning a predictive model for yeast gene network
– Integrate multiple genomic data to infer gene network
• gene expression/eQTL/TFBS/PPI data
Gene Coexpression Network
• Hierarchical clustering of genes
• Identified gene modules
• How to validate the gene modules?
– GO enrichment analysis as a proxy
Gene Set Enrichment Analysis
• Given a set of K genes, we would like to test whether these genes share a common function.
– KEGG pathway, GO can serve as a proxy for a common function
– Is a particular KEGG pathway or GO term enriched in our set of K genes of interest?
• Analogy to urn model
– Given an urn of N balls with M black balls and (N-M) white balls, we are drawing n balls without replacement. What is the probability of drawing k black balls?
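The urn question has a standard closed-form answer, the hypergeometric distribution. Writing X for the number of black balls among the n drawn (X is notation introduced here, not taken from the slides):

```latex
P(X = k) \;=\; \frac{\binom{M}{k}\,\binom{N-M}{\,n-k\,}}{\binom{N}{n}},
\qquad \max(0,\, n-(N-M)) \le k \le \min(M,\, n)
```

The numerator counts the ways to choose k of the M black balls and n-k of the N-M white balls; the denominator counts all ways to draw n balls from N.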
Gene Set Enrichment Test
• The universe of genes: N genes
• In this universe, genes labeled as GO term A: M genes
• Suppose we have a set of n genes for which we would like to test enrichment for GO term A
– The probability of at most k genes to be labeled as GO term A:
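Under the urn analogy, the number of genes in the set labeled with GO term A follows a hypergeometric distribution, so the "at most k" probability is a sum of hypergeometric terms. The sketch below computes that sum and, as a separate and commonly used quantity, the upper tail P(X >= k), which is what is normally reported as an enrichment p-value. Function names and the example numbers are illustrative, not taken from the slides.

```python
from math import comb

def hypergeom_pmf(k, N, M, n):
    """P(exactly k of the n selected genes carry the label)."""
    return comb(M, k) * comb(N - M, n - k) / comb(N, n)

def prob_at_most(k, N, M, n):
    """P(X <= k): at most k labeled genes among the n selected."""
    upper = min(k, M, n)
    return sum(hypergeom_pmf(i, N, M, n) for i in range(0, upper + 1))

def enrichment_p_value(k, N, M, n):
    """P(X >= k): probability of seeing k or more labeled genes by chance."""
    return sum(hypergeom_pmf(i, N, M, n) for i in range(k, min(M, n) + 1))

# Toy universe: N=10 genes, M=4 labeled with the term, n=3 genes in our set
print(prob_at_most(1, N=10, M=4, n=3))
print(enrichment_p_value(2, N=10, M=4, n=3))
```

The two quantities are complementary: P(X >= k) = 1 - P(X <= k-1).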
Network Modules, GO Enrichment, eQTL Hotspots
eQTL Hotspots
• eQTL hotspots: pleiotropic control of multiple genes by a common genomic locus
• cis eQTL: affected genes are physically located in cis to the genomic locus
• trans eQTL: affected genes are located distantly from the eQTL
eQTL Hotspots
• No ground truth for eQTLs. How to validate the results?
– Use results from knockout experiments, TFBS experiments as a proxy
– Again, gene set enrichment analysis
TFBS Target Enrichment, Knock-Out Signature Enrichment
Learning Bayesian Networks: Integrating Different Genomic Data
• Incorporating more genomic data into network learning can increase the predictive power for regulators
– Bayesian network I (BNraw)
• Derived from gene expression data
– Bayesian network II (BNqtl)
• Derived from gene expression, eQTL data
– Bayesian network III (BNfull)
• Derived from gene expression, eQTL, TFBS (ChIP-chip experiments), PPI data
Incorporating eQTLs in Network Learning
• A two-step analysis:
– First perform eQTL analysis
– Incorporate the identified eQTLs in the network learning process
• For a given eQTL, genes with cis eQTLs can be parents of genes with trans eQTLs
• For a given eQTL, genes with trans eQTLs are not allowed to be parents of genes with cis eQTLs.
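One way to picture the two rules is as a filter on candidate edges during structure search. This is a rough sketch under assumed data structures (dictionaries mapping each eQTL locus to its cis and trans gene sets) with hypothetical names; it is not the procedure used in the cited analysis, only an illustration of the constraint.

```python
def edge_allowed(parent, child, cis_genes_by_eqtl, trans_genes_by_eqtl):
    """Return False if parent -> child would make a trans gene the parent
    of a cis gene for the same eQTL; every other edge is left to the
    network learning procedure."""
    for locus, trans_genes in trans_genes_by_eqtl.items():
        cis_genes = cis_genes_by_eqtl.get(locus, set())
        if parent in trans_genes and child in cis_genes:
            return False
    return True

# Hypothetical example: one eQTL with one cis gene and one trans gene
cis = {"locus1": {"geneA"}}
trans = {"locus1": {"geneB"}}
print(edge_allowed("geneA", "geneB", cis, trans))  # cis -> trans is permitted
print(edge_allowed("geneB", "geneA", cis, trans))  # trans -> cis is forbidden
```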
Computationally Identified Causal Regulators