Lecture 3: Introduction to Association Analysis
02-715 Advanced Topics in Computational Genomics

sssykim/teaching/f11/slides/Lecture3.pdf · 2011. 9. 6.

Genome Polymorphisms

Type of Polymorphisms

• Each variant is called an "allele"
• Almost always bi-allelic
• Account for most of the genetic diversity among different (normal) individuals, e.g. drug response, disease susceptibility

A Human Genealogy

TCGAGGTATTAAC: the ancestral chromosome

From SNPs …

TCGAGGTATTAAC
TCTAGGTATTAAC
TCGAGGCATTAAC
TCTAGGTGTTAAC
TCGAGGTATTAGC
TCTAGGTATCAAC
  *   ** * *
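
The asterisks mark the alignment columns at which the six chromosomes differ. As a minimal sketch (the function name `polymorphic_sites` is an illustrative choice, not something from the slides), these columns can be found by comparing the aligned sequences position by position:

```python
def polymorphic_sites(sequences):
    """Return 1-based positions where the aligned sequences do not all agree."""
    return [
        i + 1
        for i, column in enumerate(zip(*sequences))
        if len(set(column)) > 1  # more than one distinct base in this column
    ]


# The six chromosomes shown on the slide
chromosomes = [
    "TCGAGGTATTAAC",
    "TCTAGGTATTAAC",
    "TCGAGGCATTAAC",
    "TCTAGGTGTTAAC",
    "TCGAGGTATTAGC",
    "TCTAGGTATCAAC",
]

snp_positions = polymorphic_sites(chromosomes)
print(snp_positions)
```

Each reported position is a candidate SNP; every other column is monomorphic in this sample.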

… To Haplotypes  

A disease mutation

Population-Based Association Study

• Case/control data are collected from unrelated individuals
  – All individuals are related if we go back far enough in the ancestry

Balding, Nature Reviews Genetics, 2006

Advantages of SNPs in Genetic Analysis of Complex Traits

• Abundance: high frequency on the genome

• Position: throughout the genome
  – coding region, intron region, promoter site

• Ease of genotyping

• Less mutable than other forms of polymorphisms

• SNPs account for around 90% of human genomic variation

• About 10 million SNPs exist in human populations

• Most SNPs are outside of the protein coding regions

• 1 SNP every 600 base pairs

• More than 5 million common SNPs, each with frequency 10-50%, account for the bulk of human DNA sequence difference

• It is estimated that ~60,000 SNPs occur within exons; 85% of exons are within 5 kb of the nearest SNP

Causal Mutations and Genetic Markers

[Figure: a SNP marker and a causal mutation, related through linkage disequilibrium]

• Fine mapping required

Linkage Analysis vs. Association Analysis

Strachan & Read, Human Molecular Genetics, 2001

Overview

• Single SNP association test
• Discrete-valued phenotype: case/control study
• Continuous-valued phenotype: quantitative traits
• Correcting for multiple testing
• Leveraging linkage disequilibrium
• Multimarker association test
• Genotype imputation method

Single SNP Association Analysis: Case/Control Study

• For each marker locus, find the 3x2 contingency table containing the counts of the three genotypes

• χ² test with 2 df, or Fisher's exact test, under the null hypothesis of no association

Genotype   Case         Control
AA         N_case,AA    N_control,AA
Aa         N_case,Aa    N_control,Aa
aa         N_case,aa    N_control,aa

Total      N_case       N_control

Genotype score = the number of minor alleles
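
As a rough illustration of the genotypic test, the sketch below computes the Pearson χ² statistic for a 3x2 table using only the Python standard library. The function name and the counts are invented for illustration. The p-value uses the fact that the χ² survival function with 2 df is exp(-x/2). The sketch assumes every genotype row has a non-zero total.

```python
import math


def chi2_genotype_test(case_counts, control_counts):
    """Pearson chi-square test on a 3x2 genotype table (rows AA, Aa, aa)."""
    col_totals = [sum(case_counts), sum(control_counts)]
    n = sum(col_totals)
    stat = 0.0
    for row in zip(case_counts, control_counts):
        row_total = sum(row)
        for observed, col_total in zip(row, col_totals):
            # Expected count under the null hypothesis of no association
            expected = row_total * col_total / n
            stat += (observed - expected) ** 2 / expected
    # With 2 degrees of freedom, P(chi2 > x) = exp(-x / 2)
    p_value = math.exp(-stat / 2)
    return stat, p_value


# Made-up genotype counts (AA, Aa, aa), for illustration only
stat, p_value = chi2_genotype_test([50, 30, 20], [70, 20, 10])
print(f"chi2 = {stat:.3f}, p = {p_value:.4f}")
```

For small expected counts the χ² approximation is poor, which is where Fisher's exact test would be preferred.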

Single SNP Association Analysis: Case/Control Study

• Alternatively, assume an additive model, where the heterozygote risk is approximately intermediate between those of the two homozygotes

• Form a 2x2 contingency table of allele counts. Each individual contributes two counts, one from each of the two chromosomes.

• χ² test with 1 df

Allele     Case          Control
A          G_case,A      G_control,A
a          G_case,a      G_control,a

Total      2 × N_case    2 × N_control
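
A minimal sketch of the allelic test follows, again with invented counts and hypothetical function names. Genotype counts are first converted to allele counts, so that each individual contributes two alleles, and the 1 df p-value uses the identity P(χ² > x) = erfc(√(x/2)).

```python
import math


def allele_counts(n_AA, n_Aa, n_aa):
    """Return (count of A, count of a); each individual contributes two alleles."""
    return 2 * n_AA + n_Aa, 2 * n_aa + n_Aa


def chi2_allelic_test(case_genotypes, control_genotypes):
    """Pearson chi-square test on the 2x2 allele-by-status table."""
    case_A, case_a = allele_counts(*case_genotypes)
    ctrl_A, ctrl_a = allele_counts(*control_genotypes)
    col_totals = [case_A + case_a, ctrl_A + ctrl_a]  # 2 x N_case, 2 x N_control
    n = sum(col_totals)
    stat = 0.0
    for row in [(case_A, ctrl_A), (case_a, ctrl_a)]:
        row_total = sum(row)
        for observed, col_total in zip(row, col_totals):
            expected = row_total * col_total / n
            stat += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Made-up genotype counts (AA, Aa, aa), for illustration only
stat, p_value = chi2_allelic_test((50, 30, 20), (70, 20, 10))
print(f"chi2 = {stat:.3f}, p = {p_value:.5f}")
```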

Single SNP Association Analysis: Continuous-valued Traits

• Continuous-valued traits
  – Also called quantitative traits
  – Cholesterol level, blood pressure, etc.

• For each locus, fit a linear regression using the number of minor alleles at the given locus of the individual as the covariate
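
The per-locus regression can be sketched with ordinary least squares on a single covariate. The function name and the toy data are assumptions made for illustration. The sketch returns the t-statistic of the slope rather than a p-value, since the t distribution is not in the standard library.

```python
import math


def fit_snp_regression(genotype_scores, trait_values):
    """Least-squares fit of trait = intercept + slope * (number of minor alleles).

    Returns (intercept, slope, t_statistic of the slope).
    """
    n = len(genotype_scores)
    mean_x = sum(genotype_scores) / n
    mean_y = sum(trait_values) / n
    sxx = sum((x - mean_x) ** 2 for x in genotype_scores)
    if sxx == 0:
        raise ValueError("locus is monomorphic in this sample")
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(genotype_scores, trait_values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares and standard error of the slope (n - 2 df)
    rss = sum((y - intercept - slope * x) ** 2
              for x, y in zip(genotype_scores, trait_values))
    std_err = math.sqrt(rss / (n - 2) / sxx)
    return intercept, slope, slope / std_err


# Toy data: minor-allele counts and a made-up quantitative trait
scores = [0, 1, 2, 0, 1, 2]
trait = [1.1, 1.9, 3.0, 0.9, 2.1, 3.0]
intercept, slope, t_stat = fit_snp_regression(scores, trait)
print(intercept, slope, t_stat)
```

A large absolute t-statistic for the slope suggests that the trait changes with the number of minor alleles at that locus.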

Genetic Model for Association

• Additive effect
  – Major allele homozygote: 0
  – Heterozygote: a + a × k
  – Minor allele homozygote: 2a

• k = 1: dominant effect of the minor allele

• k = 0: no dominance

• k = −1: dominant effect of the major allele
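
To see why the sign of k determines which allele is dominant, the sketch below evaluates the three genotype effects for a given a and k. The function name is hypothetical and a = 1.0 is an arbitrary choice. With k = 1 the heterozygote matches the minor-allele homozygote, and with k = −1 it matches the major-allele homozygote.

```python
def genotype_effect(minor_allele_count, a, k):
    """Effect of a genotype under the model: 0, a + a*k, 2a."""
    if minor_allele_count == 0:
        return 0.0
    if minor_allele_count == 1:
        return a + a * k
    if minor_allele_count == 2:
        return 2 * a
    raise ValueError("minor_allele_count must be 0, 1 or 2")


for k in (1, 0, -1):
    effects = [genotype_effect(g, a=1.0, k=k) for g in (0, 1, 2)]
    print(f"k = {k:2d}: effects for 0/1/2 minor alleles = {effects}")
```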

Penetrance

• Proportion of individuals carrying a particular allele that possess an associated trait

• Alleles with high penetrance are easier to detect in association analysis

Correcting for Multiple Testing

• What happens when we scan the genome of 1 million markers for association with α = 0.05?
  – 50,000 (= 1 million × 0.05) SNPs are expected to be found significant just by chance, even if no marker is truly associated
  – We need to be more conservative when we decide a given marker is significantly associated with the trait.

• Correction methods
  – Bonferroni correction
  – Permutation test

Bonferroni Correction

• If N markers are tested, we correct the significance level as α' = α/N
  – Treats the N tests as if they were independent, although this is not true because of linkage disequilibrium.
  – Overly conservative for tightly linked markers
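
The arithmetic behind the multiple-testing problem and the Bonferroni threshold can be sketched as follows. The function names and the short list of p-values are illustrative assumptions, not part of the lecture.

```python
def expected_false_positives(n_markers, alpha):
    """Expected number of markers passing alpha when no marker is associated."""
    return n_markers * alpha


def bonferroni_threshold(alpha, n_markers):
    """Per-marker significance level alpha' = alpha / N."""
    return alpha / n_markers


def bonferroni_significant(p_values, alpha=0.05):
    """Flag the p-values that survive the Bonferroni-corrected threshold."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [p <= threshold for p in p_values]


print(expected_false_positives(1_000_000, 0.05))  # about 50,000
print(bonferroni_threshold(0.05, 1_000_000))      # about 5e-8
print(bonferroni_significant([0.001, 0.02, 0.04, 0.3]))
```

In the last call the corrected threshold is 0.05 / 4 = 0.0125, so only the first p-value remains significant.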

Permutation Procedure

• Step 1: Compute the test statistic T using the original dataset

• Step 2: Set Nsig = 0

• Step 3: Repeat Nperm times
  – Step 3a: Randomly permute the individuals in the phenotype data to generate a dataset with no association (retain the original genotypes)
  – Step 3b: Find the test statistic Tperm of the SNPs using the permuted dataset
  – Step 3c: If T > Tperm, Nsig = Nsig + 1

• Step 4: Compute the p-value as (1 − Nsig/Nperm)

This approach is computationally demanding because often a large Nperm is required.
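
The steps above can be sketched for a single SNP as follows. The test statistic used here (absolute difference in mean minor-allele count between cases and controls) is a stand-in chosen for brevity, not the statistic prescribed by the lecture, and all names and data are illustrative.

```python
import random


def mean_difference(genotypes, phenotypes):
    """Stand-in statistic: |mean minor-allele count in cases - in controls|."""
    cases = [g for g, y in zip(genotypes, phenotypes) if y == 1]
    controls = [g for g, y in zip(genotypes, phenotypes) if y == 0]
    return abs(sum(cases) / len(cases) - sum(controls) / len(controls))


def permutation_p_value(genotypes, phenotypes, n_perm=1000, seed=0):
    rng = random.Random(seed)
    t_obs = mean_difference(genotypes, phenotypes)      # Step 1
    n_sig = 0                                           # Step 2
    shuffled = list(phenotypes)
    for _ in range(n_perm):                             # Step 3
        rng.shuffle(shuffled)                           # 3a: permute phenotypes only
        t_perm = mean_difference(genotypes, shuffled)   # 3b
        if t_obs > t_perm:                              # 3c
            n_sig += 1
    return 1 - n_sig / n_perm                           # Step 4


# Toy data: 10 cases carrying two minor alleles, 10 controls carrying none
genotypes = [2] * 10 + [0] * 10
phenotypes = [1] * 10 + [0] * 10
print(permutation_p_value(genotypes, phenotypes))
```

For a genome-wide scan, one common variant of step 3b records the maximum statistic over all SNPs in each permuted dataset, so that the resulting p-values account for the number of markers tested.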

Multi-marker Association Test

• Idea: a haplotype of multiple SNPs is a better proxy for a true causal SNP than a single SNP
  – Exploit the linkage disequilibrium structure in the genome

• Form a new allele by combining multiple SNPs for a haplotype

• Test the haplotype allele for association

SNP A   SNP B   Auxiliary markers for haplotypes
0       0       1   0   0   0
0       1       0   1   0   0
1       0       0   0   1   0
1       1       0   0   0   1
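
The auxiliary markers in the table are a one-hot encoding of the haplotype. A minimal sketch, assuming 0/1 allele coding and a hypothetical function name:

```python
from itertools import product


def haplotype_indicators(haplotype):
    """Map a tuple of 0/1 alleles to a one-hot vector over all 2**K haplotypes."""
    k = len(haplotype)
    # For K = 2 the order is (0,0), (0,1), (1,0), (1,1), as in the table
    all_haplotypes = list(product((0, 1), repeat=k))
    return [1 if h == tuple(haplotype) else 0 for h in all_haplotypes]


for hap in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(hap, haplotype_indicators(hap))
```

Each indicator column can then be tested for association in the same way as a single biallelic marker.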

Multi-marker Association Test

• Multi-marker approach can capture dependencies across multiple markers
  – SNPs in LD form a haplotype that can be tested as a single allele
  – Can achieve the same power with data collected for fewer samples

• Challenge as the size of haplotype increases
  – A haplotype of K SNPs results in 2^K different haplotypes, but the number of samples corresponding to each haplotype decreases quickly as we increase K
  – Large K requires a large sample size

Imputation-Based Methods (Servin & Stephens, 2007)

[Figure legend: tag SNP, non-tag SNP]

Yeast Genomic Datasets

• Yeast genomic datasets
  - Genotypes from 112 segregants from a yeast cross between BY and RM strains
  - Microarray gene-expression data
  - Transcription factor binding site data
  - Protein-protein interaction data

Analysis Procedure (Zhu et al.)

• Gene expression data analysis to infer gene coexpression network

• eQTL analysis

• Learning a predictive model for yeast gene network
  – Integrate multiple genomic data to infer gene network
    • gene expression/eQTL/TFBS/PPI data

Gene Coexpression Network

• Hierarchical clustering of genes

• Identified gene modules

• How to validate the gene modules?
  – GO enrichment analysis as a proxy

Gene Set Enrichment Analysis

• Given a set of K genes, we would like to test whether these genes share a common function.
  – KEGG pathway, GO can serve as a proxy for a common function
  – Is a particular KEGG pathway or GO term enriched in our set of K genes of interest?

• Analogy to urn model
  – Given an urn of N balls with M black balls and (N−M) white balls, we are drawing n balls without replacement. What is the probability of drawing k black balls?

Gene Set Enrichment Test

• The universe of genes: N genes
• In this universe, genes labeled as GO term A: M genes

• Suppose we have a set of n genes for which we would like to test enrichment for GO term A
  – The probability that at most k of these genes are labeled as GO term A (hypergeometric distribution):
    $P(X \le k) = \sum_{i=0}^{k} \binom{M}{i}\binom{N-M}{n-i} \Big/ \binom{N}{n}$
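
A small sketch of the urn-model calculation, using the same symbols N, M, n and k. The function names are illustrative. The upper-tail probability P(X ≥ k), which equals 1 − P(X ≤ k−1), is included because it is what is commonly reported as an enrichment p-value. Whether the lecture uses exactly this tail is an assumption here.

```python
from math import comb


def hypergeom_pmf(i, N, M, n):
    """P(exactly i of the n drawn genes carry the label), M of N labeled overall."""
    return comb(M, i) * comb(N - M, n - i) / comb(N, n)


def prob_at_most(k, N, M, n):
    """P(X <= k): probability that at most k of the n genes carry the label."""
    return sum(hypergeom_pmf(i, N, M, n) for i in range(0, min(k, n) + 1))


def enrichment_p_value(k, N, M, n):
    """P(X >= k): probability of seeing k or more labeled genes by chance."""
    return sum(hypergeom_pmf(i, N, M, n) for i in range(k, min(n, M) + 1))


# Toy numbers: universe of 10 genes, 4 labeled with the term, gene set of size 3
print(prob_at_most(0, N=10, M=4, n=3))        # 20/120
print(enrichment_p_value(3, N=10, M=4, n=3))  # 4/120
```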

Network Modules, GO Enrichment, eQTL Hotspots

eQTL Hotspots

• eQTL hotspots: pleiotropic control of multiple genes by a common genomic locus

• cis eQTL: affected genes are physically located in cis to the genomic locus

• trans eQTL: affected genes are located distantly from the eQTL

Network Modules, GO Enrichment, eQTL Hotspots

eQTL Hotspots

• No ground truth for eQTLs. How to validate the results?
  – Use results from knockout experiments, TFBS experiments as a proxy
  – Again, gene set enrichment analysis

TFBS Target Enrichment, Knock-Out Signature Enrichment

Learning Bayesian Networks: Integrating Different Genomic Data

• Incorporating more genomic data into network learning can increase the predictive power for regulators
  – Bayesian network I (BNraw)
    • Derived from gene expression data
  – Bayesian network II (BNqtl)
    • Derived from gene expression, eQTL data
  – Bayesian network III (BNfull)
    • Derived from gene expression, eQTL, TFBS (ChIP-chip experiments), PPI data

Incorporating eQTLs in Network Learning

• A two-step analysis:
  – First perform eQTL analysis
  – Incorporate the identified eQTLs in the network learning process
    • For a given eQTL, genes with cis eQTLs can be parents of genes with trans eQTLs
    • For a given eQTL, genes with trans eQTLs are not allowed to be parents of genes with cis eQTLs.
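
One way to picture the constraint is as a filter on candidate parent-to-child edges during structure learning. The sketch below is only a schematic under that assumption. The function name, the gene names and the set-based representation are all hypothetical, and it does not reproduce the network learning procedure of Zhu et al.

```python
def edge_allowed(parent, child, cis_genes, trans_genes):
    """For one eQTL, forbid edges from a trans-regulated gene to a cis-regulated gene."""
    return not (parent in trans_genes and child in cis_genes)


# Hypothetical gene sets linked to a single eQTL
cis_genes = {"GENE_A"}
trans_genes = {"GENE_B", "GENE_C"}

print(edge_allowed("GENE_A", "GENE_B", cis_genes, trans_genes))  # cis -> trans
print(edge_allowed("GENE_B", "GENE_A", cis_genes, trans_genes))  # trans -> cis
```

Edges between genes that are not linked to the same eQTL are left unconstrained by this filter.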

Computationally Identified Causal Regulators