SNPs, Haplotypes, Disease Associations Algorithmic Foundations of Computational Biology II Course 1 Prof. Sorin Istrail


Page 1: SNPs,  Haplotypes, Disease Associations

SNPs, Haplotypes, Disease Associations

Algorithmic Foundations of Computational Biology II

Course 1

Prof. Sorin Istrail

Page 2: SNPs,  Haplotypes, Disease Associations

SNPs and the Human Genome: The Minimal Informative Subset

Page 3: SNPs,  Haplotypes, Disease Associations

Overview

Introduction: SNPs, Haplotypes

A Data Compression Problem: The Minimum Informative Subset

A New Measure: Informativeness

Page 4: SNPs,  Haplotypes, Disease Associations

A Most Challenging Problem

“None of the [advances of the 20th century medicine] depend on a deep knowledge of cellular processes or on any discoveries of molecular biology.

Cancer is still treated by gross physical and chemical assaults on the offending tissue.

Cardiovascular Disease is treated by surgery whose anatomical bases go back to the 19th century … Of course, intimate knowledge of the living cell and of basic molecular processes may be useful eventually.”

Lewontin (1991)

Page 5: SNPs,  Haplotypes, Disease Associations

Now

“A decade later, molecular biology can claim very few successes for drugs in clinical use that were designed ab initio to control a specific component of a pathway linked to disease: these include the monoclonal antibody Herceptin, and the kinase inhibitor Gleevec.”

Reik, Gregory and Urnov (2002)

Page 6: SNPs,  Haplotypes, Disease Associations

Introduction

SNPs, HAPLOTYPES

Page 7: SNPs,  Haplotypes, Disease Associations

A SNP is a position in a genome at which two or more different bases occur in the population, each with a frequency >1%.

GATTTAGATCGCGATAGAG
GATTTAGATCTCGATAGAG

The most abundant type of polymorphism

The two alleles at the site are G and T

Single Nucleotide Polymorphism (SNP)
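
The definition on this slide can be turned into a small check. The sketch below is only an illustration, not a genotyping pipeline: it assumes the sequences are already aligned and of equal length, and the name `snp_sites` and the parameter `min_freq` are hypothetical. With a sample of two sequences the 1% threshold is met trivially, so a real study would need a far larger sample.

```python
from collections import Counter

def snp_sites(sequences, min_freq=0.01):
    """Return (position, allele_counts) for every aligned column in which
    at least two bases each occur with sample frequency > min_freq."""
    n = len(sequences)
    sites = []
    for pos, column in enumerate(zip(*sequences)):
        counts = Counter(column)
        common = {base: c for base, c in counts.items() if c / n > min_freq}
        if len(common) >= 2:
            sites.append((pos, common))
    return sites

# The two 19-base sequences shown on the slide.
sample = ["GATTTAGATCGCGATAGAG", "GATTTAGATCTCGATAGAG"]
sites = snp_sites(sample)
print(sites)  # one polymorphic site, at 0-based position 10, alleles G and T
```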

Page 8: SNPs,  Haplotypes, Disease Associations

tttctccatttgtcgtgacacctttgttgacaccttcatttctgcattctcaattctatttcactggtctatggcagagaacacaaaatatggccagtggcctaaatccagcctactaccttttttttttttttgtaacattttactaacatagccattcccatgtgtttccatgtgtctgggctgcttttgcactctaatggcagagttaagaaattgtagcagagaccacaatgcctcaaatatttactctacagccctttataaaaacagtgtgccaactcctgatttatgaacttatcattatgtcaataccatactgtctttattactgtagttttataagtcatgacatcagataatgtaaatcctccaactttgtttttaatcaaaagtgttttggccatcctagatatactttgtattgccacataaatttgaagatcagcctgtcagtgtctacaaaatagcatgctaggattttgatagggattgtgtagaatctatagattaattagaggagaatgactatcttgacaatactgctgcccctctgtattcgtgggggattggttccacaacaacacccaccccccactcggcaacccctgaaacccccacatcccccagcttttttcccctgctaccaaaatccatggatgctcaagtccatataaaatgccatactatttgcatataacctctgcaatcctcccctatagtttagatcatctctagattacttataatactaataaaatctaaatgctatgtaaatagttgctatactgtgttgagggttttttgttttgttttgttttatttgtttgtttgtttgtattttaagagatggtgtcttgctttgttgcccaggctggagtgcagtggtgagatcatagcttactgcagcctcaaactcctggactcaaacagtcctcccacctcagcctcccaaagtgctgggatacaggtgtgacccactgtgcccagttattattttttatttgtattattttactgttgtattatttttaattattttttctgaatattttccatctatagttggttgaatcatggatgtggaacaggcaaatatggagggctaactgtattgcatcttccagttcatgagtatgcagtctctctgtttatttaaagttttagtttttctcaaccatgtttacttttcagtatacaagactttgacgttttttgttaaatgtatttgtaagtattttattatttgtgatgttatttaaaaagaaattgttgactgggcacagtggctcacgcctgtaatcccagcactttgggaggctgaggcgggcagatcacgaggtcaggagatcaagaccatcctggctaacatggtaaaaccccgtctctactaaaaatagaaaaaaattagccaggcgtggtggcgagtgcctgtagtcccagctactcgggaggctgaggcaggagaatggtgtgaacctgggaggcggagcttgcagtgagctgagatcgtgccactgcattccagcctgcgtgacagagcgagactctgtcaaaaaaataaataaaatttaaaaaaagaagaagaaattattttcttaatttcattttcaggttttttatttatttctactatatggatacatgattgatttttgtatattgatcatgtatcctgcaaactagctaacatagtttattatttctctttttttgtggattttaaaggattttctacatagataaataaacacacataaacagttttacttctttcttttcaacctagactggatgcattttttgtttttgtttgtttgtttgctttttaacttgctgcagtgactagagaatgtattgaagaatatattgttgaacaaaagcagtgagagtggacatccctgctttccccctgattttagggggaatgttttcagtctttcactatttaatatgattttagctataggtttatcctagatccctgttatcatgttgag
gaaattcccttctatttctagtttgttgagattttttaattcatgtgattgcgctatctggctttgctctca

[Figure: the sequence above annotated with two-base labels (tc, ga, gc)]

The human genome contains ~3 Gbp (haploid), arranged in 23 chromosome pairs (46 chromosomes).

Two individuals are 99.9% the same, i.e., they differ in ~3 Mbp.

SNPs occur once every ~600 bp.

The average gene in the human genome spans ~27 kb.

~50 SNPs per gene
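
The figures on this slide follow from simple arithmetic. The snippet below is a back-of-the-envelope check using only the numbers quoted on the slide; the variable names are illustrative.

```python
genome_bp = 3e9          # ~3 Gbp
identity = 0.999         # two individuals are ~99.9% the same
snp_spacing_bp = 600     # one SNP every ~600 bp
gene_span_bp = 27_000    # the average gene spans ~27 kb

differing_bp = genome_bp * (1 - identity)      # ~3 million base pairs
snps_per_gene = gene_span_bp / snp_spacing_bp  # 45, which the slide rounds to ~50

print(round(differing_bp), snps_per_gene)
```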

Page 9: SNPs,  Haplotypes, Disease Associations

Two individuals:

G C T C G A C A A C A G
G T T C G T C A A C A G

SNP SNP SNP

Haplotypes:

C A G
T T G

Haplotype

Page 10: SNPs,  Haplotypes, Disease Associations

Mutations

Infinite Sites Assumption:

Each site mutates at most once

Page 11: SNPs,  Haplotypes, Disease Associations

Haplotype Pattern

0 0 0 0
1 1 0 1
0 0 1 0
0 1 0 1

C A G T
T T G A
C A T G
C T G T

At each SNP site, label the two alleles as 0 and 1.

The choice of which allele is 0 and which is 1 is arbitrary.
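
The labeling step can be sketched in a few lines. Because the choice of 0 and 1 is arbitrary, this sketch adopts one convention as an assumption: the first allele seen at a site is called 0. The function name `encode_haplotypes` and the three-site haplotypes are illustrative.

```python
def encode_haplotypes(haplotypes):
    """Encode each SNP column as 0/1, calling the first allele seen 0."""
    encoded_columns = []
    for column in zip(*haplotypes):
        alleles = list(dict.fromkeys(column))  # distinct alleles, first-seen order
        if len(alleles) > 2:
            raise ValueError(f"site is not biallelic: {alleles}")
        encoded_columns.append([alleles.index(a) for a in column])
    # Transpose back so that each row is one haplotype again.
    return [list(row) for row in zip(*encoded_columns)]

haps = ["CAG", "TTG", "CAT", "CTG"]
pattern = encode_haplotypes(haps)
print(pattern)  # [[0, 0, 0], [1, 1, 0], [0, 0, 1], [0, 1, 0]]
```

A site with three or more alleles is rejected rather than forced into two labels, since the 0/1 pattern is only defined for biallelic SNPs.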

Page 12: SNPs,  Haplotypes, Disease Associations

Parental chromosomes:

G T T C G A C A A C A T
A C G T A T C T A T T A

Recombinant:

G T T C G A C T A T T A

Recombination

Page 13: SNPs,  Haplotypes, Disease Associations

Parental chromosomes:

G T T C G A C A A C A T
A C G T A T C T A T T A

Recombinant:

G T T C G A C T A T T A

The two alleles are linked, i.e., they are “traveling together”.

Recombination disrupts the linkage.

Recombination

Page 14: SNPs,  Haplotypes, Disease Associations

Variations in Chromosomes Within a Population

Common Ancestor

Emergence of Variations Over Time

time present

Disease Mutation

Linkage Disequilibrium (LD)

Page 15: SNPs,  Haplotypes, Disease Associations

Time = present

2,000 gens. ago

Disease-Causing Mutation

1,000 gens. ago

Extent of Linkage Disequilibrium

Page 16: SNPs,  Haplotypes, Disease Associations

A Data Compression Problem

The Minimum Informative Subset

Page 17: SNPs,  Haplotypes, Disease Associations

A Data Compression Problem

Select SNPs to use in an association study:
Would like to associate single nucleotide polymorphisms (SNPs) with disease.

Very large number of candidate SNPs:
Chromosome-wide studies, whole-genome scans.
For cost effectiveness, select only a subset.

Closely spaced SNPs are highly correlated:
It is less likely that there has been a recombination between two SNPs if they are close to each other.

Page 18: SNPs,  Haplotypes, Disease Associations

Disease Associations

Page 19: SNPs,  Haplotypes, Disease Associations

Association studies

Disease / Responder

Control / Non-responder

[Figure legend: Marker A, Allele 0 and Allele 1]

Marker A is associated with Phenotype

Page 20: SNPs,  Haplotypes, Disease Associations

Evaluate whether nucleotide polymorphisms associate with phenotype

T A G A A
C G G A A
C G T A A
T A T C G
T G T A G
T G G A G

Association studies

Page 21: SNPs,  Haplotypes, Disease Associations

T A G A A
C G G A A
C G T A A
T A T C G
T G T A G
T G G A G

Association studies

Page 22: SNPs,  Haplotypes, Disease Associations

1 1 0 0 0
0 0 0 0 0
0 0 1 0 0
1 1 1 1 1
1 0 1 0 1
1 0 0 0 1

Association studies

Page 23: SNPs,  Haplotypes, Disease Associations

Compression based on Haplotype Resolution

Page 24: SNPs,  Haplotypes, Disease Associations

0 1 0 1 1
1 0 0 0
0 0 1 0 1
1

For a SNP s we associate a bipartite graph.

Nodes: the set of haplotypes.

Edges: the set of pairs of haplotypes with different alleles at s.

s1

s2

D-graph of a SNP

Page 25: SNPs,  Haplotypes, Disease Associations

0 1 0 1 1
1 0 0 0
0 0 1 0 1
1

For a set of SNPs S we associate a bipartite graph.

Nodes: the set of haplotypes.

Edges: the set of pairs of haplotypes with different alleles at some SNP s in S.

s1

s2

D-graph of a set of SNPs
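
The node and edge definitions on the two D-graph slides translate directly into code. The sketch below is a minimal illustration under stated assumptions: haplotypes are given as equal-length strings of 0/1 labels, SNPs are referred to by 0-based column index, and both the function name `d_graph_edges` and the three example haplotypes are hypothetical rather than taken from the slide.

```python
from itertools import combinations

def d_graph_edges(haplotypes, snps):
    """Edges of the D-graph: pairs (i, j) of haplotype indices, i < j,
    whose alleles differ at some SNP in `snps`."""
    return {
        (i, j)
        for i, j in combinations(range(len(haplotypes)), 2)
        if any(haplotypes[i][s] != haplotypes[j][s] for s in snps)
    }

H = ["010", "100", "001"]              # hypothetical haplotypes, three SNPs
single = d_graph_edges(H, [0])         # D-graph of one SNP
combined = d_graph_edges(H, [0, 2])    # D-graph of a set of SNPs
print(sorted(single), sorted(combined))
```

The edge set of a set of SNPs is the union of the edge sets of its members, which is what makes the later set-cover view possible.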

Page 26: SNPs,  Haplotypes, Disease Associations

0 1 0 1 1
1 0 0 0
0 0 1 0 1
1

Red SNP is equivalent to Blue SNP

SNP Selection

Page 27: SNPs,  Haplotypes, Disease Associations

Red SNPs predict Green SNP

0 1 0 1 1
1 0 0 0
0 0 1 0 1
1

SNP Selection

Page 28: SNPs,  Haplotypes, Disease Associations

Minimal Informative Subset

0 1 0 1 1
1 0 0 0
0 0 1 0 1
1

Data Compression

Page 29: SNPs,  Haplotypes, Disease Associations

Compression based on Haplotype Blocks

Page 30: SNPs,  Haplotypes, Disease Associations

Hypothesis – Haplotype Blocks?

The genome consists largely of blocks of common SNPs with relatively little recombination within the blocks.

Patil et al., Science, 2001; Jeffreys et al., Nature Genetics, 2001; Daly et al., Nature Genetics, 2001

Page 31: SNPs,  Haplotypes, Disease Associations

[Figure: tracks labeled sense genes, antisense genes, DNA, SNPs, and haplotype blocks; scale bar 200 kb; numbered 1 to 4]

Haplotype Block Structure: LD-Blocks and 4-Gamete Test Blocks

Page 32: SNPs,  Haplotypes, Disease Associations

Hudson and Kaplan 1985

A segment of SNPs is a block if between every pair of SNPs at most 3 out of the 4 gametes (00, 01, 10, 11) are observed.

0 0 1
0 1 1
1 1 0
1 1 1
BLOCK

0 0 1
0 1 1
1 1 0
1 0 1
VIOLATES THE BLOCK DEFINITION

Four Gamete Block Test
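
The block definition on this slide can be checked mechanically. The sketch below applies it to the slide's two examples, read as four haplotypes over three SNPs each; the function name `is_four_gamete_block` is an assumption made for illustration.

```python
from itertools import combinations

def is_four_gamete_block(haplotypes):
    """True if every pair of SNP columns shows at most 3 of the 4 gametes
    00, 01, 10, 11 across the haplotypes."""
    n_sites = len(haplotypes[0])
    for a, b in combinations(range(n_sites), 2):
        gametes = {(h[a], h[b]) for h in haplotypes}
        if len(gametes) == 4:
            return False
    return True

block = ["001", "011", "110", "111"]
not_block = ["001", "011", "110", "101"]
print(is_four_gamete_block(block))      # True
print(is_four_gamete_block(not_block))  # False: SNPs 1 and 2 show 00, 01, 11, 10
```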

Page 33: SNPs,  Haplotypes, Disease Associations

Finding Recombination Hotspots: Many Possible Partitions into Blocks

A C T A G A T A G C C T
G T T C G A C A A C A T
A C T C T A T G A T C G
G T T A T A C G A C A T
A C T C T A T A G T A T
A C T A G C T G G C A T

All four gametes are present:

Page 34: SNPs,  Haplotypes, Disease Associations

A C T A G A T A G C C T
G T T C G A C A A C A T
A C T C T A T G A T C G
G T T A T A C G A C A T
A C T C T A T A G T A T
A C T A G C T G G C A T

Find the left-most right endpoint of any constraint and mark the site before it as a recombination site.

Eliminate any constraints crossing that site.

Repeat until all constraints are gone.

The final result is a minimum-size set of sites crossing all constraints.
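
The greedy procedure described on this slide can be sketched as follows. The assumptions are mine: a constraint is a pair of sites (a, b), a < b, at which all four gametes are present, and a recombination site k means "between site k and site k + 1", so it crosses (a, b) when a <= k < b. The function names and the binary example are hypothetical.

```python
from itertools import combinations

def four_gamete_constraints(haplotypes):
    """Pairs of sites (a, b), a < b, at which all four gametes occur."""
    n_sites = len(haplotypes[0])
    return [
        (a, b)
        for a, b in combinations(range(n_sites), 2)
        if len({(h[a], h[b]) for h in haplotypes}) == 4
    ]

def greedy_breakpoints(constraints):
    """Process constraints by increasing right endpoint; whenever one is not
    yet crossed, place a breakpoint just before its right endpoint."""
    breakpoints = []
    for a, b in sorted(constraints, key=lambda c: c[1]):
        # Breakpoints are added in non-decreasing order, so only the last
        # one needs checking: it crosses (a, b) exactly when it is >= a.
        if breakpoints and breakpoints[-1] >= a:
            continue
        breakpoints.append(b - 1)
    return breakpoints

haps = ["0000", "0101", "1010", "1111"]
constraints = four_gamete_constraints(haps)
print(constraints)                      # [(0, 1), (0, 3), (1, 2), (2, 3)]
print(greedy_breakpoints(constraints))  # [0, 1, 2]
print(greedy_breakpoints([(0, 3), (1, 2), (2, 5), (4, 5)]))  # [1, 4]
```

In the last call two breakpoints suffice for four constraints, because each breakpoint is placed as far right as its constraint allows and therefore also crosses later overlapping constraints.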

Page 35: SNPs,  Haplotypes, Disease Associations

Data Compression

ACGATCGATCATGAT

GGTGATTGCATCGAT

ACGATCGGGCTTCCG

ACGATCGGCATCCCG

GGTGATTATCATGAT

A------A---TG--

G------G---CG--

A------G---TC--

A------G---CC--

G------A---TG--

Haplotype Blocks based on LD (Method of Gabriel et al. 2002)

Selecting Tagging SNPs in blocks
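
The compressed view on this slide keeps four columns of each 15-base haplotype and blanks the rest. The sketch below reproduces only that masking step and checks that the kept columns still tell the five haplotypes apart. It does not implement the block detection of Gabriel et al. or any tag selection procedure; the tag positions are read off the slide (0-based), and the function name `mask_to_tags` is hypothetical.

```python
def mask_to_tags(haplotypes, tag_positions):
    """Keep the bases at the tag positions and replace all others by '-'."""
    tags = set(tag_positions)
    return [
        "".join(base if i in tags else "-" for i, base in enumerate(h))
        for h in haplotypes
    ]

haplotypes = [
    "ACGATCGATCATGAT",
    "GGTGATTGCATCGAT",
    "ACGATCGGGCTTCCG",
    "ACGATCGGCATCCCG",
    "GGTGATTATCATGAT",
]
tag_positions = [0, 7, 11, 12]  # columns retained on the slide

masked = mask_to_tags(haplotypes, tag_positions)
for row in masked:
    print(row)

# The tags are sufficient here: all five masked haplotypes remain distinct.
print(len(set(masked)) == len(set(haplotypes)))
```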

Page 36: SNPs,  Haplotypes, Disease Associations

A New Measure

Informativeness

Page 37: SNPs,  Haplotypes, Disease Associations

Informativeness

0 1 0 0 1
0 1 1 0 0

s

h2

h1

Page 38: SNPs,  Haplotypes, Disease Associations

1 0 0 0 0
0 1 0 0 1
0 1 1 0 0
1 0 1 1 1

s1 s2 s3 s4 s5

I(s1,s2) = 2/4 = 1/2

Informativeness

Page 39: SNPs,  Haplotypes, Disease Associations

1 0 0 0 0
0 1 0 0 1
0 1 1 0 0
1 0 1 1 1

s1 s2 s3 s4 s5

I({s1,s2}, s4) = 3/4

Informativeness

Page 40: SNPs,  Haplotypes, Disease Associations

1 0 0 0 0
0 1 0 0 1
0 1 1 0 0
1 0 1 1 1

s1 s2 s3 s4 s5

I({s3,s4},{s1,s2,s5}) = 3

S = {s3,s4} is a Minimal Informative Subset

Informativeness
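
The three values quoted on the Informativeness slides can be reproduced with a short computation. The sketch rests on assumptions that the slides do not spell out: I(S, t) is taken to be the fraction of haplotype pairs distinguished by the target SNP t that are also distinguished by at least one SNP in S, and I(S, T) is the sum of I(S, t) over the targets t in T. The matrix `H` is a hypothetical 4 x 5 example chosen so that these definitions give the slides' numbers; it is not claimed to be the slides' exact column layout. SNPs s1 to s5 are columns 0 to 4.

```python
from fractions import Fraction
from itertools import combinations

def edges(H, snp):
    """Haplotype pairs (i, j), i < j, with different alleles at `snp`."""
    return {(i, j) for i, j in combinations(range(len(H)), 2)
            if H[i][snp] != H[j][snp]}

def informativeness(H, S, t):
    """Assumed definition: share of t's edges covered by the SNPs in S.
    The target t must be polymorphic, otherwise it has no edges."""
    target = edges(H, t)
    covered = set().union(*(edges(H, s) for s in S)) & target
    return Fraction(len(covered), len(target))

def informativeness_set(H, S, T):
    """Assumed definition: sum of I(S, t) over all targets t in T."""
    return sum(informativeness(H, S, t) for t in T)

# Hypothetical haplotypes (rows) over SNPs s1..s5 (columns).
H = [
    (0, 1, 0, 0, 0),
    (0, 0, 0, 1, 1),
    (0, 0, 1, 0, 1),
    (1, 1, 1, 1, 0),
]

print(informativeness(H, [0], 1))                 # I(s1, s2) = 1/2
print(informativeness(H, [0, 1], 3))              # I({s1,s2}, s4) = 3/4
print(informativeness_set(H, [2, 3], [0, 1, 4]))  # I({s3,s4}, {s1,s2,s5}) = 3
```

A value of 3 for three targets means each target is fully predicted, which is why {s3, s4} qualifies as an informative subset in this example.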

Page 41: SNPs,  Haplotypes, Disease Associations

Minimum Set Cover = Minimum Informative Subset

[Figure: bipartite graph linking SNP nodes s1, s2, s5, s3, s4 to edge nodes e1 to e6; the two sides are labeled SNPs and Edges]

1 0 0 0 0
0 1 0 0 1
0 1 1 0 0
1 0 1 1 1

s1 s2 s3 s4 s5

Graph theory insight

Informativeness

Page 42: SNPs,  Haplotypes, Disease Associations

Minimum Set Cover {s3, s4} = Minimum Informative Subset

[Figure: bipartite graph linking SNP nodes s1, s2, s5, s3, s4 to edge nodes e1 to e6; the two sides are labeled SNPs and Edges]

1 0 0 0 0
0 1 0 0 1
0 1 1 0 0
1 0 1 1 1

s1 s2 s3 s4 s5

Informativeness

Graph theory insight
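
The set-cover view can be made concrete with an exhaustive search: each SNP covers the D-graph edges it distinguishes, and a minimum informative subset is a smallest set of SNPs covering every edge. The sketch below is an assumption-laden illustration, not the algorithm used in the course: it tries subsets by increasing size, which is only feasible for a handful of SNPs because minimum set cover is NP-hard in general. The matrix `H` is a hypothetical 4 x 5 example, and SNPs s1 to s5 are columns 0 to 4.

```python
from itertools import combinations

def edges(H, snp):
    """Haplotype pairs (i, j), i < j, with different alleles at `snp`."""
    return {(i, j) for i, j in combinations(range(len(H)), 2)
            if H[i][snp] != H[j][snp]}

def minimum_informative_subsets(H):
    """All smallest SNP subsets whose edges cover every edge of the D-graph
    of the full SNP set (exhaustive search, exponential in the worst case)."""
    n_snps = len(H[0])
    all_edges = set().union(*(edges(H, s) for s in range(n_snps)))
    for size in range(n_snps + 1):
        hits = [S for S in combinations(range(n_snps), size)
                if set().union(*(edges(H, s) for s in S)) == all_edges]
        if hits:
            return hits
    return []

# Hypothetical haplotypes (rows) over SNPs s1..s5 (columns).
H = [
    (0, 1, 0, 0, 0),
    (0, 0, 0, 1, 1),
    (0, 0, 1, 0, 1),
    (1, 1, 1, 1, 0),
]

covers = minimum_informative_subsets(H)
print(covers)  # [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4)]
```

In this example the minimum is not unique: (2, 3), i.e. {s3, s4}, is one of five covers of size two.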

Page 43: SNPs,  Haplotypes, Disease Associations

Real Haplotype Data

Two different runs of the Gabriel et al. block detection method + Zhang et al. SNP selection algorithm

Our block-free algorithm

A region of Chr. 22

45 Caucasian samples

Page 44: SNPs,  Haplotypes, Disease Associations

When Maximum Likelihood = Bayesian = Parsimony

[Figure: a series of grid panels with columns numbered 1 to 30 and rows numbered 1 to 14 or 1 to 15; legend: A C G T]