SNPs, Haplotypes, Disease Associations
Algorithmic Foundations of Computational Biology II
Course 1
Prof. Sorin Istrail
SNPs and the Human Genome: The Minimal Informative Subset
Overview
Introduction: SNPs, Haplotypes
A Data Compression Problem: The Minimum Informative Subset
A New Measure: Informativeness
A Most Challenging Problem
“None of the [advances of the 20th century medicine] depend on a deep knowledge of cellular processes or on any discoveries of molecular biology.
Cancer is still treated by gross physical and chemical assaults on the offending tissue.
Cardiovascular Disease is treated by surgery whose anatomical bases go back to the 19th century … Of course, intimate knowledge of the living cell and of basic molecular processes may be useful eventually.”
Lewontin (1991)
Now
“A decade later, molecular biology can claim very few successes for drugs in clinical use that were designed ab initio to control a specific component of a pathway linked to disease: these include the monoclonal antibody Herceptin, and the kinase inhibitor Gleevec.”
Reik, Gregory and Urnov (2002)
Introduction
SNPs, HAPLOTYPES
A SNP is a position in a genome at which two or more different bases occur in the population, each with a frequency >1%.
GATTTAGATCGCGATAGAG
GATTTAGATCTCGATAGAG
The most abundant type of polymorphism
The two alleles at the site are G and T
Single Nucleotide Polymorphism (SNP)
tttctccatttgtcgtgacacctttgttgacaccttcatttctgcattctcaattctatttcactggtctatggcagagaacacaaaatatggccagtggcctaaatccagcctactaccttttttttttttttgtaacattttactaacatagccattcccatgtgtttccatgtgtctgggctgcttttgcactctaatggcagagttaagaaattgtagcagagaccacaatgcctcaaatatttactctacagccctttataaaaacagtgtgccaactcctgatttatgaacttatcattatgtcaataccatactgtctttattactgtagttttataagtcatgacatcagataatgtaaatcctccaactttgtttttaatcaaaagtgttttggccatcctagatatactttgtattgccacataaatttgaagatcagcctgtcagtgtctacaaaatagcatgctaggattttgatagggattgtgtagaatctatagattaattagaggagaatgactatcttgacaatactgctgcccctctgtattcgtgggggattggttccacaacaacacccaccccccactcggcaacccctgaaacccccacatcccccagcttttttcccctgctaccaaaatccatggatgctcaagtccatataaaatgccatactatttgcatataacctctgcaatcctcccctatagtttagatcatctctagattacttataatactaataaaatctaaatgctatgtaaatagttgctatactgtgttgagggttttttgttttgttttgttttatttgtttgtttgtttgtattttaagagatggtgtcttgctttgttgcccaggctggagtgcagtggtgagatcatagcttactgcagcctcaaactcctggactcaaacagtcctcccacctcagcctcccaaagtgctgggatacaggtgtgacccactgtgcccagttattattttttatttgtattattttactgttgtattatttttaattattttttctgaatattttccatctatagttggttgaatcatggatgtggaacaggcaaatatggagggctaactgtattgcatcttccagttcatgagtatgcagtctctctgtttatttaaagttttagtttttctcaaccatgtttacttttcagtatacaagactttgacgttttttgttaaatgtatttgtaagtattttattatttgtgatgttatttaaaaagaaattgttgactgggcacagtggctcacgcctgtaatcccagcactttgggaggctgaggcgggcagatcacgaggtcaggagatcaagaccatcctggctaacatggtaaaaccccgtctctactaaaaatagaaaaaaattagccaggcgtggtggcgagtgcctgtagtcccagctactcgggaggctgaggcaggagaatggtgtgaacctgggaggcggagcttgcagtgagctgagatcgtgccactgcattccagcctgcgtgacagagcgagactctgtcaaaaaaataaataaaatttaaaaaaagaagaagaaattattttcttaatttcattttcaggttttttatttatttctactatatggatacatgattgatttttgtatattgatcatgtatcctgcaaactagctaacatagtttattatttctctttttttgtggattttaaaggattttctacatagataaataaacacacataaacagttttacttctttcttttcaacctagactggatgcattttttgtttttgtttgtttgtttgctttttaacttgctgcagtgactagagaatgtattgaagaatatattgttgaacaaaagcagtgagagtggacatccctgctttccccctgattttagggggaatgttttcagtctttcactatttaatatgattttagctataggtttatcctagatccctgttatcatgttgag
gaaattcccttctatttctagtttgttgagattttttaattcatgtgattgcgctatctggctttgctctca
[Figure residue: two-letter annotations (tc, ga, gc) marking sites in the sequence above]
The human genome contains ~3 G base pairs per haploid set; a diploid cell carries 46 chromosomes (23 pairs).
Two individuals are 99.9% the same, i.e., they differ in ~3 M base pairs.
SNPs occur once every ~600 bp.
The average gene in the human genome spans ~27 kb.
~50 SNPs per gene
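The figures on this slide can be checked with a few lines of arithmetic. The sketch below only restates the slide's round numbers; the variable names are illustrative and not part of the original deck.

```python
# Round numbers taken from the slide; the names are illustrative.
GENOME_BP = 3_000_000_000   # ~3 G base pairs
SHARED_PER_MILLE = 999      # two individuals are ~99.9% identical
SNP_SPACING_BP = 600        # one SNP every ~600 bp
AVG_GENE_BP = 27_000        # average gene spans ~27 kb

# 0.1% of the genome differs between two individuals.
differing_bp = GENOME_BP * (1000 - SHARED_PER_MILLE) // 1000

# Expected number of SNPs falling inside an average gene.
snps_per_gene = AVG_GENE_BP / SNP_SPACING_BP

print(differing_bp)   # 3000000
print(snps_per_gene)  # 45.0
```

The quotient is 45, which the slide rounds to ~50 SNPs per gene.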
G C T C G A C A A C A G
G T T C G T C A A C A G
Two individuals
Haplotypes: C A G and T T G
SNP SNP SNP
Haplotype
Mutations
Infinite Sites Assumption:
Each site mutates at most once
Haplotype Pattern
0 0 0 0
1 1 0 1
0 0 1 0
0 1 0 1
C A G T
T T G A
C A T G
C T G T
At each SNP site, label the two alleles as 0 and 1.
The choice of which allele is 0 and which is 1 is arbitrary.
G T T C G A C T A T T A
G T T C G A C A A C A T
A C G T A T C T A T T A
Recombination
G T T C G A C T A T T A
G T T C G A C A A C A T
A C G T A T C T A T T A
The two alleles are linked, i.e., they are “traveling together”.
Recombination disrupts the linkage.
Recombination
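A single crossover can be sketched as string slicing. The function name `recombine` and the 0-based `crossover` index are assumptions made for illustration; the two parental sequences and the recombinant are the ones shown on the slide.

```python
def recombine(parent_a: str, parent_b: str, crossover: int) -> str:
    """Single crossover: sites before `crossover` come from parent_a,
    the remaining sites come from parent_b."""
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must have the same number of sites")
    if not 0 <= crossover <= len(parent_a):
        raise ValueError("crossover position out of range")
    return parent_a[:crossover] + parent_b[crossover:]

parent_a = "GTTCGACAACAT"
parent_b = "ACGTATCTATTA"

# Crossover after the 7th site reproduces the recombinant on the slide.
child = recombine(parent_a, parent_b, 7)
print(child)  # GTTCGACTATTA
```

Alleles on the same side of the crossover point stay together; alleles on opposite sides end up on different parental backgrounds.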
Variations in Chromosomes Within a Population
Common Ancestor
Emergence of Variations Over Time
time present
Disease Mutation
Linkage Disequilibrium (LD)
Time = present
2,000 gens. ago
Disease-Causing Mutation
1,000 gens. ago
Extent of Linkage Disequilibrium
A Data Compression Problem
The Minimum Informative Subset
A Data Compression Problem
Select SNPs to use in an association study
Would like to associate single nucleotide polymorphisms (SNPs) with disease.
Very large number of candidate SNPs: chromosome-wide studies, whole-genome scans. For cost effectiveness, select only a subset.
Closely spaced SNPs are highly correlated: it is less likely that there has been a recombination between two SNPs if they are close to each other.
Disease Associations
Association studies
Disease / Responder
Control / Non-responder
Allele 0, Allele 1
Marker A is associated with phenotype
Marker A:
Allele 0 =
Allele 1 =
Evaluate whether nucleotide polymorphisms associate with phenotype
T A G A A
C G G A A
C G T A A
T A T C G
T G T A G
T G G A G
Association studies
1 1 0 0 0
0 0 0 0 0
0 0 1 0 0
1 1 1 1 1
1 0 1 0 1
1 0 0 0 1
Association studies
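The step from the nucleotide matrix to the 0/1 matrix can be sketched as follows. The helper `encode` and the choice of a reference haplotype are assumptions for illustration: the slide's binary matrix is reproduced if the alleles of the second haplotype (C G G A A) are labeled 0 at every site.

```python
def encode(haplotypes, reference):
    """Label the reference allele 0 and the other allele 1 at each site.
    Assumes every site is biallelic."""
    n_sites = len(reference)
    for site in range(n_sites):
        alleles = {h[site] for h in haplotypes}
        if len(alleles) > 2:
            raise ValueError(f"site {site} has more than two alleles")
    return [[0 if h[site] == reference[site] else 1 for site in range(n_sites)]
            for h in haplotypes]

haplotypes = ["TAGAA", "CGGAA", "CGTAA", "TATCG", "TGTAG", "TGGAG"]

# Hypothetical convention: the second haplotype defines allele 0 at each site.
binary = encode(haplotypes, reference=haplotypes[1])
for row in binary:
    print(*row)
```

Since the 0/1 labeling is arbitrary, any other reference haplotype gives an equally valid encoding; only the pattern of agreement between haplotypes matters.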
Compression based on Haplotype Resolution
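[Figure: haplotype matrix, three haplotypes over five SNPs; rows as extracted: 0 1 0 1 1, 1 0 0 0, 0 0 1 0 1, with one further entry (1) displaced in extraction]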
For a SNP s we associate a bipartite graph.
Nodes: the set of haplotypes.
Edges: the set of pairs of haplotypes with different alleles at s.
s1
s2
D-graph of a SNP
For a set of SNPs S we associate a bipartite graph.
Nodes: the set of haplotypes.
Edges: the set of pairs of haplotypes with different alleles at some SNP s in S.
s1
s2
D-graph of a set of SNPs
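The two D-graph definitions can be sketched with one function, since a single SNP is the special case of a one-element set. The function name `d_graph`, the 0-based indices, and the example rows are illustrative assumptions (the rows are the 4-haplotype matrix printed on the later Informativeness slides, in extracted column order).

```python
from itertools import combinations

def d_graph(haplotypes, snps):
    """Edge set of the D-graph: pairs of haplotype indices (i, j), i < j,
    whose alleles differ at some SNP index in `snps`."""
    pairs = combinations(range(len(haplotypes)), 2)
    return {(i, j) for i, j in pairs
            if any(haplotypes[i][s] != haplotypes[j][s] for s in snps)}

# Illustrative haplotypes (rows) over five SNPs (columns).
H = ["10000", "01001", "01100", "10111"]

single = d_graph(H, [3])      # D-graph of one SNP
joint = d_graph(H, [0, 2])    # D-graph of a set of SNPs

print(sorted(single))  # [(0, 3), (1, 3), (2, 3)]
print(len(joint))      # 6
```

For a single SNP the edges always run between the haplotypes carrying allele 0 and those carrying allele 1. The D-graph of a set is the union of the D-graphs of its members.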
Red SNP is equivalent to Blue SNP
SNP Selection
Red SNPs predict Green SNP
SNP Selection
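One plausible reading of "equivalent" and "predict" in terms of D-graphs is sketched below. This interpretation is an assumption, not stated on the slides: two SNPs are treated as equivalent when they distinguish exactly the same haplotype pairs, and a set of SNPs predicts a target SNP when it distinguishes every pair the target distinguishes. The function names and example rows are illustrative.

```python
from itertools import combinations

def d_graph(haplotypes, snps):
    """Pairs of haplotype indices that differ at some SNP in `snps`."""
    pairs = combinations(range(len(haplotypes)), 2)
    return {(i, j) for i, j in pairs
            if any(haplotypes[i][s] != haplotypes[j][s] for s in snps)}

def equivalent(haplotypes, s, t):
    """Assumed criterion: identical D-graphs."""
    return d_graph(haplotypes, [s]) == d_graph(haplotypes, [t])

def predicts(haplotypes, selected, target):
    """Assumed criterion: the target's D-graph is contained in the
    D-graph of the selected set."""
    return d_graph(haplotypes, [target]) <= d_graph(haplotypes, selected)

# Illustrative haplotypes (rows) over five SNPs (columns).
H = ["10000", "01001", "01100", "10111"]

print(equivalent(H, 0, 1))     # True: the columns are complementary
print(predicts(H, [0, 2], 4))  # True
print(predicts(H, [3], 4))     # False
```

Under this reading, an equivalent SNP is redundant for selection, and a predicted SNP need not be typed once the predicting set has been typed.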
Minimal Informative Subset
Data Compression
Compression based on Haplotype Blocks
Hypothesis – Haplotype Blocks?
The genome consists largely of blocks of common SNPs with relatively little recombination within the blocks.
Patil et al., Science, 2001; Jeffreys et al., Nature Genetics, 2001; Daly et al., Nature Genetics, 2001
Sense genes
Antisense genes
200 kb
1 2 3 4
DNA
SNPs
Haplotype blocks
Haplotype Block Structure: LD-Blocks and 4-Gamete Test Blocks
Hudson and Kaplan 1985
A segment of SNPs is a block if between every pair of SNPs at most 3 out of the 4 gametes (00, 01, 10, 11) are observed.
BLOCK:
0 0 1
0 1 1
1 1 0
1 1 1
VIOLATES THE BLOCK DEFINITION:
0 0 1
0 1 1
1 1 0
1 0 1
Four Gamete Block Test
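The block definition translates directly into a pairwise check. The sketch below assumes biallelic 0/1 sites; the function name `is_block` is illustrative, and the two matrices are the ones on the slide.

```python
from itertools import combinations

def is_block(haplotypes):
    """Four-gamete test: a segment is a block if no pair of sites shows
    all four gametes 00, 01, 10, 11. Assumes biallelic 0/1 sites."""
    n_sites = len(haplotypes[0])
    for a, b in combinations(range(n_sites), 2):
        gametes = {(h[a], h[b]) for h in haplotypes}
        if len(gametes) == 4:
            return False
    return True

block = ["001", "011", "110", "111"]
violation = ["001", "011", "110", "101"]

print(is_block(block))      # True
print(is_block(violation))  # False: sites 1 and 2 show 00, 01, 11, 10
```

In the second matrix the first two columns already carry all four gametes, which under the infinite sites assumption can only be explained by a recombination between them.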
Finding Recombination Hotspots: Many Possible Partitions into Blocks
A C T A G A T A G C C T
G T T C G A C A A C A T
A C T C T A T G A T C G
G T T A T A C G A C A T
A C T C T A T A G T A T
A C T A G C T G G C A T
All four gametes are present:
A C T A G A T A G C C T
G T T C G A C A A C A T
A C T C T A T G A T C G
G T T A T A C G A C A T
A C T C T A T A G T A T
A C T A G C T G G C A T
Find the left-most right endpoint of any constraint and mark the site before it as a recombination site.
Eliminate any constraints crossing that site.
Repeat until all constraints are gone.
The final result is a minimum-size set of sites crossing all constraints.
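The greedy procedure above can be sketched as follows. The representation is an assumption made for illustration: a constraint is a pair of site indices (i, j) with i < j that shows all four gametes, and a returned value c means "recombination placed immediately before site c". The helper names are hypothetical.

```python
from itertools import combinations

def four_gamete_constraints(haplotypes):
    """Pairs of sites (i, j), i < j, at which all four gametes are observed.
    Assumes biallelic sites."""
    n_sites = len(haplotypes[0])
    return [(i, j) for i, j in combinations(range(n_sites), 2)
            if len({(h[i], h[j]) for h in haplotypes}) == 4]

def min_recombination_sites(constraints):
    """Greedy: repeatedly take the left-most right endpoint, place a cut
    just before it, and drop every constraint crossing that cut."""
    cuts = []
    remaining = sorted(constraints, key=lambda c: c[1])
    while remaining:
        cut = remaining[0][1]  # left-most right endpoint
        cuts.append(cut)
        # A constraint (i, j) crosses the cut when i < cut <= j.
        remaining = [(i, j) for i, j in remaining if not (i < cut <= j)]
    return cuts

# Hand-made constraints for illustration (not derived from the slide data).
cuts = min_recombination_sites([(0, 3), (1, 2), (2, 5), (4, 6)])
print(cuts)  # [2, 5]
```

Each chosen cut is forced by the constraint with the left-most right endpoint, and placing it as far right as that constraint allows removes as many other constraints as possible, which is why the result is minimum in size.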
Data Compression
ACGATCGATCATGAT
GGTGATTGCATCGAT
ACGATCGGGCTTCCG
ACGATCGGCATCCCG
GGTGATTATCATGAT
A------A---TG--
G------G---CG--
A------G---TC--
A------G---CC--
G------A---TG--
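The compression shown above keeps four of the fifteen columns. A minimal sketch, assuming 0-based tag positions 0, 7, 11 and 12 (read off the masked rows on the slide) and an illustrative helper `project`:

```python
def project(haplotype, tags):
    """Keep the alleles at the tag positions and mask every other site."""
    keep = set(tags)
    return "".join(c if i in keep else "-" for i, c in enumerate(haplotype))

haplotypes = [
    "ACGATCGATCATGAT",
    "GGTGATTGCATCGAT",
    "ACGATCGGGCTTCCG",
    "ACGATCGGCATCCCG",
    "GGTGATTATCATGAT",
]
tags = [0, 7, 11, 12]  # 0-based positions of the retained SNPs

masked = [project(h, tags) for h in haplotypes]
for row in masked:
    print(row)

# The four tag SNPs still tell all five haplotypes apart.
still_distinct = len(set(masked)) == len(set(haplotypes))
print(still_distinct)  # True
```

The point of the check at the end is that the projection loses no resolution on this sample: every pair of haplotypes that differed before still differs at some tag position.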
Haplotype Blocks based on LD (Method of Gabriel et al. 2002)
Selecting Tagging SNPs in blocks
A New Measure
Informativeness
Informativeness
[Figure: two haplotypes, labeled h1 and h2, with rows 0 1 0 0 1 and 0 1 1 0 0, and a marked SNP s]
1 0 0 0 0
0 1 0 0 1
0 1 1 0 0
1 0 1 1 1
s1 s2 s3 s4 s5
I(s1,s2) = 2/4 = 1/2
Informativeness
1 0 0 0 0
0 1 0 0 1
0 1 1 0 0
1 0 1 1 1
s1 s2 s3 s4 s5
I({s1,s2}, s4) = 3/4
Informativeness
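The slides give values of informativeness but not its formula. The sketch below assumes one definition consistent with those values: I(S, t) is the fraction of haplotype pairs distinguished by the target SNP t that are also distinguished by at least one SNP in S. Column indices are 0-based positions in the matrix as extracted and may not correspond to the slide's s1..s5 labels; the function names are illustrative.

```python
from fractions import Fraction
from itertools import combinations

def edges(haplotypes, snps):
    """Pairs of haplotype indices that differ at some SNP in `snps`."""
    pairs = combinations(range(len(haplotypes)), 2)
    return {(i, j) for i, j in pairs
            if any(haplotypes[i][s] != haplotypes[j][s] for s in snps)}

def informativeness(haplotypes, selected, target):
    """Assumed definition: share of the target's D-graph edges that are
    covered by the D-graph of the selected SNPs."""
    target_edges = edges(haplotypes, [target])
    if not target_edges:
        raise ValueError("target SNP does not distinguish any haplotypes")
    covered = target_edges & edges(haplotypes, selected)
    return Fraction(len(covered), len(target_edges))

# Matrix as extracted: rows are haplotypes, columns are SNPs.
H = ["10000", "01001", "01100", "10111"]

print(informativeness(H, [0], 2))     # 1/2
print(informativeness(H, [0, 3], 2))  # 3/4
```

A value of 1 means the selected SNPs distinguish every pair that the target does, so the target adds nothing for telling haplotypes apart.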
1 0 0 0 0
0 1 0 0 1
0 1 1 0 0
1 0 1 1 1
s1 s2 s3 s4 s5
I({s3,s4},{s1,s2,s5}) = 3
S={s3,s4} is a Minimal Informative Subset
Informativeness
Minimum Set Cover = Minimum Informative Subset
[Figure: bipartite graph with SNPs s1-s5 on one side and D-graph edges e1-e6 on the other]
1 0 0 0 0
0 1 0 0 1
0 1 1 0 0
1 0 1 1 1
s1 s2 s3 s4 s5
Graph theory insight
Informativeness
Minimum Set Cover {s3, s4} = Minimum Informative Subset
[Figure: bipartite graph with SNPs s1-s5 on one side and D-graph edges e1-e6 on the other]
1 0 0 0 0
0 1 0 0 1
0 1 1 0 0
1 0 1 1 1
s1 s2 s3 s4 s5
Informativeness
Graph theory insight
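The set-cover view can be sketched by exhaustive search on a small example. This is not the authors' algorithm; it is a brute-force illustration, workable only for a handful of SNPs because minimum set cover is NP-hard in general. Column indices are 0-based positions in the matrix as extracted and may not correspond to the slide's s1..s5 labels, so the subset found here need not be the one named on the slide. The function names are illustrative.

```python
from itertools import combinations

def edges(haplotypes, snps):
    """Pairs of haplotype indices that differ at some SNP in `snps`."""
    pairs = combinations(range(len(haplotypes)), 2)
    return {(i, j) for i, j in pairs
            if any(haplotypes[i][s] != haplotypes[j][s] for s in snps)}

def minimum_informative_subset(haplotypes):
    """Smallest set of SNP indices whose D-graph equals the D-graph of
    all SNPs, found by trying subsets in order of increasing size."""
    n_snps = len(haplotypes[0])
    all_edges = edges(haplotypes, range(n_snps))
    for size in range(n_snps + 1):
        for subset in combinations(range(n_snps), size):
            if edges(haplotypes, subset) == all_edges:
                return subset

# Matrix as extracted: rows are haplotypes, columns are SNPs.
H = ["10000", "01001", "01100", "10111"]

subset = minimum_informative_subset(H)
print(subset)  # (0, 2)
```

Each SNP plays the role of a set, the haplotype pairs it distinguishes are the elements it covers, and a minimum cover of all distinguishable pairs is a minimum informative subset.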
Real Haplotype Data
Two different runs of the Gabriel et al. block detection method + Zhang et al. SNP selection algorithm
Our block-free algorithm
A region of Chr. 22
45 Caucasian samples
When Maximum Likelihood = Bayesian = Parsimony
[Figure: panels over the alphabet A, C, G, T, with tick labels 1-30 along one axis and 1-15 (first panel) or 1-14 (remaining panels) along the other]