PMSB-06, Tuusula, Finland
Constrained Hidden Markov Models for Population-based Haplotyping
Niels Landwehr
Joint work with Taneli Mielikäinen, Lauri Eronen, Hannu Toivonen, Heikki Mannila
University of Freiburg / University of Helsinki
Application of Probabilistic ILP II, FP6-508861 www.aprill.org
Outline
• Population-based haplotype reconstruction
  – Infer haplotypes from genotypes: reconstruct hidden phase of genetic data
  – Important problem in biology/medicine: e.g. disease association studies
• An approach using constrained HMMs
  – Sparse Markov chains to represent conserved haplotype fragments
  – HMM model that can be learned directly from genotype data
• Experimental results
Human Genome and SNPs
Individual  DNA sequence
1  ...GATATTCGTACGGATGTTTCCA...
2  ...GATGTTCGTACTGATGTCTCCA...
3  ...GATATTCGTACGGATGTTTCCA...
4  ...GATATTCGTACGGATGTTTCCA...
5  ...GATGTTCGTACTGATGTCTCCA...
6  ...GATGTTCGTACTGATGTCTCCA...
SNPs (markers): the three positions at which the sequences differ
Haplotypes
Individual  Haplotype (alleles at the three SNPs)
1  AGT
2  GTC
3  AGT
4  AGT
5  GTC
6  GTC
Haplotypes
Individual  Haplotype (binary coding of the three SNPs)
1  101
2  010
3  101
4  101
5  010
6  010
Why Haplotypes?
• Haplotypes
  – define our genetic individuality
  – contribute to risk factors of complex diseases (e.g., diabetes)
• Disease Association Studies (Gene Mapping):
  – find genetic difference between a case and a control population
  – Identifying SNPs responsible for disease might help find a cure
• Also useful for
  – Linkage disequilibrium studies: Summarize genetic variation
  – Understanding evolution of human populations
The problem: Haplotypes not directly observable
Paternal and maternal haplotypes:
  .1...1...0...0...1.
  .0...0...0...1...1.
Observed genotype:
  {0,1} {0,1} {0} {0,1} {1}
• Wet lab: only genotype information (two alleles for each SNP, but chromosome origin is unknown)
Population-based Haplotype Reconstruction
Individual 1
  haplotype pair: 1 1 0 1 0 1 0 1
                  0 0 1 1 1 1 0 1
  genotype: {0,1} {0,1} {0,1} {1} {0,1} {1} {0} {1}
Individual 2
  haplotype pair: 0 0 1 1 1 1 0 1
                  1 1 1 1 1 1 0 1
  genotype: {0,1} {0,1} {1} {1} {1} {1} {0} {1}
Individual 3
  haplotype pair: 1 1 0 1 0 1 0 1
                  0 0 1 1 1 1 0 0
  genotype: {0,1} {0,1} {0,1} {1} {0,1} {1} {0} {0,1}
…
• Given the genotypes of several individuals, infer for every individual the most likely underlying haplotype pair
• Hidden data reconstruction problem using probabilistic model: exploit patterns in the haplotypes (linkage disequilibrium)
Haplotype Reconstruction Problem (CS Perspective)
Input: A set $G$ of genotypes $g \in \{\{0\},\{1\},\{0,1\}\}^l$
Output: A set $H$ of corresponding haplotype pairs $(h_1,h_2) \in \{0,1\}^l \times \{0,1\}^l$ such that
$$G = \{\, (\{h_1[1],h_2[1]\}, \ldots, \{h_1[l],h_2[l]\}) : (h_1,h_2) \in H \,\}$$
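The consistency condition between $H$ and $G$ can be checked mechanically. The following is a minimal sketch, not part of the original slides: the helper names `parse_genotype`, `genotype_of`, and `explains` are hypothetical, and the data are the three example individuals from the preceding slide.

```python
def parse_genotype(text):
    """Turn '01 1 0' into a tuple of allele sets: ({0,1}, {1}, {0})."""
    return tuple(frozenset(int(c) for c in token) for token in text.split())

def genotype_of(h1, h2):
    """Genotype induced by a haplotype pair: the unordered allele pair per marker."""
    return tuple(frozenset((a, b)) for a, b in zip(h1, h2))

def explains(H, G):
    """True if the haplotype pairs in H give rise to exactly the genotypes in G."""
    return {genotype_of(h1, h2) for h1, h2 in H} == set(G)

# Example individuals 1-3 (haplotype pairs and genotypes as on the slide).
H = [
    ((1, 1, 0, 1, 0, 1, 0, 1), (0, 0, 1, 1, 1, 1, 0, 1)),
    ((0, 0, 1, 1, 1, 1, 0, 1), (1, 1, 1, 1, 1, 1, 0, 1)),
    ((1, 1, 0, 1, 0, 1, 0, 1), (0, 0, 1, 1, 1, 1, 0, 0)),
]
G = [
    parse_genotype("01 01 01 1 01 1 0 1"),
    parse_genotype("01 01 1 1 1 1 0 1"),
    parse_genotype("01 01 01 1 01 1 0 01"),
]
consistent = explains(H, G)
```

Representing each marker as a `frozenset` makes the pair unordered, which is exactly the information that is lost when going from haplotypes to genotypes.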
Population-based Haplotype Reconstruction
• Given a model $M$ for the distribution of haplotypes, can infer most likely resolution:
$$(h_1,h_2) = \arg\max_{h_1,h_2} P(h_1,h_2 \mid g, M)$$
$$= \arg\max_{(h_1,h_2) \in \mathrm{match}(g)} P(h_1,h_2 \mid M)$$
$$= \arg\max_{(h_1,h_2) \in \mathrm{match}(g)} P(h_1 \mid M)\,P(h_2 \mid M) \quad \text{(Hardy-Weinberg equilibrium)}$$
• Need to estimate this model $P(h \mid M)$ from available genotype data
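For short genotypes the argmax can be evaluated by brute force, which may help make the role of match(g) concrete. This is a sketch under stated assumptions: the function names and the toy frequency table standing in for $P(h \mid M)$ are made up for illustration and do not come from the talk.

```python
from itertools import product

def match(g):
    """All ordered haplotype pairs (h1, h2) consistent with genotype g."""
    options = [[(a, b) for a in s for b in s if {a, b} == set(s)] for s in g]
    return [
        (tuple(a for a, _ in choice), tuple(b for _, b in choice))
        for choice in product(*options)
    ]

def most_likely_pair(g, hap_prob):
    """argmax over match(g) of P(h1 | M) * P(h2 | M), assuming Hardy-Weinberg equilibrium."""
    return max(match(g), key=lambda pair: hap_prob(pair[0]) * hap_prob(pair[1]))

# Hypothetical haplotype distribution used as the model M (illustration only).
freq = {(1, 1, 0): 0.5, (0, 0, 1): 0.3, (1, 0, 0): 0.1, (0, 1, 1): 0.1}
hap_prob = lambda h: freq.get(h, 0.0)

g = [{0, 1}, {0, 1}, {0, 1}]
best = most_likely_pair(g, hap_prob)
```

The enumeration grows as $2^m$ in the number $m$ of heterozygous markers, so it is only usable for tiny examples.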
Prior Work on Haplotype Reconstruction
• Competitive application domain for several years: many systems developed
  – characterized by the statistical model and learning/reconstruction algorithms employed
• Special-purpose statistical models
  – Approximate Coalescent (PHASE 2001, 2003, 2005)
  – Block-based (Gerbil 2004, 2005)
  – Variable-length MC (HaploRec 2004, 2006)
  – Founder-based (HIT 2005)
  – Local clusters (fastPHASE 2006)
Prior Work on Haplotype Reconstruction
• Special-purpose learning/reconstruction algorithms
  – MCMC variant
  – Approximate EM + partition ligation
  – …
• Our approach:
  – Model haplotypes using (sparse) Markov chains
  – Natural extension to a Hidden Markov Model on genotypes
  – Directly learnable from genotype data (standard Baum-Welch)
Constrained HMMs for haplotyping
• Modeling haplotypes
  – Standard Markov chain: $P(h) = P(h[1]) \prod_{t=2}^{n} P(h[t] \mid h[t-1])$
  – More general: order $k$ Markov chain: $P(h) = \prod_{t=1}^{n} P(h[t] \mid h[t-k],\ldots,h[t-1])$
[Figure: path for haplotype 0,1,1,0]
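The order-$k$ factorization can be written as a short loop. The sketch below assumes that histories are simply truncated at the start of the sequence and that the conditional probabilities are supplied as a function of the position; both the function name `markov_prob` and the toy parameters are illustrative assumptions, not values from the talk.

```python
def markov_prob(h, k, cond_prob):
    """P(h) under an order-k Markov chain.

    cond_prob(t, history, allele) returns P(h[t] = allele | history), where
    history holds up to k preceding alleles (fewer near the start).
    """
    p = 1.0
    for t in range(len(h)):
        history = tuple(h[max(0, t - k):t])
        p *= cond_prob(t, history, h[t])
    return p

def toy_cond_prob(t, history, allele):
    """Hypothetical order-1 parameters: uniform start, allele repeats with probability 0.8."""
    if not history:
        return 0.5
    return 0.8 if allele == history[-1] else 0.2

# The haplotype shown in the figure: 0,1,1,0.
p = markov_prob((0, 1, 1, 0), k=1, cond_prob=toy_cond_prob)
```

Passing the position `t` to `cond_prob` leaves room for marker-specific transition probabilities.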
Constrained HMMs for haplotyping
• Modeling genotypes
  – Hidden phase (order of pair): Hidden Markov Model
  – States: pairs of states of the underlying Markov chain (state of the maternal/paternal sequence)
  – Output symbol: unordered pair
  – Path in the model: sample two haplotypes, output corresponding genotype
• Have to enforce Hardy-Weinberg equilibrium
  – Parameter tying constraints on transition probabilities
• Algorithms
  – Learning: standard Baum-Welch
  – Reconstruction of most likely haplotype pair: Viterbi
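The reconstruction step can be illustrated with a small Viterbi pass over pair states. This is a simplified sketch for an order-1 chain, not the authors' implementation: the function names, the data layout of `init` and `trans`, and the numeric parameters are assumptions. The parameter tying shows up in that the pair transition is the product of two transitions of the same underlying chain.

```python
from itertools import product

def pair_states(alleles):
    """Ordered allele pairs (a, b) whose unordered pair equals the observed genotype."""
    return [(a, b) for a, b in product((0, 1), repeat=2) if {a, b} == set(alleles)]

def viterbi_pair(g, init, trans):
    """Most likely haplotype pair for genotype g.

    init[a] is P(h[0] = a); trans[t][a_prev][a] is P(h[t] = a | h[t-1] = a_prev).
    """
    # best maps a pair state to (probability of the best path ending there, the path).
    best = {s: (init[s[0]] * init[s[1]], [s]) for s in pair_states(g[0])}
    for t in range(1, len(g)):
        new_best = {}
        for a, b in pair_states(g[t]):
            # Tied parameters: both haplotypes step through the same chain independently.
            p, path = max(
                ((p_prev * trans[t][pa][a] * trans[t][pb][b], path_prev)
                 for (pa, pb), (p_prev, path_prev) in best.items()),
                key=lambda cand: cand[0],
            )
            new_best[(a, b)] = (p, path + [(a, b)])
        best = new_best
    p, path = max(best.values(), key=lambda cand: cand[0])
    h1 = tuple(a for a, _ in path)
    h2 = tuple(b for _, b in path)
    return h1, h2, p

# Hypothetical parameters: alleles persist with probability 0.9 at every marker.
T = [[0.9, 0.1], [0.1, 0.9]]
g = [{0, 1}, {1}, {0, 1}, {0}]
h1, h2, p = viterbi_pair(g, init=[0.5, 0.5], trans=[None] + [T] * 3)
```

Restricting the states at each marker to those compatible with the genotype plays the role of the emission constraint; a missing genotype could be handled by allowing all four pair states at that marker.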
Constrained HMMs for haplotyping
[Figure: example paths for genotype {0,1},{1},{0,1},{0}]
Sparse Markov Modeling (SpaMM)
• Higher-order models (long history) needed: exponential size of model
• However, out of the $2^k$ possible history blocks, only few occur in data (conserved fragments)
• Idea: Sparse model, iterative structure learning algorithm to identify conserved fragments (Apriori-style)
    $i := 1$
    $\lambda_i :=$ initialize-first-order-model()
    $\lambda_i :=$ em-training($\lambda_i$)
    repeat
        $i := i + 1$
        $\lambda_i :=$ regularize-and-extend($\lambda_{i-1}$)
        $\lambda_i :=$ em-training($\lambda_i$)
    until $i = k$
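The level-wise ("Apriori-style") idea can be illustrated on a much simpler task than the one SpaMM solves. The sketch below is a substitute technique, not the SpaMM algorithm: it works on already-phased haplotypes, counts fragments directly instead of running EM on genotypes, and uses a hypothetical `min_count` threshold in place of the regularization step. All names are illustrative.

```python
from collections import Counter

def frequent_fragments(haplotypes, max_len, min_count):
    """Level-wise search for fragments (start, alleles) seen at least min_count times.

    Fragments of length L are only counted if their length L-1 prefix survived
    pruning at the previous level, so rare histories are never extended.
    """
    n = len(haplotypes[0])
    counts = Counter((s, (h[s],)) for h in haplotypes for s in range(n))
    level = {frag for frag, c in counts.items() if c >= min_count}
    levels = [level]
    for length in range(2, max_len + 1):
        counts = Counter()
        for h in haplotypes:
            for s in range(n - length + 1):
                prefix = (s, tuple(h[s:s + length - 1]))
                if prefix in level:  # extend only fragments kept so far
                    counts[(s, tuple(h[s:s + length]))] += 1
        level = {frag for frag, c in counts.items() if c >= min_count}
        levels.append(level)
    return levels

# Toy data: two conserved haplotypes plus one rare one.
haps = [(1, 0, 1, 0)] * 3 + [(0, 1, 0, 1)] * 3 + [(1, 1, 1, 1)]
levels = frequent_fragments(haps, max_len=3, min_count=2)
sizes = [len(level) for level in levels]
```

In this toy data only 4 of the 16 possible length-3 fragments survive, which mirrors the point that few of the $2^k$ history blocks occur in practice.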
SpaMM Model (order 1)
[Figure: model structure at order 1]
• Initial model: standard Markov chain of order 1
• Iteration: extend order of model by 1, prune unlikely parts
• Avoids combinatorial explosion of model size
SpaMM Model (order 2)
[Figure: model structure at order 2]
SpaMM Model (order 3)
[Figure: model structure at order 3]
SpaMM Model (order 4)
[Figure: model structure at order 4]
SpaMM Model (order 5)
[Figure: model structure at order 5]
SpaMM Model (order 6)
[Figure: model structure at order 6]
SpaMM Model (final)
[Figure: final model structure]
• Final model: Model structure encodes conserved fragments
• Concise representation of all haplotypes with non-zero probability
Experimental Evaluation
• Real world population data
  – Correct haplotypes have been inferred from trios
  – Daly dataset: 103 SNP markers for 174 individuals
  – Yoruba population: 100 datasets, 500 SNP markers each, 60 individuals
• Problem Setting:
  – Given the set of genotypes, algorithm outputs most likely haplotype pairs
  – Difference to real haplotype pairs is measured in switch distance (# recombinations needed to transform pairs, normalized)
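A common way to compute a switch distance is to look only at heterozygous markers and count how often the predicted phase flips relative to the true phase. The sketch below follows that reading; the normalization by the number of heterozygous markers minus one is an assumption, since the slide only says "normalized", and the function name is hypothetical.

```python
def switch_distance(true_pair, pred_pair):
    """Fraction of adjacent heterozygous markers at which the predicted phase flips.

    Both pairs are assumed to be consistent with the same genotype.
    """
    t1, t2 = true_pair
    p1, _ = pred_pair
    het = [i for i in range(len(t1)) if t1[i] != t2[i]]
    if len(het) < 2:
        return 0.0  # fewer than two heterozygous markers: phase is trivially correct
    # True where the first predicted haplotype follows the first true haplotype.
    phase = [p1[i] == t1[i] for i in het]
    switches = sum(a != b for a, b in zip(phase, phase[1:]))
    return switches / (len(het) - 1)

true_pair = ((1, 1, 0, 1, 0, 1, 0, 1), (0, 0, 1, 1, 1, 1, 0, 1))
pred_pair = ((1, 1, 1, 1, 1, 1, 0, 1), (0, 0, 0, 1, 0, 1, 0, 1))
d = switch_distance(true_pair, pred_pair)
```

Because only phase changes are counted, swapping the two haplotypes of a pair leaves the distance unchanged.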
Results: Haplotype Reconstruction
[Figure: bar chart (y-axis 0 to 0.07) on Yoruba-20, Yoruba-500, and Daly for PHASE (2001, 2003, 2005), fastPHASE (2006), SpaMM, HaploRec (2004, 2006), HIT (2005), Gerbil (2004, 2005)]
• Many well-engineered systems
  – Smart priors, averaging over several random restarts of EM, ...
  – SpaMM: proof-of-concept implementation, not tuned
Results: Haplotype Reconstruction
[Figure: bar chart (y-axis 0 to 0.07) on Yoruba-20, Yoruba-500, and Daly for PHASE, fastPHASE, SpaMM, HaploRec, HIT, Gerbil]
• PHASE most accurate, then fastPHASE, then SpaMM
  – however, PHASE too slow for long maps
  – SpaMM beats fastPHASE without averaging
  – overall, competitive accuracy
Results: Runtime
[Figure: bar chart (y-axis 0 to 5) for 20, 100, and 500 markers for PHASE (2001, 2003, 2005), fastPHASE (2006), SpaMM, HaploRec (2004, 2006), HIT (2005), Gerbil (2004, 2005)]
• Runtime in seconds for phasing 100 markers (log. scale)
• SpaMM scales linearly in #markers
  – like fastPHASE, HaploRec, HIT
  – unlike PHASE, Gerbil
Results: Genotype imputation
[Figure: bar chart (y-axis 0 to 0.12) at 10%, 20%, and 30% missing for fastPHASE (2006), SpaMM, HIT (2005), Gerbil (2004, 2005)]
• Most haplotyping methods can also predict missing genotype values
• for SpaMM, can be read off Viterbi path
Results: Genotype imputation
[Figure: bar chart (y-axis 0 to 0.12) at 10%, 20%, and 30% missing for fastPHASE, SpaMM, HIT, Gerbil]
• fastPHASE best known method
  – Again, SpaMM beats fastPHASE without averaging
Conclusions
• SpaMM: new haplotyping method
  – sparse Markov chains to encode conserved haplotype fragments
  – Constrained HMM for modeling genotypes
  – Apriori-style structure learning algorithm
  – Simple, accurate, interpretable output
• Future work
  – Accuracy can probably be improved using standard techniques (EM random restarts, averaging, ...)