Constrained Hidden Markov Models for Population-based Haplotyping
Niels Landwehr. Joint work with Taneli Mielikäinen, Lauri Eronen, Hannu Toivonen, Heikki Mannila
University of Freiburg / University of Helsinki
PMSB-06, Tuusula, Finland
Application of Probabilistic ILP II, FP6-508861, www.aprill.org


Page 1: Constrained Hidden Markov Models for Population-based Haplotyping

PMSB-06, Tuusula, Finland

Constrained Hidden Markov Models for Population-based Haplotyping

Niels Landwehr
Joint work with Taneli Mielikäinen, Lauri Eronen, Hannu Toivonen, Heikki Mannila

University of Freiburg / University of Helsinki

Application of Probabilistic ILP II, FP6-508861 www.aprill.org

Page 2: Constrained Hidden Markov Models for Population-based Haplotyping

Outline

• Population-based haplotype reconstruction
  – Infer haplotypes from genotypes: reconstruct the hidden phase of genetic data
  – Important problem in biology/medicine, e.g. disease association studies

• An approach using constrained HMMs
  – Sparse Markov chains to represent conserved haplotype fragments
  – HMM that can be learned directly from genotype data

• Experimental results

Page 3: Constrained Hidden Markov Models for Population-based Haplotyping

Human Genome and SNPs

DNA sequence for individuals 1 to 6. The three positions at which the sequences differ are the SNPs (markers).

1  ...GATATTCGTACGGATGTTTCCA...
2  ...GATGTTCGTACTGATGTCTCCA...
3  ...GATATTCGTACGGATGTTTCCA...
4  ...GATATTCGTACGGATGTTTCCA...
5  ...GATGTTCGTACTGATGTCTCCA...
6  ...GATGTTCGTACTGATGTCTCCA...

Page 4: Constrained Hidden Markov Models for Population-based Haplotyping

Haplotypes

Alleles at the three SNPs of the DNA sequence, for individuals 1 to 6:

     SNP  SNP  SNP
1     A    G    T
2     G    T    C
3     A    G    T
4     A    G    T
5     G    T    C
6     G    T    C

Page 5: Constrained Hidden Markov Models for Population-based Haplotyping

Haplotypes

The same haplotypes with the alleles at each SNP encoded as 0/1, for individuals 1 to 6:

     SNP  SNP  SNP
1     1    0    1
2     0    1    0
3     1    0    1
4     1    0    1
5     0    1    0
6     0    1    0

Page 6: Constrained Hidden Markov Models for Population-based Haplotyping

Why Haplotypes?

• Haplotypes
  – define our genetic individuality
  – contribute to risk factors of complex diseases (e.g., diabetes)

• Disease Association Studies (Gene Mapping):
  – find genetic difference between a case and a control population
  – identifying SNPs responsible for disease might help find a cure

• Also useful for
  – Linkage disequilibrium studies: summarize genetic variation
  – Understanding evolution of human populations

Page 7: Constrained Hidden Markov Models for Population-based Haplotyping

The problem: Haplotypes not directly observable

Paternal:  .1...1...0...0...1.
Maternal:  .0...0...0...1...1.
Genotype:  {0,1} {0,1} {0} {0,1} {1}

• Wet lab: only genotype information (two alleles for each SNP, but chromosome origin is unknown)
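
The loss of phase information can be made concrete with a minimal Python sketch. The function name `genotype` and the list encoding are illustrative assumptions, not part of the talk; the two haplotypes are the paternal and maternal sequences from this slide.

```python
def genotype(h1, h2):
    """Return the genotype of a haplotype pair: one unordered allele set per SNP."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must have the same length")
    return [frozenset((a, b)) for a, b in zip(h1, h2)]


paternal = [1, 1, 0, 0, 1]
maternal = [0, 0, 0, 1, 1]
g = genotype(paternal, maternal)

# Swapping the two haplotypes gives the same genotype: the order is not observed.
same_when_swapped = g == genotype(maternal, paternal)

# A different haplotype pair can produce exactly the same genotype,
# which is why the phase has to be inferred.
other_pair_same_genotype = genotype([1, 0, 0, 1, 1], [0, 1, 0, 0, 1]) == g
```

Both comparisons evaluate to true here, so the genotype alone does not determine the haplotype pair.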

Page 8: Constrained Hidden Markov Models for Population-based Haplotyping

Population-based Haplotype Reconstruction

               haplotype pair      genotype
Individual 1   1 1 0 1 0 1 0 1     {0,1}{0,1}{0,1}{1}{0,1}{1}{0}{1}
               0 0 1 1 1 1 0 1
Individual 2   0 0 1 1 1 1 0 1     {0,1}{0,1}{1}{1}{1}{1}{0}{1}
               1 1 1 1 1 1 0 1
Individual 3   1 1 0 1 0 1 0 1     {0,1}{0,1}{0,1}{1}{0,1}{1}{0}{0,1}
               0 0 1 1 1 1 0 0
…

• Given the genotypes of several individuals, infer for every individual the most likely underlying haplotype pair

• Hidden data reconstruction problem using probabilistic model: exploit patterns in the haplotypes (linkage disequilibrium)

Page 9: Constrained Hidden Markov Models for Population-based Haplotyping

Haplotype Reconstruction Problem (CS Perspective)

Input: A set $G$ of genotypes

$$g \in \{\{0\},\{1\},\{0,1\}\}^{l}$$

Output: A set $H$ of corresponding haplotype pairs

$$(h_1,h_2) \in \{0,1\}^{l} \times \{0,1\}^{l}$$

such that

$$G = \bigl\{ \bigl(\{h_1[1],h_2[1]\},\dots,\{h_1[l],h_2[l]\}\bigr) : (h_1,h_2) \in H \bigr\}$$

Page 10: Constrained Hidden Markov Models for Population-based Haplotyping

Population-based Haplotype Reconstruction

• Given a model $M$ for the distribution of haplotypes, can infer most likely resolution:

$$
\begin{aligned}
(h_1,h_2) &= \arg\max_{h_1,h_2} P(h_1,h_2 \mid g, M) \\
          &= \arg\max_{(h_1,h_2)\in \mathrm{match}(g)} P(h_1,h_2 \mid M) \\
          &= \arg\max_{(h_1,h_2)\in \mathrm{match}(g)} P(h_1 \mid M)\,P(h_2 \mid M) \quad \text{(Hardy-Weinberg equilibrium)}
\end{aligned}
$$

• Need to estimate this model, $P(h \mid M)$, from available genotype data
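
As a rough illustration of the argmax over match(g), the following Python sketch enumerates all haplotype pairs consistent with a genotype and scores them with P(h1 | M) P(h2 | M). The helper names (`match`, `most_likely_pair`) and the toy haplotype distribution `p` are assumptions made for this example; the brute-force enumeration is exponential in the number of heterozygous markers and only meant for very short genotypes.

```python
from itertools import product


def match(g):
    """All unordered haplotype pairs (h1, h2) consistent with genotype g.

    g is a sequence of allele sets, e.g. [{0, 1}, {1}, {0}].
    """
    het = [i for i, alleles in enumerate(g) if len(alleles) == 2]
    base = [min(alleles) for alleles in g]  # homozygous markers are fixed
    pairs = []
    # The phase of the first heterozygous marker is fixed so that
    # (h1, h2) and (h2, h1) are not both listed.
    for bits in product((0, 1), repeat=max(len(het) - 1, 0)):
        phases = (0,) + bits if het else ()
        h1, h2 = list(base), list(base)
        for i, b in zip(het, phases):
            h1[i], h2[i] = b, 1 - b
        pairs.append((tuple(h1), tuple(h2)))
    return pairs


def most_likely_pair(g, p):
    """argmax over match(g) of P(h1 | M) * P(h2 | M); p maps haplotype -> probability."""
    return max(match(g), key=lambda pair: p.get(pair[0], 0.0) * p.get(pair[1], 0.0))


# Hypothetical haplotype distribution standing in for the model M.
p = {(0, 0, 1): 0.5, (1, 1, 1): 0.3, (0, 1, 1): 0.1, (1, 0, 1): 0.1}
g = [{0, 1}, {0, 1}, {1}]
best = most_likely_pair(g, p)
```

With these toy numbers the pair (0,0,1) / (1,1,1) scores 0.5 · 0.3 and beats (0,1,1) / (1,0,1), which scores 0.1 · 0.1.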

Page 11: Constrained Hidden Markov Models for Population-based Haplotyping

Prior Work on Haplotype Reconstruction

• Competitive application domain for several years: many systems developed
  – characterized by the statistical model and learning/reconstruction algorithms employed

• Special-purpose statistical models
  – Approximate Coalescent (PHASE 2001, 2003, 2005)
  – Block-based (Gerbil 2004, 2005)
  – Variable-length MC (HaploRec 2004, 2006)
  – Founder-based (HIT 2005)
  – Local clusters (fastPHASE 2006)

Page 12: Constrained Hidden Markov Models for Population-based Haplotyping

Prior Work on Haplotype Reconstruction

• Special-purpose learning/reconstruction algorithms
  – MCMC variant
  – Approximate EM + partition ligation
  – …

• Our approach:
  – Model haplotypes using (sparse) Markov chains
  – Natural extension to a Hidden Markov Model on genotypes
  – Directly learnable from genotype data (standard Baum-Welch)

Page 13: Constrained Hidden Markov Models for Population-based Haplotyping

Constrained HMMs for haplotyping

• Modeling haplotypes

– Standard Markov chain

$$P(h) = P(h[1]) \prod_{t=2}^{n} P(h[t] \mid h[t-1])$$

– More general: order $k$ Markov chain

$$P(h) = \prod_{t=1}^{n} P(h[t] \mid h[t-k],\dots,h[t-1])$$

[Figure: path for haplotype 0,1,1,0]
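
The first-order factorization can be sketched in a few lines of Python. This is only an illustration: the function name, the position-specific transition tables and all probability values are made-up assumptions, and the haplotype 0,1,1,0 is the one named in the figure caption.

```python
def chain_probability(h, initial, transition):
    """P(h) = P(h[1]) * product over t >= 2 of P(h[t] | h[t-1]).

    initial[a]          : probability that the first marker has allele a
    transition[t][a][b] : probability of allele b at marker t+2 given
                          allele a at marker t+1 (one table per adjacent pair)
    """
    prob = initial[h[0]]
    for t in range(1, len(h)):
        prob *= transition[t - 1][h[t - 1]][h[t]]
    return prob


# Hypothetical parameters for a chain over four markers.
initial = [0.5, 0.5]
transition = [
    [[0.25, 0.75], [0.5, 0.5]],   # marker 1 -> marker 2
    [[0.5, 0.5], [0.25, 0.75]],   # marker 2 -> marker 3
    [[0.5, 0.5], [0.75, 0.25]],   # marker 3 -> marker 4
]
p = chain_probability([0, 1, 1, 0], initial, transition)  # 0.5 * 0.75 * 0.75 * 0.75
```

An order-k chain would condition each factor on the previous k alleles instead of one, at the cost of up to 2^k history blocks per marker.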

Page 14: Constrained Hidden Markov Models for Population-based Haplotyping

Constrained HMMs for haplotyping

• Modeling genotypes
  – Hidden phase (order of pair): Hidden Markov Model
  – States: pairs of states of the underlying Markov chain (state of the maternal/paternal sequence)
  – Output symbol: unordered pair
  – Path in the model: sample two haplotypes, output corresponding genotype
• Have to enforce Hardy-Weinberg equilibrium
  – Parameter tying constraints on transition probabilities
• Algorithms
  – Learning: standard Baum-Welch
  – Reconstruction of most likely haplotype pair: Viterbi
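
A minimal Python sketch of the reconstruction step, assuming a first-order chain with known parameters (learning with Baum-Welch is not shown). The pair-state transition is taken to be the product of two transitions of the same chain, which is one simple way to express the parameter tying; the function name and all numbers are hypothetical.

```python
from itertools import product


def viterbi_pair(g, initial, transition):
    """Most likely ordered haplotype pair for genotype g under a pair-state HMM.

    Hidden state at each marker: (a, b), the alleles of the two haplotypes.
    Tied transitions: P((a, b) -> (c, d)) = P(a -> c) * P(b -> d), with both
    factors taken from the same first-order chain.
    """
    states = list(product((0, 1), repeat=2))

    def allowed(t):
        # Deterministic emission: state (a, b) can only emit the allele set {a, b}.
        return [s for s in states if set(s) == set(g[t])]

    # best[s] = (probability of the best path ending in state s, that path)
    best = {s: (initial[s[0]] * initial[s[1]], [s]) for s in allowed(0)}
    for t in range(1, len(g)):
        trans = transition[t - 1]
        new_best = {}
        for s in allowed(t):
            new_best[s] = max(
                ((p * trans[r[0]][s[0]] * trans[r[1]][s[1]], path + [s])
                 for r, (p, path) in best.items()),
                key=lambda cand: cand[0],
            )
        best = new_best
    prob, path = max(best.values(), key=lambda cand: cand[0])
    return [a for a, _ in path], [b for _, b in path], prob


# Hypothetical chain parameters: alleles at markers 1 and 2 tend to agree.
initial = [0.5, 0.5]
transition = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.4, 0.6]],
    [[0.7, 0.3], [0.6, 0.4]],
]
g = [{0, 1}, {0, 1}, {1}, {0}]
h1, h2, prob = viterbi_pair(g, initial, transition)
```

The two orderings of a pair always have equal probability, so which haplotype is returned as `h1` is arbitrary. Note also that a first-order chain carries no phase information across a homozygous marker, which is one reason longer histories are useful.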

Page 15: Constrained Hidden Markov Models for Population-based Haplotyping

Constrained HMMs for haplotyping

• Example: paths for genotype {0,1},{1},{0,1},{0}

[Figure: paths through the model for genotype {0,1},{1},{0,1},{0}]

Page 16: Constrained Hidden Markov Models for Population-based Haplotyping

Sparse Markov Modeling (SpaMM)

• Higher-order models (long history) needed: exponential size of model
• However, out of the $2^k$ possible history blocks, only few occur in data (conserved fragments)
• Idea: Sparse model, iterative structure learning algorithm to identify conserved fragments (Apriori-style)

i := 1
λ_i := initialize-first-order-model()
λ_i := em-training(λ_i)
repeat
    i := i + 1
    λ_i := regularize-and-extend(λ_{i−1})
    λ_i := em-training(λ_i)
until i = k
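
The level-wise idea can be illustrated with a simplified Python sketch. It is not the SpaMM algorithm itself: frequency counting on already-phased haplotypes stands in for EM training on genotypes, and `min_count` is a hypothetical pruning threshold. What it does show is the Apriori-style pattern of extending the order by one and keeping only fragments whose shorter prefix survived pruning.

```python
from collections import Counter


def conserved_fragments(haplotypes, k, min_count):
    """Level-wise (Apriori-style) search for frequent haplotype fragments.

    Returns {order: {(start, fragment): count}} for fragment lengths 1..k.
    A fragment of length i is only counted if its prefix of length i-1,
    starting at the same marker, survived pruning at the previous level.
    """
    n = len(haplotypes[0])
    # Order 1: every allele at every marker.
    counts = Counter((t, (h[t],)) for h in haplotypes for t in range(n))
    result = {1: {f: c for f, c in counts.items() if c >= min_count}}
    for i in range(2, k + 1):
        counts = Counter()
        for h in haplotypes:
            for t in range(n - i + 1):
                frag = tuple(h[t:t + i])
                if (t, frag[:-1]) in result[i - 1]:  # extend survivors only
                    counts[(t, frag)] += 1
        # Prune fragments that are too rare to be considered conserved.
        result[i] = {f: c for f, c in counts.items() if c >= min_count}
    return result


haplotypes = [(1, 0, 1), (1, 0, 1), (0, 1, 0), (0, 1, 0), (1, 1, 0)]
fragments = conserved_fragments(haplotypes, k=3, min_count=2)
```

In this toy data set only two of the eight possible length-3 fragments survive, so the number of histories that have to be represented stays far below 2^k.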

Page 17: Constrained Hidden Markov Models for Population-based Haplotyping

SpaMM Model (order 1)

• Initial model: standard Markov chain of order 1
• Iteration: extend order of model by 1, prune unlikely parts
• Avoids combinatorial explosion of model size

[Figure: SpaMM model structure at order 1]

Page 18: Constrained Hidden Markov Models for Population-based Haplotyping

SpaMM Model (order 2)

• Iteration: extend order of model by 1, prune unlikely paths
• Avoids combinatorial explosion of model size

[Figure: SpaMM model structure at order 2]

Page 19: Constrained Hidden Markov Models for Population-based Haplotyping

SpaMM Model (order 3)

• Iteration: extend order of model by 1, prune unlikely paths
• Avoids combinatorial explosion of model size

[Figure: SpaMM model structure at order 3]

Page 20: Constrained Hidden Markov Models for Population-based Haplotyping

SpaMM Model (order 4)

• Iteration: extend order of model by 1, prune unlikely paths
• Avoids combinatorial explosion of model size

[Figure: SpaMM model structure at order 4]

Page 21: Constrained Hidden Markov Models for Population-based Haplotyping

SpaMM Model (order 5)

• Iteration: extend order of model by 1, prune unlikely paths
• Avoids combinatorial explosion of model size

[Figure: SpaMM model structure at order 5]

Page 22: Constrained Hidden Markov Models for Population-based Haplotyping

SpaMM Model (order 6)

• Iteration: extend order of model by 1, prune unlikely paths
• Avoids combinatorial explosion of model size

[Figure: SpaMM model structure at order 6]

Page 23: Constrained Hidden Markov Models for Population-based Haplotyping

SpaMM Model (final)

• Final model: Model structure encodes conserved fragments
• Concise representation of all haplotypes with non-zero probability

[Figure: final SpaMM model structure]

Page 24: Constrained Hidden Markov Models for Population-based Haplotyping

Experimental Evaluation

• Real world population data
  – Correct haplotypes have been inferred from trios
  – Daly dataset: 103 SNP markers for 174 individuals
  – Yoruba population: 100 datasets, 500 SNP markers each, 60 individuals

• Problem Setting:
  – Given the set of genotypes, algorithm outputs most likely haplotype pairs
  – Difference to real haplotype pairs is measured in switch distance (# recombinations needed to transform pairs, normalized)
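
One way to compute a switch distance is sketched below in Python. The slide does not state the normalization; dividing by the number of heterozygous markers minus one (the maximum possible number of switches) is an assumption, as is the function name. The true pair is Individual 1 from the reconstruction example earlier in the talk, and the inferred pair is made up.

```python
def switch_distance(true_pair, inferred_pair):
    """Phase switches needed to turn the inferred pair into the true pair.

    Only heterozygous markers carry phase information. Returns
    (switches, normalized), where normalized divides by the number of
    heterozygous markers minus one (an assumed normalization).
    """
    t1, t2 = true_pair
    p1, p2 = inferred_pair
    het = [i for i in range(len(t1)) if t1[i] != t2[i]]
    for i in het:
        if {p1[i], p2[i]} != {t1[i], t2[i]}:
            raise ValueError("inferred pair does not match the genotype")
    # phase[j] is True where the first inferred haplotype follows the first true one.
    phase = [p1[i] == t1[i] for i in het]
    switches = sum(a != b for a, b in zip(phase, phase[1:]))
    denom = len(het) - 1
    return switches, (switches / denom if denom > 0 else 0.0)


true_pair = ([1, 1, 0, 1, 0, 1, 0, 1], [0, 0, 1, 1, 1, 1, 0, 1])
inferred_pair = ([1, 1, 1, 1, 1, 1, 0, 1], [0, 0, 0, 1, 0, 1, 0, 1])
switches, normalized = switch_distance(true_pair, inferred_pair)
```

Here the inferred pair agrees with the truth on the first two heterozygous markers and is flipped on the remaining two, so a single switch repairs it.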

Page 25: Constrained Hidden Markov Models for Population-based Haplotyping

Results: Haplotype Reconstruction

[Figure: bar chart (vertical axis from 0 to 0.07) on Yoruba-20, Yoruba-500 and Daly for PHASE (2001, 2003, 2005), fastPHASE (2006), SpaMM, HaploRec (2004, 2006), HIT (2005) and Gerbil (2004, 2005)]

• Many well-engineered systems
  – Smart priors, averaging over several random restarts of EM, ...
  – SpaMM: proof-of-concept implementation, not tuned

Page 26: Constrained Hidden Markov Models for Population-based Haplotyping

Results: Haplotype Reconstruction

[Figure: bar chart (vertical axis from 0 to 0.07) on Yoruba-20, Yoruba-500 and Daly for PHASE (2001, 2003, 2005), fastPHASE (2006), SpaMM, HaploRec (2004, 2006), HIT (2005) and Gerbil (2004, 2005)]

• PHASE most accurate, then fastPHASE, then SpaMM
  – however, PHASE too slow for long maps
  – SpaMM beats fastPHASE without averaging
  – overall, competitive accuracy

Page 27: Constrained Hidden Markov Models for Population-based Haplotyping

Results: Runtime

[Figure: bar chart (vertical axis from 0 to 5) for 20, 100 and 500 markers for PHASE (2001, 2003, 2005), fastPHASE (2006), SpaMM, HaploRec (2004, 2006), HIT (2005) and Gerbil (2004, 2005)]

• Runtime in seconds for phasing 100 markers (log. scale)
• SpaMM scales linearly in #markers
  – like fastPHASE, HaploRec, HIT
  – unlike PHASE, Gerbil

Page 28: Constrained Hidden Markov Models for Population-based Haplotyping

Results: Genotype imputation

[Figure: bar chart (vertical axis from 0 to 0.12) for 10%, 20% and 30% missing genotype values for fastPHASE (2006), SpaMM, HIT (2005) and Gerbil (2004, 2005)]

• Most haplotyping methods can also predict missing genotype values
• For SpaMM, can be read off Viterbi path

Page 29: Constrained Hidden Markov Models for Population-based Haplotyping

Results: Genotype imputation

[Figure: bar chart (vertical axis from 0 to 0.12) for 10%, 20% and 30% missing genotype values for fastPHASE (2006), SpaMM, HIT (2005) and Gerbil (2004, 2005)]

• fastPHASE best known method
  – Again, SpaMM beats fastPHASE without averaging

Page 30: Constrained Hidden Markov Models for Population-based Haplotyping

Conclusions

• SpaMM: new haplotyping method
  – Sparse Markov chains to encode conserved haplotype fragments
  – Constrained HMM for modeling genotypes
  – Apriori-style structure learning algorithm
  – Simple, accurate, interpretable output

• Future work
  – Accuracy can probably be improved using standard techniques (EM random restarts, averaging, ...)

Page 31: Constrained Hidden Markov Models for Population-based Haplotyping


Thanks!