A Mathematically Rigorous Algorithm for Haplotype Phasing
Approaching the Long-Range Phasing Problem Using Variable-Memory Markov Chains
Samuel Angelo Crisanto
2015 Undergraduate Research Symposium, Brown University
The Issue at Hand
Modern genome assembly techniques fragment human DNA into pieces short enough to be sequenced by current technology.
The reads are then algorithmically reassembled using overlaps between the fragments.
Humans are diploid: they carry one copy of each chromosome from the mother and one from the father. The assembly process destroys the information about which fragment came from which chromosome.
Relevance
Accurate haplotype phasing is an important next step in genetics:
Autism correlates strongly with the age of the father, but not with the age of the mother.
Plant genetics is harder to study because plants are often polyploid.
Some diseases arise only when multiple SNPs occur on the same strand; from the genotype alone, this is indistinguishable from the case where some of the mutations come from the mother and some from the father.
Formulating the Problem (1)
Human genomes differ only in a vector of SNPs (single-nucleotide polymorphisms), which is substantially smaller than the entire genome.
Infinite-sites assumption: the genome is so large that the likelihood of any site being triallelic (or more) is vanishingly small. Therefore, any SNP comes in only two versions.
Formulating the Problem (2)
We can take a person's vector of SNPs and map the more common allele to 0 and the less common allele to 1 (major and minor allele, by frequency). The genotype entry at each site is the sum of the two haplotype entries.

Alleles:  ACACTTGCT / ACAGGTGAT
Mapping:  010100010 / 010010000
Genotype: 020110010
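As a sketch of this encoding (not code from the talk), assuming the major allele at each site is known and passed in explicitly:

```python
# Sketch: encode two aligned haplotype strings as 0/1 vectors and a
# 0/1/2 genotype vector. The major-allele string is assumed known;
# the one used in the example below is hypothetical.

def encode_haplotype(hap, major):
    """Map each base to 0 if it matches the major allele, else 1."""
    return [0 if h == m else 1 for h, m in zip(hap, major)]

def encode_genotype(hap_a, hap_b, major):
    """Genotype entry = sum of the two haplotype entries:
    0 = homozygous major, 2 = homozygous minor, 1 = heterozygous."""
    a = encode_haplotype(hap_a, major)
    b = encode_haplotype(hap_b, major)
    return [x + y for x, y in zip(a, b)]
```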
Formulating the Problem (3)
The input to the algorithm is an m × n matrix of 0s, 1s, and 2s (genotypes).
The output of the algorithm is a 2m × n matrix of haplotypes: each genotype row is replaced by two haplotype rows that sum to it.

Input:  0010120
Output: 0010110
        0000010
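The consistency requirement between input and output can be checked mechanically; a minimal sketch (not from the talk), representing both matrices as lists of lists:

```python
# Sketch: check that a proposed 2m x n haplotype matrix is a valid
# phasing of an m x n genotype matrix. Rows 2i and 2i+1 of the
# haplotype matrix must sum element-wise to genotype row i.

def is_valid_phasing(genotypes, haplotypes):
    if len(haplotypes) != 2 * len(genotypes):
        return False
    for i, g_row in enumerate(genotypes):
        h1, h2 = haplotypes[2 * i], haplotypes[2 * i + 1]
        if any(g != a + b for g, a, b in zip(g_row, h1, h2)):
            return False
    return True
```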
Inferring a "Correct" Phasing
The goal of the algorithm is to produce the biologically accurate phasing of the input.
Several heuristics exist, but some assumptions must be made:
Parsimony: because haplotypes are inherited, phasings that use fewer distinct haplotypes tend to be more accurate than phasings that use more.
Haplotype "blocking": due to linkage disequilibrium, some sequences of SNPs are much more likely to occur together than others.
An Angle of Attack
We can determine these haplotype blocks algorithmically and biologically, then use the probability of a sequence occurring in a particular block to impute an ambiguous SNP (a maximum-likelihood approach).
Haplotypes that are never empirically observed are unlikely to occur, and haplotypes observed only a handful of times are likely due to sequencing error; discarding them reduces the state space.
Variable Memory
Some inferences can be made with local information:
"The man used a leash to walk the ___"
"…leash to walk the ___"
"…walk the ___"
Some inferences improve when you look further back:
"The Wright Brothers invented the ___"
"…Brothers invented the ___"
"…invented the ___"
Formalizing the Problem
We use probability theory to quantify our beliefs about what will happen next:
What is the most likely "next thing" to happen?
How likely is a particular chain of events?
Predictive Algorithms
We can observe a sample and calculate empirical probabilities.
How far back should we look in order to make good predictions?
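As a sketch of what "empirical probabilities" means here (my own illustration, not the talk's code), we can count next-symbol occurrences for every context up to a maximum memory length L:

```python
# Sketch: estimate next-symbol probabilities from a training string
# for every context (suffix) of length 0..max_len observed in it.
from collections import Counter, defaultdict

def empirical_next_symbol(sample, max_len):
    """Return {context: Counter of symbols seen after that context}."""
    counts = defaultdict(Counter)
    for i, sym in enumerate(sample):
        for k in range(min(i, max_len) + 1):
            counts[sample[i - k:i]][sym] += 1
    return counts

def prob(counts, context, symbol):
    """Empirical P(symbol | context); 0 if the context was never seen."""
    c = counts[context]
    total = sum(c.values())
    return c[symbol] / total if total else 0.0
```

Comparing `prob` across context lengths is one way to answer "how far back should we look": stop extending the context once the conditional distribution no longer changes.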
Solving the Phasing Problem
Variable-memory algorithms are well suited to the problem of long-range haplotype phasing:
They capture blocks of LD, which are of variable length, in a natural way.
Given a sequence with missing characters, what are the most probable missing characters?
What is the most probable phasing of an ambiguous genotype?
Representing Variable Memory
We can use a Probabilistic Finite Automaton M = (Q, Σ, τ, γ, π):

Q ------------------------ a finite set of states
Σ ------------------------ a finite alphabet
τ: Q × Σ → Q ------------- transition function
γ: Q × Σ → [0,1] --------- next-symbol probability
π: Q → [0,1] ------------- initial state probability
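The five-tuple translates directly into a data structure; a minimal sketch (my own, using dictionaries for τ, γ, and π):

```python
# Sketch: a Probabilistic Finite Automaton M = (Q, Sigma, tau, gamma, pi).
class PFA:
    def __init__(self, states, alphabet, tau, gamma, pi):
        self.states = states      # Q: finite set of states
        self.alphabet = alphabet  # Sigma: finite alphabet
        self.tau = tau            # dict (state, symbol) -> state
        self.gamma = gamma        # dict (state, symbol) -> probability
        self.pi = pi              # dict state -> initial probability

    def string_probability(self, start, s):
        """P(s | start state): multiply gamma along the tau-path."""
        p, q = 1.0, start
        for sym in s:
            p *= self.gamma[(q, sym)]
            q = self.tau[(q, sym)]
        return p
```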
Variable Memory Data Structures
The following can all represent a variable-memory model in equivalent ways:
L-order Markov chains
Probabilistic suffix automata (PSA)
Predictive suffix trees (PST)
Formalizing Long-Range Phasing
Q = {0,1}^n, n = 1…L
Σ = {0,1}
τ: Q × Σ → Q = {0,1}^n × {0,1} → {0,1}^n
γ: Q × Σ → [0,1] = {0,1}^n × {0,1} → [0,1]
π: Q → [0,1] = {0,1}^n → [0,1]
2-Order Markov Chain
Probabilistic Suffix Automata
Predictive Suffix Trees
Applications
We can use these models to:
Generate strings: given the suffix of a string, use the transition function and a random number generator to choose a character to append.
Calculate the likelihood of a string: at every position, find the longest relevant suffix and multiply in the probability of the character that follows.
Predict the next character of a string: given the suffix of a string, find the most likely character that would follow.
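The three applications can be sketched against a variable-memory model stored as a plain dictionary mapping contexts to next-symbol distributions (my own illustration; at each position we back off to the longest context the model knows):

```python
# Sketch: generate / score / predict with a variable-memory model
# of the form {context: {symbol: probability}}.
import random

def longest_suffix(model, history):
    """Longest suffix of `history` present in the model ("" always is)."""
    for k in range(len(history), -1, -1):
        if history[len(history) - k:] in model:
            return history[len(history) - k:]
    raise KeyError("model must contain the empty context")

def predict(model, history):
    """Most likely next symbol after the longest known suffix."""
    dist = model[longest_suffix(model, history)]
    return max(dist, key=dist.get)

def likelihood(model, s):
    """P(s): product of next-symbol probabilities along the string."""
    p = 1.0
    for i, sym in enumerate(s):
        p *= model[longest_suffix(model, s[:i])].get(sym, 0.0)
    return p

def generate(model, length, seed=0):
    """Sample a string of the given length, one symbol at a time."""
    rng = random.Random(seed)
    out = ""
    for _ in range(length):
        dist = model[longest_suffix(model, out)]
        out += rng.choices(list(dist), weights=list(dist.values()))[0]
    return out
```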
Remarks
L-order Markov chains are impractical because the number of states grows exponentially in L.
Probabilistic suffix automata are the most compact way to represent variable-order transition probabilities.
Predictive suffix trees are easy to "learn" with an algorithm.
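A heavily simplified sketch of PST learning (my own; the actual algorithm of Ron, Singer, and Tishby additionally keeps a context only if its next-symbol distribution differs enough from its parent's, and smooths the probabilities):

```python
# Simplified PST learning sketch: count every context up to a maximum
# depth, prune contexts seen too rarely, and store empirical
# next-symbol probabilities at the surviving nodes.
from collections import Counter, defaultdict

def learn_pst(sample, alphabet, max_depth, min_count=2):
    counts = defaultdict(Counter)
    for i, sym in enumerate(sample):
        for k in range(min(i, max_depth) + 1):
            counts[sample[i - k:i]][sym] += 1
    tree = {}
    for ctx, c in counts.items():
        # Keep the root unconditionally; prune rare contexts.
        if ctx == "" or sum(c.values()) >= min_count:
            total = sum(c.values())
            tree[ctx] = {a: c[a] / total for a in alphabet}
    return tree
```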
Further Remarks
We can use techniques from the analysis of Markov chains to prove properties of PSAs in general.
If a PST "learns" the distribution that respects some PSA, their equivalence implies that any property proved for the PSA also holds for the PST.
Rigorous proofs are carried out on the underlying PSA; applications rely on learning the equivalent PST.
Some Loose Ends
"Sufficiently similar" implies a notion of distance:
Kullback-Leibler divergence
Statistically significant difference:
Fisher's exact test
Pearson's chi-squared test
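As a sketch of the distance notion (not code from the talk), the KL divergence between two next-symbol distributions can be computed directly from its definition:

```python
# Sketch: Kullback-Leibler divergence D(p || q) between two
# next-symbol distributions given as {symbol: probability} dicts.
# Assumes q(x) > 0 wherever p(x) > 0.
import math

def kl_divergence(p, q):
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)
```

D(p || q) is zero exactly when the distributions agree, and grows as they diverge, which is what makes it usable as a "sufficiently similar" threshold when deciding whether to keep a deeper context.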
Top-Down vs. Bottom-Up
Top-down implementations are potentially as time-inefficient as constructing an L-order Markov chain, but more space-efficient in the long run.
This can be avoided by only considering strings that occur sufficiently often: nodes are populated by a preliminary pass over the input string.
Future Goals
A working implementation of a variant of this algorithm that learns a PST epsilon-close to the PSA that generates a long-range haplotype phasing.
A more thorough exploration of the consequences of the "merging nodes" step found in Browning and Browning.
A comparison of the results of a PST with pruning against those of a PST with merging, as found in the Browning & Browning long-range phasing algorithm.
An extension of the algorithm to include other phasing desiderata, such as parsimony and identity by descent (IBD).
Citations
Ron, Dana, Yoram Singer, and Naftali Tishby. "The power of amnesia: Learning probabilistic automata with variable memory length." Machine Learning 25.2-3 (1996): 117-149.
Browning, Brian L., and Sharon R. Browning. "Efficient multilocus association testing for whole genome association studies using localized haplotype clustering." Genetic Epidemiology 31.5 (2007): 365-375.