A Mathematically Rigorous Algorithm for Haplotype Phasing
Approaching the Long-Range Phasing Problem Using Variable-Memory Markov Chains
Samuel Angelo Crisanto
2015 Undergraduate Research Symposium, Brown University
The Issue at Hand
Modern genome assembly techniques fragment human DNA into pieces short enough to be sequenced by current technology.
The reads are then algorithmically reassembled using overlaps between the fragments.
Humans are diploid: they carry one copy of each chromosome from the mother and one from the father. The assembly process destroys the information about which fragment came from which chromosome.
Relevance
Accurate haplotype phasing is an important next step in genetics:
Autism correlates strongly with the age of the father, but not with the age of the mother.
Plant genetics is harder to study because plants are often polyploid.
Some diseases arise only when multiple SNPs occur on the same strand; from the genotype alone, this is indistinguishable from the case where some of the mutations come from the mother and some from the father.
Formulating the Problem (1)
Human genomes differ only in a vector of SNPs (single-nucleotide polymorphisms), which is substantially smaller than the entire genome.
Infinite-sites assumption: the genome is so large that the likelihood of any site being triallelic (or more) is vanishingly small. Therefore, any SNP comes in only two versions.
Formulating the Problem (2)
We can take a person's vector of SNPs and map the more common allele to 0 and the less common allele to 1 (major and minor allele, by frequency). The genotype entry at each site is the sum of the two haplotype entries.

Alleles:  ACACTTGCT / ACAGGTGAT
Mapping:  010100010 / 010010000
Genotype: 020110010
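As a sketch of this encoding (not code from the talk), assuming the major allele at each site is known and passed in explicitly:

```python
# Sketch: encode two aligned haplotype strings as 0/1 vectors and a
# 0/1/2 genotype vector. The major-allele string is assumed known;
# the one used in the example below is hypothetical.

def encode_haplotype(hap, major):
    """Map each base to 0 if it matches the major allele, else 1."""
    return [0 if h == m else 1 for h, m in zip(hap, major)]

def encode_genotype(hap_a, hap_b, major):
    """Genotype entry = sum of the two haplotype entries:
    0 = homozygous major, 2 = homozygous minor, 1 = heterozygous."""
    a = encode_haplotype(hap_a, major)
    b = encode_haplotype(hap_b, major)
    return [x + y for x, y in zip(a, b)]
```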
Formulating the Problem (3)
The input to the algorithm is an m × n matrix of 0s, 1s, and 2s (genotypes).
The output of the algorithm is a 2m × n matrix of haplotypes: each genotype row is replaced by two haplotype rows that sum to it.

Input:  0010120
Output: 0010110
        0000010
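The consistency requirement between input and output can be checked mechanically; a minimal sketch (not from the talk), representing both matrices as lists of lists:

```python
# Sketch: check that a proposed 2m x n haplotype matrix is a valid
# phasing of an m x n genotype matrix. Rows 2i and 2i+1 of the
# haplotype matrix must sum element-wise to genotype row i.

def is_valid_phasing(genotypes, haplotypes):
    if len(haplotypes) != 2 * len(genotypes):
        return False
    for i, g_row in enumerate(genotypes):
        h1, h2 = haplotypes[2 * i], haplotypes[2 * i + 1]
        if any(g != a + b for g, a, b in zip(g_row, h1, h2)):
            return False
    return True
```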
Inferring a "Correct" Phasing
The goal of the algorithm is to produce the biologically accurate phasing of the input.
Several heuristics exist, but some assumptions must be made:
Parsimony: because haplotypes are inherited, phasings that use fewer distinct haplotypes tend to be more accurate than phasings that use more.
Haplotype "blocking": due to linkage disequilibrium, some sequences of SNPs are much more likely to occur together than others.
An Angle of Attack
We can determine these haplotype blocks algorithmically and biologically, then use the probability of a sequence occurring in a particular block to impute an ambiguous SNP (a maximum-likelihood approach).
Haplotypes that are never empirically observed are unlikely to occur, and haplotypes observed only a handful of times are likely due to sequencing error; discarding them reduces the state space.
Variable Memory
Some inferences can be made with local information:
"The man used a leash to walk the ___"
"…leash to walk the ___"
"…walk the ___"
Some inferences improve when you look further back:
"The Wright Brothers invented the ___"
"…Brothers invented the ___"
"…invented the ___"
Formalizing the Problem
We use probability theory to quantify our beliefs about what will happen next:
What is the most likely "next thing" to happen?
How likely is a particular chain of events?
Predictive Algorithms
We can observe a sample and calculate empirical probabilities.
How far back should we look in order to make good predictions?
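As a sketch of what "empirical probabilities" means here (my own illustration, not the talk's code), we can count next-symbol occurrences for every context up to a maximum memory length L:

```python
# Sketch: estimate next-symbol probabilities from a training string
# for every context (suffix) of length 0..max_len observed in it.
from collections import Counter, defaultdict

def empirical_next_symbol(sample, max_len):
    """Return {context: Counter of symbols seen after that context}."""
    counts = defaultdict(Counter)
    for i, sym in enumerate(sample):
        for k in range(min(i, max_len) + 1):
            counts[sample[i - k:i]][sym] += 1
    return counts

def prob(counts, context, symbol):
    """Empirical P(symbol | context); 0 if the context was never seen."""
    c = counts[context]
    total = sum(c.values())
    return c[symbol] / total if total else 0.0
```

Comparing `prob` across context lengths is one way to answer "how far back should we look": stop extending the context once the conditional distribution no longer changes.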
Solving the Phasing Problem
Variable-memory algorithms are well suited to the problem of long-range haplotype phasing:
They capture blocks of LD, which are of variable length, in a natural way.
Given a sequence with missing characters, what are the most probable missing characters?
What is the most probable phasing of an ambiguous genotype?
Representing Variable Memory
We can use a Probabilistic Finite Automaton M = (Q, Σ, τ, γ, π):

Q ------------------------ a finite set of states
Σ ------------------------ a finite alphabet
τ: Q × Σ → Q ------------- transition function
γ: Q × Σ → [0,1] --------- next-symbol probability
π: Q → [0,1] ------------- initial state probability
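The five-tuple translates directly into a data structure; a minimal sketch (my own, using dictionaries for τ, γ, and π):

```python
# Sketch: a Probabilistic Finite Automaton M = (Q, Sigma, tau, gamma, pi).
class PFA:
    def __init__(self, states, alphabet, tau, gamma, pi):
        self.states = states      # Q: finite set of states
        self.alphabet = alphabet  # Sigma: finite alphabet
        self.tau = tau            # dict (state, symbol) -> state
        self.gamma = gamma        # dict (state, symbol) -> probability
        self.pi = pi              # dict state -> initial probability

    def string_probability(self, start, s):
        """P(s | start state): multiply gamma along the tau-path."""
        p, q = 1.0, start
        for sym in s:
            p *= self.gamma[(q, sym)]
            q = self.tau[(q, sym)]
        return p
```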
Variable Memory Data Structures
The following can all represent a variable-memory model in equivalent ways:
L-order Markov chains
Probabilistic suffix automata (PSA)
Predictive suffix trees (PST)
Formalizing Long-Range Phasing
Q = {0,1}^n, n = 1…L
Σ = {0,1}
τ: Q × Σ → Q = {0,1}^n × {0,1} → {0,1}^n
γ: Q × Σ → [0,1] = {0,1}^n × {0,1} → [0,1]
π: Q → [0,1] = {0,1}^n → [0,1]
2-Order Markov Chain
Probabilistic Suffix Automata
Predictive Suffix Trees
Applications
We can use these models to:
Generate strings: given the suffix of a string, use the transition function and a random number generator to choose a character to append.
Calculate the likelihood of a string: at every position, find the longest relevant suffix and multiply in the probability of the character that follows.
Predict the next character of a string: given the suffix of a string, find the most likely character that would follow.
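The three applications can be sketched against a variable-memory model stored as a plain dictionary mapping contexts to next-symbol distributions (my own illustration; at each position we back off to the longest context the model knows):

```python
# Sketch: generate / score / predict with a variable-memory model
# of the form {context: {symbol: probability}}.
import random

def longest_suffix(model, history):
    """Longest suffix of `history` present in the model ("" always is)."""
    for k in range(len(history), -1, -1):
        if history[len(history) - k:] in model:
            return history[len(history) - k:]
    raise KeyError("model must contain the empty context")

def predict(model, history):
    """Most likely next symbol after the longest known suffix."""
    dist = model[longest_suffix(model, history)]
    return max(dist, key=dist.get)

def likelihood(model, s):
    """P(s): product of next-symbol probabilities along the string."""
    p = 1.0
    for i, sym in enumerate(s):
        p *= model[longest_suffix(model, s[:i])].get(sym, 0.0)
    return p

def generate(model, length, seed=0):
    """Sample a string of the given length, one symbol at a time."""
    rng = random.Random(seed)
    out = ""
    for _ in range(length):
        dist = model[longest_suffix(model, out)]
        out += rng.choices(list(dist), weights=list(dist.values()))[0]
    return out
```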
Remarks
L-order Markov chains are impractical because the number of states grows exponentially in L.
Probabilistic suffix automata are the most compact way to represent variable-order transition probabilities.
Predictive suffix trees are easy to "learn" with an algorithm.
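A heavily simplified sketch of PST learning (my own; the actual algorithm of Ron, Singer, and Tishby additionally keeps a context only if its next-symbol distribution differs enough from its parent's, and smooths the probabilities):

```python
# Simplified PST learning sketch: count every context up to a maximum
# depth, prune contexts seen too rarely, and store empirical
# next-symbol probabilities at the surviving nodes.
from collections import Counter, defaultdict

def learn_pst(sample, alphabet, max_depth, min_count=2):
    counts = defaultdict(Counter)
    for i, sym in enumerate(sample):
        for k in range(min(i, max_depth) + 1):
            counts[sample[i - k:i]][sym] += 1
    tree = {}
    for ctx, c in counts.items():
        # Keep the root unconditionally; prune rare contexts.
        if ctx == "" or sum(c.values()) >= min_count:
            total = sum(c.values())
            tree[ctx] = {a: c[a] / total for a in alphabet}
    return tree
```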
Further Remarks
We can use techniques from the analysis of Markov chains to prove properties of PSAs in general.
If a PST "learns" the distribution that respects some PSA, their equivalence implies that any property proved for the PSA also holds for the PST.
Rigorous proofs are carried out on the underlying PSA; applications rely on learning the equivalent PST.
Some Loose Ends
"Sufficiently similar" implies a notion of distance:
Kullback-Leibler divergence
Statistically significant difference:
Fisher's exact test
Pearson's chi-squared test
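As a sketch of the distance notion (not code from the talk), the KL divergence between two next-symbol distributions can be computed directly from its definition:

```python
# Sketch: Kullback-Leibler divergence D(p || q) between two
# next-symbol distributions given as {symbol: probability} dicts.
# Assumes q(x) > 0 wherever p(x) > 0.
import math

def kl_divergence(p, q):
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)
```

D(p || q) is zero exactly when the distributions agree, and grows as they diverge, which is what makes it usable as a "sufficiently similar" threshold when deciding whether to keep a deeper context.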
Top-Down vs. Bottom-Up
Top-down implementations are potentially as time-inefficient as constructing an L-order Markov chain, but more space-efficient in the long run.
This can be avoided by only considering strings that occur sufficiently often: nodes are populated by a preliminary pass over the input string.
Future Goals
A working implementation of a variant of this algorithm that learns a PST epsilon-close to the PSA that generates a long-range haplotype phasing.
A more thorough exploration of the consequences of the "merging nodes" step found in Browning and Browning.
A comparison of the results of a PST with pruning against those of a PST with merging, as found in the Browning & Browning long-range phasing algorithm.
An extension of the algorithm to include other phasing desiderata, such as parsimony and identity by descent (IBD).
Citations
Ron, Dana, Yoram Singer, and Naftali Tishby. "The power of amnesia: Learning probabilistic automata with variable memory length." Machine Learning 25.2-3 (1996): 117-149.
Browning, Brian L., and Sharon R. Browning. "Efficient multilocus association testing for whole genome association studies using localized haplotype clustering." Genetic Epidemiology 31.5 (2007): 365-375.