
1

HIDDEN MARKOV MODELS

Presentation by Jeff Rosenberg, Toru Sakamoto, Freeman Chen

Speaker notes (Jeffrey L. Rosenberg):
General questions: Do we want to have explicit places in the slides where we "introduce" the use of Mathematica? That is, make some kind of statement about what we are going to show with Mathematica, so that there is a natural place to switch to the appropriate notebook? Possibly we could even put a link to the notebook itself in the presentation, so that a button push would bring it up in Mathematica and, when we are done (and, presumably, exit Mathematica), we would return to where we left off, ready to continue. Maybe one thing we need to do right away is determine just which Mathematica examples/code we are going to show, and where to put them in the presentation.

Speaker notes (Jeffrey L. Rosenberg):
Do we really want all the fancy graphics whiz-bang stuff (on this slide and on others)? I think a lean/clean approach is better.

2

The Plan

• Modeling Biological Sequences

• Markov Chains

• Hidden Markov Models

• Issues

• Examples

• Techniques and Algorithms

• Doing it with Mathematica

3

Biological Sequences

FA12_HUMAN VVGGLVALRGAHPYIAALYWGHSFCAGSLIAPC

URT1_DESRO STGGLFTDITSHPWQAAIFAQNRRSSGERFLCGG

TRYP_PIG IVGGYTCAANSIPYQVSLNSGSHFCGGSLINSQWV

COGS_UCAPU IVGGVEAVPNSWPHQAALFIDDMYFCGGSLISPEW

TRY1_RAT IVGGYTCPEHSVPYQVSLNSGYHFCGGSLINDQWV

TRY1_SALSA IVGGYECKAYSQTHQVSLNSGYHFCGGSLVNENWV

NRPN_MOUSE ILEGRECIPHSQPWQAALFQGERLICGGVLVGDRW

TRY1_BOVIN IVGGYTCGANTVPYQVSLNSGYHFCGGSLINSQWV


4

Sequences and Models

• Many biological sequences (DNA/RNA, proteins) have very “subtle” rules for their structure; they clearly form “families” and are “related,” yet simple measures or descriptions of these relationships or rules rarely apply

• There is a need to create some kind of “model” that can be used to identify relationships among sequences and distinguish members of families from non-members

• Given the complexity and variability of these biological structures, any practical model must have a probabilistic component: that is, it will be a stochastic model, rather than a mechanistic one. It will be evaluated by the (statistical) accuracy and usefulness of its predictions, rather than by the correspondence of its internal features to any corresponding internal mechanism in the structures being modeled.

5

Markov Chains

• A system with a set of m possible states, Si; at each of a sequence of discrete points in time t>=0, the system is in exactly one of those states; the state at time t >= 0 is designated by qt; the movement from qt to qt+1 is probabilistic, and depends only on the states of the system at or prior to t.

• An initial state distribution π(i) = Prob(q0 = Si)

• Process terminates either at time T or when reaching a designated final state Sf
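To make the definitions concrete, here is a minimal sketch in Python (the presentation's own examples are in Mathematica; this is only an illustration) of a first-order chain over DNA symbols. The initial distribution and transition probabilities are invented for the example:

import random

# A first-order Markov chain over DNA symbols (illustrative probabilities only).
states = ["A", "C", "G", "T"]
initial = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}   # pi(i) = Prob(q0 = Si)
transition = {                                            # transition[s][r] = Prob(next = r | current = s)
    "A": {"A": 0.4, "C": 0.2, "G": 0.3, "T": 0.1},
    "C": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "G": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "T": {"A": 0.3, "C": 0.1, "G": 0.2, "T": 0.4},
}

def sample_chain(T):
    """Generate a state sequence q0..q(T-1) from the chain."""
    q = random.choices(states, weights=[initial[s] for s in states])[0]
    seq = [q]
    for _ in range(T - 1):
        q = random.choices(states, weights=[transition[q][s] for s in states])[0]
        seq.append(q)
    return "".join(seq)

print(sample_chain(30))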

6

Markov Chains of Order N

• Nth-order Markov chain (N >= 0): transition probabilities out of state qt depend only on the values of qt, qt-1,… qt-(N-1).

• Typically we deal with a 1st-order Markov chain, so only qt itself affects the transition probabilities.

• In a 1st-order chain, for each state Sj, there is a set of m probabilities for selecting the next state to move to: ai,j = Prob(qt+1 = Si | qt = Sj) [1 <= i <= m, t >= 0]

• If there is some ordering of states such that ai,j = 0 whenever i < j (i.e., no “non-trivial” loops), then this is a “linear” (or “left-to-right”) Markov process

• “Homogeneous” Markov model: ai,j is independent of t

7

Simple Markov Models

• Might use a Markov chain to model a sequence where the symbol in position n depends on the symbol(s) in position(s) n-1,… n-N.

• For example, if a protein is more likely to have Lys after a sequence Arg-Cys, this could be encoded as (a small part of) a 2nd-order Markov model (sketched below).

• If the probabilities of a given symbol are the same for all positions in the sequence, and independent of symbols in other positions, then one can use the “degenerate” 0th-order Markov chain, where the probability of a given symbol is constant, regardless of the preceding symbol (or of the position in the sequence).
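A rough illustration of how the Arg-Cys → Lys tendency above could sit inside a 2nd-order model: the conditional distribution of the next residue is keyed on the previous two. The probabilities in this Python sketch are invented; the shape of the table is the point:

# Second-order conditional probabilities: key = (residue at n-2, residue at n-1).
# The numbers below are invented purely to illustrate the data structure.
second_order = {
    ("R", "C"): {"K": 0.30, "A": 0.05, "G": 0.05},   # Lys unusually likely after Arg-Cys
    ("A", "C"): {"K": 0.04, "A": 0.08, "G": 0.07},   # background-like elsewhere
}

def prob_next(prev2, prev1, residue, background=0.05):
    """Prob(residue at position n | residues at n-2 and n-1); flat background as a fallback."""
    return second_order.get((prev2, prev1), {}).get(residue, background)

print(prob_next("R", "C", "K"))   # 0.30
print(prob_next("G", "S", "K"))   # 0.05 (background)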

8

Hidden Markov Models (HMMs)

• In a Markov chain (or model), the states are themselves visible; they can be considered the outputs of the system (or deterministically associated with those outputs).

• However, if each state can emit (generate) any of several possible outputs (symbols) vk, from an output alphabet O of M symbols, on a probabilistic basis, then it is not possible (in general) to determine the sequence of the states themselves; they are “hidden.”

• Classic example: the “urn” game (sketched below)
  – A set of N urns (states), each containing various colored balls (output symbols – a total of M colors available), behind a curtain
  – Player 1 selects an urn at random (with Markov assumptions), then picks a ball at random from that urn and announces its color to player 2
  – Player 1 then repeats the above process, a total of T times
  – Player 2 must determine the sequence of urns selected based on the sequence of colors announced
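A minimal Python sketch of the urn game as a generator (all probabilities invented for illustration): the urn sequence is the hidden state path, and the announced colors are the visible output:

import random

# The "urn" game as an HMM: urns are hidden states, ball colors are the output symbols.
urns = ["urn1", "urn2"]
colors = ["red", "green", "blue"]
initial = {"urn1": 0.6, "urn2": 0.4}
transition = {"urn1": {"urn1": 0.7, "urn2": 0.3},
              "urn2": {"urn1": 0.4, "urn2": 0.6}}
emission = {"urn1": {"red": 0.5, "green": 0.4, "blue": 0.1},   # b[i][k] = Prob(v_k | S_i)
            "urn2": {"red": 0.1, "green": 0.2, "blue": 0.7}}

def play(T):
    """Player 1's side of the game: pick urns, announce colors; the urns stay hidden."""
    state = random.choices(urns, weights=[initial[u] for u in urns])[0]
    hidden, visible = [], []
    for _ in range(T):
        hidden.append(state)
        color = random.choices(colors, weights=[emission[state][c] for c in colors])[0]
        visible.append(color)
        state = random.choices(urns, weights=[transition[state][u] for u in urns])[0]
    return hidden, visible

hidden, visible = play(10)
print("announced:", visible)   # what player 2 sees
print("hidden:   ", hidden)    # what player 2 must infer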

9

Additional Parameters for HMMs

• Now, in addition to the transition probabilities, each state has a prescribed probability distribution to emit or produce a symbol vk from O: bi,k = Prob(vk | Si)

• If qt = Si, then the generated output at time t is vk with probability bi,k.

• So, an HMM is “doubly stochastic” – both the (hidden) state transition process and the (visible) output symbol generation process are probabilistic.

10

Bayesian Aspects of HMM Usage

• Given an HMM M, we can relatively easily calculate the probability of occurrence of an arbitrary output sequence, s: P(s | M)

• However, we often want to determine the underlying set of states, transition probabilities, etc. (the model) that is “most likely” to have produced the output sequence s that we have observed: P(M | s)

• Bayes’ Formula for sequence “recognition”: P(M | s) = P(s | M) P(M) / P(s)

• Very hard to find this “absolute” probability: it depends on specific a priori probabilities we are unlikely to know


• Instead, make it a “discrimination” problem: define a “null model” N, find P(M | s) / P(N | s)

11

Issues in Using HMMs

• Model architecture/topology

• Training
  – Selecting an appropriate training set
  – Finding an “optimal” HMM that fits that set
  – Must avoid overfitting

• Scoring (for HMM construction, sequence recognition)
  – How likely is it that our sequence was generated by our HMM?
  – Versus some “null model” – this converts a very difficult recognition problem into a tractable discrimination problem
  – The score is the “relative likelihood” for our HMM

• Efficiency of evaluation
  – Pruning the search: dynamic programming
  – Using log-odds scores

12

A Simple HMM for Some DNA Sequences

[Figure: state diagram of a simple HMM for DNA sequences, labeling the states (Si), transition probabilities (ai,j), and emission probabilities (bi,k)]

13

HMMs and Multiple Alignments

• Can convert a multiple alignment into an HMM (sketched below):
  – Create a node for each column in which most sequences have an aligned residue
  – Columns with many missing letters go to Insert states
  – Emission probabilities are computed from the relative frequencies in the alignment column (for Match states), usually with the aid of a regularizer (to avoid zero-probability cases)
  – Emission probabilities for Insert states are taken from background frequencies

• Can also create a multiple alignment from a linear HMM:
  – Find the Viterbi (most likely) path through the HMM for each sequence
  – Each match state on that path creates a column in the alignment
  – Ignore letters emitted by insert states, or show them in lower case
  – Setting transition probabilities is equivalent to setting gap penalties in sequence alignment – more an art than a science.
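A sketch of the alignment-to-HMM conversion for Match columns (Python; the toy alignment and the 50% occupancy cutoff are assumptions for illustration, not the presenters' procedure):

from collections import Counter

# Toy alignment; '-' marks a gap. Real profile HMMs are built the same way at scale.
alignment = ["AC-GT",
             "ACAGT",
             "AC-GA",
             "GC-GT"]

MATCH_THRESHOLD = 0.5   # assumed cutoff: a Match column has residues in >50% of the rows

ncols = len(alignment[0])
match_columns, emissions = [], {}
for col in range(ncols):
    residues = [row[col] for row in alignment if row[col] != "-"]
    if len(residues) / len(alignment) > MATCH_THRESHOLD:
        match_columns.append(col)
        counts = Counter(residues)
        total = sum(counts.values())
        emissions[col] = {r: c / total for r, c in counts.items()}   # relative frequencies
    # otherwise: an Insert state; its emissions would come from background frequencies

print("Match columns:", match_columns)
print("Emission probabilities:", emissions)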

14

Protein Sequence Alignment

ALYW-------GHSFCAGSL
AIFAKHRRSPGERFLCGGIL
AIYRRHRG-GSVTYVCGGSL
AIFAQNRRSSGERFLCGGIL
ALFQGE------RLICGGVL
ALFIDD------MYFCGGSL
AIYHYS------SFQCGGVL
SLNS-------GSHFCGGSL

Three types of states:

1. Match states

2. Insertion states

3. Deletion states

15

Topology of Profile HMM

[Figure: profile HMM with match states M1–M4, insert states I0–I4, and delete states D1–D4]

Allowed transitions:

Mi -> Mi+1    Ii -> Mi+1    Di -> Mi+1
Mi -> Ii      Ii -> Ii      Di -> Ii
Mi -> Di+1    Ii -> Di+1    Di -> Di+1
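The nine transition types repeat at every position, so the topology can be generated mechanically; a small Python sketch (node naming follows the figure; begin/end handling is deliberately simplified):

def profile_hmm_edges(L):
    """Allowed transitions for a profile HMM with match states M1..M(L) (end states omitted)."""
    edges = [("I0", "M1"), ("I0", "I0"), ("I0", "D1")]          # the leading insert state
    for i in range(1, L):
        edges += [(f"M{i}", f"M{i+1}"), (f"M{i}", f"I{i}"), (f"M{i}", f"D{i+1}"),
                  (f"I{i}", f"M{i+1}"), (f"I{i}", f"I{i}"),  (f"I{i}", f"D{i+1}"),
                  (f"D{i}", f"M{i+1}"), (f"D{i}", f"I{i}"),  (f"D{i}", f"D{i+1}")]
    edges += [(f"M{L}", f"I{L}"), (f"I{L}", f"I{L}"), (f"D{L}", f"I{L}")]   # last column
    return edges

for edge in profile_hmm_edges(4):
    print(edge)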

16

Regularizers

• For avoiding overfitting to the training set

• Substitution matrices
  – Identify more likely amino acid substitutions, reflecting biochemical similarities/differences
  – Fixed for all positions in a sequence; one value for a given pair of amino acids

• Pseudocounts (sketched below)
  – For protein sequences, typically based on observed (relative) frequencies of various amino acids
  – “Universal” frequencies or position/type-dependent values

• Dirichlet mixtures
  – Probabilistic combinations of Dirichlet densities
  – Densities over probability distributions: i.e., the probability density of various distributions of symbols (in a given sequence position)
  – Used to generate data-dependent pseudocounts
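A sketch of the pseudocount idea (Python; the counts, background frequencies, and pseudocount weight are invented): observed column counts are blended with background-derived pseudocounts so that no residue ends up with probability zero:

# Observed residue counts in one alignment column (invented numbers).
observed = {"A": 6, "C": 0, "G": 3, "T": 1}

# Background frequencies, scaled by a pseudocount weight (here one "extra" sequence).
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
pseudocount_weight = 1.0

total = sum(observed.values()) + pseudocount_weight
regularized = {r: (observed[r] + pseudocount_weight * background[r]) / total
               for r in observed}

print(regularized)   # 'C' now gets a small non-zero probability instead of 0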

17

Algorithms for HMM Tasks (1)

• 3 major problems:
  – Determine the HMM parameters (given some HMM topology and a training set)
  – Calculate the (relative) likelihood of a given output sequence through a given HMM
  – Find the optimal (most likely) path through a given HMM for a specific output sequence, and its (relative) likelihood

• Forward/Backward Algorithm
  – Used for determining the parameters of an HMM from training set data
  – Calculates the probability of reaching a given state (from the initial state) and of generating the remainder of the observed training sequence from that state
  – Iteratively adjusts the model parameters

18

Algorithms for HMM Tasks (2)

• Baum-Welch (expectation-maximization, EM) Algorithm
  – Often used to determine the HMM parameters
  – Can also determine the most likely path for a (set of) output sequence(s)
  – Adds up probabilities over all possible paths
  – Then re-updates the parameters and iterates
  – Cannot guarantee a global optimum; very expensive

• Forward Algorithm (sketched below)
  – Calculates the probability of a particular output sequence given the HMM
  – Straightforward summation of products of (partial-path) probabilities

• Viterbi Algorithm
  – Classical dynamic programming algorithm
  – Chooses the “best” path (at each point), based on log-odds scores
  – Saves the results of subproblems and re-uses them as part of higher-level evaluations
  – More efficient than Baum-Welch
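A compact sketch of the Forward algorithm (Python), reusing the invented urn-game HMM from the earlier sketch; the final lines show the discrimination idea from slides 10–11, scoring against a flat null model. Viterbi is the same recursion with the sum replaced by a max (recording the argmax for traceback):

from math import log

# Forward algorithm: P(output sequence | HMM), summing over all state paths.
states = ["urn1", "urn2"]
initial = {"urn1": 0.6, "urn2": 0.4}
transition = {"urn1": {"urn1": 0.7, "urn2": 0.3},
              "urn2": {"urn1": 0.4, "urn2": 0.6}}
emission = {"urn1": {"red": 0.5, "green": 0.4, "blue": 0.1},
            "urn2": {"red": 0.1, "green": 0.2, "blue": 0.7}}

def forward_prob(outputs):
    """Return Prob(outputs | model) by straightforward summation of partial-path products."""
    # alpha[i] = Prob(first t outputs, state at time t = S_i)
    alpha = {i: initial[i] * emission[i][outputs[0]] for i in states}
    for symbol in outputs[1:]:
        alpha = {j: emission[j][symbol] * sum(alpha[i] * transition[i][j] for i in states)
                 for j in states}
    return sum(alpha.values())

seq = ["red", "blue", "blue", "green"]
p_model = forward_prob(seq)

# Discrimination against a flat null model (each of the 3 colors equally likely, i.i.d.).
p_null = (1 / 3) ** len(seq)
print("log-odds score:", log(p_model) - log(p_null))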

19

HMMs for Protein/Gene Sequence Analysis

• Using any of various means, identify a set of related sequences with conserved regions

• Make 1st-order Markov assumptions: transitions independent of sequence history and sequence content (other than at the substitution site itself)

• Construct an HMM based on the set of sequences

• Use this HMM to search for additional members of this family, possibly performing alignments
  – Search by comparing the fit to the HMM against the fit to some null model (sketched after the notes below)

• For phylogenetic trees, we are also concerned with the length of the paths involved and with shared intermediate states (sequences)

Speaker notes (Jeffrey L. Rosenberg):
I have a slightly different take on this: I think that HMMs are (or can be) used to determine the transition probabilities as well, via training; this is done from a set of sequences identified as being in a family by means of, for example, pairwise or multiple alignments performed on the basis of some substitution matrix and programs such as BLAST. Then the HMMs, having been "trained" to recognize members of the family, are used to search large databases for (additional) members of that family, and to create a multiple alignment among them.

Speaker notes (Jeffrey L. Rosenberg):
Also need to say something about scoring, and about the use of null models to "normalize" the scores.
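A schematic of the family-search loop described on this slide (Python; the threshold and both scoring functions are hypothetical placeholders, standing in for real HMM and null-model scorers such as the Forward algorithm on slide 18):

# Hypothetical database search: keep sequences whose log-odds score
# (fit to the family HMM minus fit to the null model) clears a cutoff.
SCORE_THRESHOLD = 20.0   # assumed cutoff, in log-odds units

def search_family(database, log_prob_hmm, log_prob_null):
    """database: {name: sequence}; the two scoring functions are supplied by the caller."""
    hits = []
    for name, seq in database.items():
        score = log_prob_hmm(seq) - log_prob_null(seq)
        if score > SCORE_THRESHOLD:
            hits.append((name, score))
    return sorted(hits, key=lambda hit: -hit[1])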

20

Very Simple Viterbi

min = -10^99 (i.e., effectively -∞)

DNA Sequence Alignment – with a value of +1 for a match, -1 for a mismatch, -2 for a “gap”; remember the “best” score at each step

Step 1 (first row filled):

       A    C    T    A
   0  min  min  min  min
A  min   1   -1   -3   -5

Step 2:

       A    C    T    A
   0  min  min  min  min
A  min   1   -1   -3   -5
T  min  -1    0    0   -2

Step 3:

       A    C    T    A
   0  min  min  min  min
A  min   1   -1   -3   -5
T  min  -1    0    0   -2
G  min  -3   -2   -1   -1

Step 4 (complete):

       A    C    T    A
   0  min  min  min  min
A  min   1   -1   -3   -5
T  min  -1    0    0   -2
G  min  -3   -2   -1   -1
A  min  -5   -4   -3    0

21

Very Simple Viterbi: Traceback

       A    C    T    A
   0  min  min  min  min
A  min   1   -1   -3   -5
T  min  -1    0    0   -2
G  min  -3   -2   -1   -1
A  min  -5   -4   -3    0

The traceback follows the “best” cells from the lower-right corner back toward the upper-left (here the main diagonal); the sketch below implements both the fill and the traceback.
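A Python sketch of the fill and traceback shown on slides 20–21; with match +1, mismatch -1, gap -2, and an effectively impossible score for the first row and column, it reproduces the matrix above:

MIN = float("-inf")          # the slides' "min": an effectively impossible score
MATCH, MISMATCH, GAP = 1, -1, -2

def align(top, side):
    """Fill the DP matrix exactly as on the slide, then trace back the best path."""
    rows, cols = len(side) + 1, len(top) + 1
    score = [[MIN] * cols for _ in range(rows)]
    score[0][0] = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = MATCH if side[i - 1] == top[j - 1] else MISMATCH
            score[i][j] = max(score[i - 1][j - 1] + diag,   # match/mismatch
                              score[i - 1][j] + GAP,        # gap in the top sequence
                              score[i][j - 1] + GAP)        # gap in the side sequence
    # Traceback: at each step, move to the neighbour that produced the "best" value.
    i, j, path = rows - 1, cols - 1, []
    while i > 0 and j > 0:
        path.append((i, j))
        diag = MATCH if side[i - 1] == top[j - 1] else MISMATCH
        if score[i][j] == score[i - 1][j - 1] + diag:
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + GAP:
            i -= 1
        else:
            j -= 1
    return score, list(reversed(path))

score, path = align("ACTA", "ATGA")
for row in score[1:]:
    print(row[1:])               # prints the filled 4x4 block from the slide
print("traceback cells:", path)  # the diagonal path, ending with score 0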

22

Onward to Mathematica!