# Hidden Markov Models: Variants; Conditional Random Fields


• Slide 1
• Hidden Markov Models: Variants; Conditional Random Fields
  [Title figure: HMM trellis with K states at each position, emitting x1, x2, x3, …, xK]
• Slide 2
• CS262 Lecture 7, Win06, Batzoglou
• Two learning scenarios
  1. Estimation when the right answer is known. Examples:
     GIVEN: a genomic region x = x1 … x1,000,000 where we have good (experimental) annotations of the CpG islands
     GIVEN: the casino player allows us to observe him one evening, as he changes dice and produces 10,000 rolls
  2. Estimation when the right answer is unknown. Examples:
     GIVEN: the porcupine genome; we don't know how frequent the CpG islands are there, nor do we know their composition
     GIVEN: 10,000 rolls of the casino player, but we don't see when he changes dice
  QUESTION: Update the parameters θ of the model to maximize P(x | θ)
• Slide 3
• 1. When the true parse is known
  Given x = x1 … xN for which the true parse π = π1 … πN is known,
  simply count up the number of times each transition & emission is taken!
  Define:
    A_kl = # times the k → l transition occurs in π
    E_k(b) = # times state k emits b in x
  We can show that the maximum-likelihood parameters (maximizing P(x | θ)) are:
    a_kl = A_kl / Σ_i A_ki        e_k(b) = E_k(b) / Σ_c E_k(c)
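The counting on this slide is simple enough to sketch directly. A minimal version, assuming x and π are given as equal-length strings; the function name `ml_estimate` is mine, not from the lecture:

```python
from collections import defaultdict

def ml_estimate(x, pi):
    """Count transitions A_kl and emissions E_k(b) along the known parse pi,
    then normalize each row to get the ML parameters a_kl and e_k(b)."""
    A = defaultdict(lambda: defaultdict(int))
    E = defaultdict(lambda: defaultdict(int))
    for i in range(len(x)):
        E[pi[i]][x[i]] += 1                 # state pi[i] emitted x[i]
        if i + 1 < len(pi):
            A[pi[i]][pi[i + 1]] += 1        # transition pi[i] -> pi[i+1]
    a = {k: {l: c / sum(row.values()) for l, c in row.items()} for k, row in A.items()}
    e = {k: {b: c / sum(row.values()) for b, c in row.items()} for k, row in E.items()}
    return a, e
```

For example, with the casino's Loaded/Fair states, `ml_estimate("616266", "LLLFFF")` gives a_LF = 1/3 (one L → F among three L transitions) and e_L(6) = 2/3.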
• Slide 4
• 2. When the true parse is unknown: the Baum-Welch algorithm
  Compute the expected # of times each transition & emission is taken!
  Initialization: pick a best-guess for the model parameters (or arbitrary ones)
  Iteration:
    1. Forward
    2. Backward
    3. Calculate A_kl, E_k(b), given θ^CURRENT
    4. Calculate the new model parameters θ^NEW: a_kl, e_k(b)
    5. Calculate the new log-likelihood P(x | θ^NEW)
       (guaranteed to be higher, by expectation-maximization)
  Until P(x | θ) does not change much
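The five-step iteration above can be sketched in NumPy for a small alphabet. This is a minimal version with no log-space scaling (which a real implementation needs for long sequences) and a fixed initial distribution; all names (`forward`, `backward`, `baum_welch`) are mine:

```python
import numpy as np

def forward(x, a, e, init):
    """f[i][k] = P(x1..xi, pi_i = k); step 1 of the iteration."""
    f = np.zeros((len(x), a.shape[0]))
    f[0] = init * e[:, x[0]]
    for i in range(1, len(x)):
        f[i] = (f[i - 1] @ a) * e[:, x[i]]
    return f

def backward(x, a, e):
    """b[i][k] = P(x_{i+1}..xN | pi_i = k); step 2."""
    b = np.zeros((len(x), a.shape[0]))
    b[-1] = 1.0
    for i in range(len(x) - 2, -1, -1):
        b[i] = a @ (e[:, x[i + 1]] * b[i + 1])
    return b

def baum_welch(x, a, e, init, n_iter=20):
    for _ in range(n_iter):
        f, b = forward(x, a, e, init), backward(x, a, e)
        px = f[-1].sum()                      # P(x | theta^CURRENT)
        A = np.zeros_like(a)                  # expected transition counts A_kl
        for i in range(len(x) - 1):
            A += np.outer(f[i], e[:, x[i + 1]] * b[i + 1]) * a / px
        E = np.zeros_like(e)                  # expected emission counts E_k(b)
        for i in range(len(x)):
            E[:, x[i]] += f[i] * b[i] / px
        a = A / A.sum(axis=1, keepdims=True)  # step 4: theta^NEW
        e = E / E.sum(axis=1, keepdims=True)
    return a, e
```

By the EM guarantee on the slide, P(x | θ) computed by the forward pass never decreases across iterations.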
• Slide 5
• Variants of HMMs
• Slide 6
• Higher-order HMMs
  How do we model memory longer than one time point?
    First order:  P(π_{i+1} = l | π_i = k) = a_kl
    Second order: P(π_{i+1} = l | π_i = k, π_{i-1} = j) = a_jkl
  A second-order HMM with K states is equivalent to a first-order HMM with K² states:
    states H, T with history-dependent transitions a_HT(prev = H), a_HT(prev = T), a_TH(prev = H), a_TH(prev = T)
  become
    states HH, HT, TH, TT with ordinary transitions a_HHT, a_HTH, a_HTT, a_THH, a_THT, a_TTH
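The equivalence on this slide is mechanical: a pair-state (j, k) can only move to a pair-state (k, l), with probability a_jkl. A sketch of the state-space expansion (function name mine):

```python
from itertools import product

def expand_second_order(states, a2):
    """Second-order transitions a2[(j, k, l)] = P(pi_{i+1}=l | pi_i=k, pi_{i-1}=j)
    become first-order transitions over pair-states: (j, k) -> (k, l)."""
    pair_states = list(product(states, repeat=2))
    a1 = {}
    for (j, k) in pair_states:
        for (k2, l) in pair_states:
            # (j, k) -> (k2, l) is possible only when the histories agree: k == k2
            a1[((j, k), (k2, l))] = a2.get((j, k, l), 0.0) if k == k2 else 0.0
    return pair_states, a1
```

With K = 2 coin states this produces the four states HH, HT, TH, TT of the slide, and each pair-state's outgoing probabilities still sum to 1.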
• Slide 7
• Modeling the duration of states
  [Figure: two states X, Y with self-loop probabilities p, q and cross transitions 1-p, 1-q]
  Length distribution of region X: geometric, with mean E[l_X] = 1/(1-p)
  This is a significant disadvantage of HMMs
  Several solutions exist for modeling different length distributions
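The geometric claim is easy to check by simulation: each extra turn in X happens with probability p, so stays have mean 1/(1-p). A small sketch (function name mine):

```python
import random

def sample_duration(p, rng):
    """Length of one stay in state X when X self-loops with probability p:
    geometric, with mean E[l_X] = 1/(1-p)."""
    d = 1
    while rng.random() < p:   # stay in X with probability p
        d += 1
    return d
```

With p = 0.8 the empirical mean over many samples is close to 1/(1-0.8) = 5; note the mode is always 1, which is why this shape fits biological region lengths (e.g. exons) so poorly.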
• Slide 8
• Example: exon lengths in genes
• Slide 9
• Solution 1: Chain several states
  [Figure: a chain of copies of X, the last with self-loop probability p, then Y]
  l_X = C + Geometric with mean 1/(1-p)
  Disadvantage: still very inflexible
• Slide 10
• Solution 2: Negative binomial distribution
  [Figure: states X(1), X(2), …, X(n), each with self-loop probability p and exit probability 1-p, then Y]
  Duration in X: m turns, where
    during the first m-1 turns, exactly n-1 arrows to the next state are followed, and
    during the m-th turn, an arrow to the next state is followed
  P(l_X = m) = C(m-1, n-1) (1-p)^{(n-1)+1} p^{(m-1)-(n-1)} = C(m-1, n-1) (1-p)^n p^{m-n}
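The pmf on this slide can be checked numerically: it should sum to 1 over m ≥ n, and since the duration is a sum of n independent geometric stays, its mean should be n/(1-p). A sketch (function name mine):

```python
from math import comb

def negbin_duration_pmf(m, n, p):
    """P(l_X = m): choose which n-1 of the first m-1 turns follow an exit
    arrow (prob 1-p each), exit on turn m; the other m-n turns self-loop."""
    return comb(m - 1, n - 1) * (1 - p) ** n * p ** (m - n)
```

For n = 3, p = 0.5 (close to the EasyGene setting two slides down), the probabilities sum to 1 and the mean duration is 3/(1-0.5) = 6.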
• Slide 11
• Example: genes in prokaryotes
  EasyGene: a prokaryotic gene-finder (Larsen TS, Krogh A)
  Negative binomial with n = 3
• Slide 12
• Solution 3: Duration modeling
  Upon entering a state:
    1. Choose a duration d, according to a probability distribution
    2. Generate d letters according to the emission probabilities
    3. Take a transition to the next state according to the transition probabilities
  Disadvantage: increased complexity of Viterbi:
    Time: ×O(D), Space: ×O(1), where D = maximum duration of a state
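The three-step generative process above can be sketched directly; this is the generator only (not the O(D) Viterbi decoder), and the function name and dict-based representation are mine:

```python
import random

def generate(dur, emit, trans, start, n, rng):
    """Explicit-duration generator: upon entering a state,
    (1) choose a duration d from dur[state],
    (2) emit d letters from emit[state],
    (3) transition to the next state via trans[state].
    Each dict maps outcomes to probabilities."""
    seq, state = [], start
    while len(seq) < n:
        d = rng.choices(list(dur[state]), weights=list(dur[state].values()))[0]
        seq += rng.choices(list(emit[state]), weights=list(emit[state].values()), k=d)
        state = rng.choices(list(trans[state]), weights=list(trans[state].values()))[0]
    return "".join(seq)
```

Unlike a plain HMM, this can produce non-geometric run lengths, e.g. a state whose stays are always exactly 3 letters long.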
• Features that depend on many positions in x
  What do we put in g(k, l, x, i)? The higher g(k, l, x, i) is, the more we like going from k to l at position i.
  This additional power allows richer models. Examples:
    The casino player looks at the previous 100 positions; if > 50 sixes, he likes to go to Fair:
      g(Loaded, Fair, x, i) += 1[x_{i-100}, …, x_{i-1} has > 50 sixes] × w_DONT_GET_CAUGHT
    Genes are close to CpG islands; for any state k:
      g(k, exon, x, i) += 1[x_{i-1000}, …, x_{i+1000} has > 1/16 CpG] × w_CG_RICH_REGION
• Slide 19
• Features that depend on many positions in x: Conditional Random Field features
  1. Define a set of features that you think are important
     All features should be functions of the current state, the previous state, x, and the position i
     Example:
       Old features: transition k → l, emission of b from state k
       Plus new features: the previous 100 letters have > 50 sixes
     Number the features 1 … n: f_1(k, l, x, i), …, f_n(k, l, x, i)
       (features are indicator true/false variables)
     Find appropriate weights w_1, …, w_n for when each feature is true
       (the weights are the parameters of the model)
  2. Let's assume for now that each feature f_j has a weight w_j
     Then g(k, l, x, i) = Σ_{j=1..n} w_j f_j(k, l, x, i)
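The weighted feature sum defining g can be sketched in a few lines, with features as plain callables; the names `make_g` and the example features are mine:

```python
def make_g(features, weights):
    """g(k, l, x, i) = sum_j w_j * f_j(k, l, x, i) for indicator features f_j."""
    def g(k, l, x, i):
        return sum(w * f(k, l, x, i) for f, w in zip(features, weights))
    return g
```

A toy usage, with one "old-style" transition feature and one feature that looks back several positions in x (a scaled-down version of the slide's 100-roll window):

```python
features = [
    lambda k, l, x, i: 1 if (k, l) == ("Loaded", "Fair") else 0,   # transition feature
    lambda k, l, x, i: 1 if x[max(0, i - 3):i].count("6") >= 2 else 0,  # recent sixes
]
g = make_g(features, [1.0, 2.0])
```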
• Slide 20
• Features that depend on many positions in x
  Define V_k(i): the optimal score of parsing x1 … xi and ending in state k
  Then, assuming V_k(i) is optimal for every k at position i, it follows that
    V_l(i+1) = max_k [ V_k(i) + g(k, l, x, i+1) ]
  Why? Even though at position i+1 we look at arbitrary positions in x, we are only affected by the choice of ending state k
  Therefore, the Viterbi algorithm again finds the optimal (highest-scoring) parse for x1 … xN
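The recurrence V_l(i+1) = max_k [V_k(i) + g(k, l, x, i+1)] can be sketched as a generalized Viterbi with backpointers. The function name is mine, and `g0`, scoring the initial state, is my addition (the slide does not specify initialization):

```python
def viterbi_crf(x, states, g, g0=None):
    """Highest-scoring parse under V_l(i+1) = max_k [V_k(i) + g(k, l, x, i+1)].
    g0(l, x, 0) scores the initial state; defaults to 0 for all states."""
    V = {l: (g0(l, x, 0) if g0 else 0.0) for l in states}
    back = []                                  # backpointers per position
    for i in range(1, len(x)):
        newV, ptr = {}, {}
        for l in states:
            k_best = max(states, key=lambda k: V[k] + g(k, l, x, i))
            ptr[l] = k_best
            newV[l] = V[k_best] + g(k_best, l, x, i)
        back.append(ptr)
        V = newV
    last = max(states, key=lambda l: V[l])     # best ending state
    path = [last]
    for ptr in reversed(back):                 # trace backpointers
        path.append(ptr[path[-1]])
    return path[::-1], V[last]
```

Note that g may inspect all of x, exactly as the slide argues; only the previous state k matters for the recursion.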
• Slide 21
• Features that depend on many positions in x
  The score of a parse depends on all of x at each position
  We can still run Viterbi because state π_i only looks at the previous state π_{i-1} and the constant sequence x
  [Figure: graphical models of an HMM (each x_i generated by its π_i alone) vs. a CRF (each transition can see all of x)]
• Slide 22
• How many parameters are there, in general?
  Arbitrarily many parameters!
  For example, let f_j(k, l, x, i) depend on x_{i-5}, x_{i-4}, …, x_{i+5}
  Then we would have up to K · |Σ|^11 parameters!
  Advantage: a powerful, expressive model
    Example: if there are more than 50 sixes in the last 100 rolls, but at most 3 sixes in the surrounding 18 rolls, this is evidence we are in the Fair state
    Interpretation: the casino player is afraid of being caught, so he switches to Fair when he sees too many sixes
    Example: if there are any CG-rich regions in the vicinity (a window of 2000 positions), then favor predicting lots of genes in this region
  Question: how do we train these parameters?
• Slide 23
• Conditional Training
  Hidden Markov Model training: given a training sequence x and its true parse π, maximize P(x, π)
  Disadvantage: P(x, π) = P(π | x) P(x)
    P(π | x): the quantity we care about, so as to get a good parse
    P(x): a quantity we don't care so much about, because x is always given
• Slide 24
• Conditional Training
  P(x, π) = P(π | x) P(x)
  P(π | x) = P(x, π) / P(x)
  Recall F(j, x, π) = # times feature f_j occurs in (x, π)
                    = Σ_{i=1..N} f_j(π_{i-1}, π_i, x, i)   (count f_j in (x, π))
  In HMMs, let's denote by w_j the weight of the j-th feature: w_j = log(a_kl) or log(e_k(b))
  Then:
    HMM: P(x, π) = exp[ Σ_{j=1..n} w_j F(j, x, π) ]
    CRF: Score(x, π) = exp[ Σ_{j=1..n} w_j F(j, x, π) ]
• Slide 25
• Conditional Training
  In HMMs:
    P(π | x) = P(x, π) / P(x)
    P(x, π) = exp[ Σ_{j=1..n} w_j F(j, x, π) ]
    P(x) = Σ_π exp[ Σ_{j=1..n} w_j F(j, x, π) ] =: Z
  Then, in a CRF we can do the same to normalize Score(x, π) into a probability:
    P_CRF(π | x) = exp[ Σ_{j=1..n} w_j F(j, x, π) ] / Z
  QUESTION: Why is this a probability?
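The normalization can be verified on a toy model. This sketch computes Z by brute-force enumeration of all K^N paths rather than the efficient sum-of-paths recursion the next slide mentions, and it scores only transition features (no initial-state term); the function name is mine:

```python
from itertools import product
from math import exp

def crf_prob(x, pi, states, g):
    """P_CRF(pi | x) = exp(sum_i g(pi_{i-1}, pi_i, x, i)) / Z, with
    Z = sum over all paths of the exponentiated score (brute force here;
    a forward-style sum-of-paths recursion computes Z in O(N K^2))."""
    def score(path):
        return sum(g(path[i - 1], path[i], x, i) for i in range(1, len(x)))
    Z = sum(exp(score(p)) for p in product(states, repeat=len(x)))
    return exp(score(pi)) / Z
```

It is a probability precisely because Z sums the same exponentiated scores over every path, so the values P_CRF(π | x) are nonnegative and sum to 1 over π.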
• Slide 26
• Conditional Training
  1. We need to be given a set of sequences x and true parses π
  2. Calculate Z by a sum-of-paths algorithm similar to the HMM forward algorithm; we can then easily calculate P(π | x)
  3. Calculate the partial derivative of log P(π | x) with respect to each parameter w_j (not covered; akin to forward/backward), and update each parameter by gradient ascent!
  4. Continue until convergence to the optimal set of weights
  log P(π | x) = Σ_{j=1..n} w_j F(j, x, π) - log Z is concave in the weights w, so gradient ascent finds the global optimum!
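The gradient used in step 3 has a clean form: ∂/∂w_j log P(π | x) = F(j, x, π) - E_{π' ~ P(·|x)}[F(j, x, π')]. A toy training sketch, computing the expectation by brute force over all paths instead of the forward/backward recursion the slide alludes to; the function name and the learning-rate/iteration parameters are mine:

```python
from itertools import product
from math import exp

def train_crf(x, pi_true, states, features, w, lr=0.5, iters=200):
    """Gradient ascent on log P(pi_true | x) for a transition-feature CRF.
    Gradient: F_j(x, pi_true) - E[F_j] under the current model; the
    expectation is brute-forced over all K^N paths here."""
    def F(j, path):
        return sum(features[j](path[i - 1], path[i], x, i) for i in range(1, len(x)))
    paths = list(product(states, repeat=len(x)))
    for _ in range(iters):
        scores = [exp(sum(w[j] * F(j, p) for j in range(len(w)))) for p in paths]
        Z = sum(scores)
        for j in range(len(w)):
            expected = sum(s * F(j, p) for s, p in zip(scores, paths)) / Z
            w[j] += lr * (F(j, pi_true) - expected)   # ascend the concave objective
    return w
```

On a toy problem where the true parse maximizes a single alternation feature, the weight grows and the conditional probability of the true parse rises toward its achievable maximum.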
• Slide 27
• Conditional Random Fields: Summary
  1. Ability to incorporate complicated, non-local feature sets
     Do away with some independence assumptions of HMMs
     Parsing is still equally efficient
  2. Conditional training
     Train the parameters that are best for parsing, not for modeling
     Need labeled examples: sequences x and true parses π
     (Can train on unlabeled sequences, but it is unreasonable to train too many parameters this way)
     Training is significantly slower: many iterations of forward/backward
