Hidden Markov Models (HMM) · Stefanos Zafeiriou Adv. Statistical Machine Learning (course 495) 1 2...

Stefanos Zafeiriou Adv. Statistical Machine Learning (course 495)

Hidden Markov Models (HMM)

Devise algorithms for computing the posteriors on a Markov Chain with hidden variables (HMM).

𝒙1 𝒙2 𝒙3 𝒙𝑇

Let’s assume we have discrete random variables (e.g., taking 3 discrete

values 𝒙𝑡 = {100,010,001})

𝑝(𝒙𝑡 𝒙1, . . , 𝒙𝑡−1 = 𝑝(𝒙𝑡|𝒙𝑡−1) Markov Property:

Stationary, Homogeneous or Time-Invariant if the distribution 𝑝 𝒙𝑡 𝒙𝑡−1 does not depend on 𝑡

e.g. 𝑝(𝒙𝑡 =100|𝒙𝑡−1 =

Markov Chains with Discrete Random Variables

𝑝(𝒙1, . . , 𝒙𝑇) = 𝑝(𝒙1) 𝑝(𝒙𝑡|𝒙𝑡−1)

𝑡=2

What do we need in order to describe the whole procedure?

(1) A probability for the first frame/timestamp etc 𝑝(𝒙1). In order to

define the probability we need to define the vector 𝝅 =(𝜋1, 𝜋2, … , 𝜋K)

(2) A transition probability 𝑝 𝒙𝑡|𝒙𝑡−1 . In order to define it we need a

𝐾𝑥𝐾 transition matrix 𝑨 = 𝑎𝑖𝑗

𝑝 𝒙1|𝝅 = 𝜋𝑐𝑥1𝒄

𝑐=1

𝑝 𝒙𝑡|𝒙𝑡−1, 𝑨 = 𝑎𝑗𝑘𝑥𝑡−1𝑗𝑥𝑡𝑘

𝑘=1

𝑗=1

Markov Chains with Discrete Random Variables

Ν(𝒙|𝜮1, 𝝁1)

𝛮(𝒙|𝜮2, 𝝁2)

Ν(𝒙|𝜮3, 𝝁3)

𝒛𝑡 =𝑧𝑡1𝑧𝑡2𝑧𝑡3

∈100,010,001

𝑝 𝑿, 𝒁 𝜃

= 𝑝 𝒙1, 𝒙2, ⋯ , 𝒙𝑇 , 𝒛1,𝒛2, ⋯ , 𝒛𝑇|𝜃

= 𝑝(𝒙𝑡|𝒛𝑡, 𝜃𝑥)

𝑡=1

𝑝(𝒛𝑡|𝜃𝑧)

𝑡=1

Latent Variables

Latent Variables in a Markov Chain

𝑝(𝒛1, . . , 𝒛𝑇) = 𝑝(𝒛1) 𝑝(𝒛𝑡|𝒛𝑡−1)

𝑡=2

𝑝 𝑿, 𝒁 𝜃

= 𝑝 𝒙1, 𝒙2, ⋯ , 𝒙𝑇 , 𝒛1,𝒛2, ⋯ , 𝒛𝑇|𝜃

= 𝑝(𝒙𝑡|𝒛𝑡, 𝜃𝑥)

𝑡=1

𝑝(𝒛1) 𝑝(𝒛𝑡|𝒛𝑡−1)

𝑡=2

𝒛𝑇

𝒙𝑇

Each die has 6 sides.

(our latent variable is probably the dice )

Hidden Markov Models

𝒛 = {10,01}

A casino has two dice

One is fair and one is not

𝒙 = {

100000

010000

001000

000100

000010

000001

(1) Fair die 𝑝 𝑥𝑗 = 1|𝑧1 = 1 =1

6 for all 𝑗 = {1, . . , 6}

(2) Loaded die 𝑝 𝑥𝑗 = 1|𝑧2 = 1 =1

10 for 𝑗 = 1, . . , 5 and

𝑝 𝑥6 = 6|𝑧2 = 1 =1

Emission probability

A casino player switches back & forth between fair and loaded die

once every 20 turns

(i.e., 𝑝 𝑧𝑡 1 = 1|𝑧𝑡−1 2 = 1 = 𝑝 𝑧𝑡 2 = 1|𝑧𝑡−1 1 = 1 = 0.05)

Given a string of observations and the above model:

(1) We want to find for a timestamp 𝑡 the probabilities of each die given

the observations that far.

This process is called Filtering: 𝑝 𝒛𝑡 𝒙1, 𝒙2, … , 𝒙𝑡

(2) We want to find for a timestamp 𝑡 the probabilities of each die given

the whole string.

This process is called Smoothing: 𝑝 𝒛𝑡 𝒙1, 𝒙2, … , 𝒙𝑇

664153216162115234653214356634261655234232315142464156663246

(3) Given the observation string find the string of hidden variables that

maximize the posterior.

argmax𝒚1… 𝒚𝑡 𝑝 𝒚1, 𝒚2, … , 𝒚𝑡 𝒙1, 𝒙2, … , 𝒙𝑡

This process is called Decoding (Viterbi).

(4) Find the probability of the model.

𝑝(𝒙1, 𝒙2, … , 𝒙𝑇)

This process is called Evaluation

664153216162115234653214356634261655234232315142464156663246

222222222222221111111222222222222221111111111111111111122222222

Filtering Smoothing Decoding

Taken from Machine Learning: A Probabilistic Perspective by K. Murphy

(5) Prediction

𝑝 𝒛𝑡+𝛿 𝒙1, 𝒙2, … , 𝒙𝑡 𝑝 𝒙𝑡+𝛿 𝒙1, 𝒙2, … , 𝒙𝑡

(6) Parameter estimation (Baum-Welch algorithm)

𝑨,𝝅, 𝜃

𝑝 𝒛𝑡 𝒙1, 𝒙2, … , 𝒙𝑡

𝑝 𝒛𝑡 𝒙1, 𝒙2, … , 𝒙𝑇

argmax𝒚1… 𝒚𝑡 𝑝 𝒚1, … , 𝒚𝑡 𝒙𝟏, … , 𝒙𝒕

𝑝 𝒛𝑡−𝜏 𝒙1, 𝒙2, … , 𝒙𝒕

𝑝 𝒛𝑡+𝛿 𝒙1, 𝒙2, … , 𝒙𝒕

𝑝 𝒙𝑡+𝛿 𝒙1, 𝒙2, … , 𝒙𝒕

𝑝 𝒙1, … , 𝒙𝑡−1 𝒙𝑡 , 𝒛𝑡 = 𝑝 𝒙1, … , 𝒙𝑡−1 𝒛𝑡

𝑝 𝒙1, … , 𝒙𝑡−1 𝒛𝑡−1, 𝒛𝑡 = 𝑝 𝒙1, … , 𝒙𝑡−1 𝒛𝑡−1

𝑝 𝒙𝑡+1, … , 𝒙𝑇 𝒛𝑡 , 𝒛𝑡+1 = 𝑝 𝒙𝑡+1, … , 𝒙𝑇 𝒛𝑡+1

𝑝 𝒙𝑡+2, … , 𝒙𝑇 𝒛𝑡+1, 𝒙𝑡+1 = 𝑝 𝒙𝑡+2, … , 𝒙𝑇 𝒛𝑡+1

𝑝 𝒙1, … , 𝒙𝑇 𝒛𝑡−1, 𝒛𝑡 = 𝑝 𝒙1, … , 𝒙𝑡−1 𝒛𝑡−1 𝑝 𝒙𝑡 𝒛𝑡

𝑝 𝒙𝑡+1, … , 𝒙𝑇 𝒛𝑡

𝑝 𝒙1, … , 𝒙𝑇 𝒛𝑡 = 𝑝 𝒙1, … , 𝒙𝑡 𝒛𝑡

𝑝 𝒙𝑡+1, … , 𝒙𝑇 𝒛𝑡

𝑝 𝒛𝑇+1 𝒛𝑇 , 𝒙1, … , 𝒙𝑇 = 𝑝 𝒛𝑇+1 𝒛𝑇

Smoothing: 𝑝 𝒛𝑡 𝒙1, 𝒙2, … , 𝒙𝑇

𝑝 𝒛𝑡 𝒙1, 𝒙2, … , 𝒙𝑇 =𝑝 𝒙1, 𝒙2, … , 𝒙𝑇 𝒛𝑡 𝑝(𝒛𝑡)

𝑝(𝒙1, 𝒙2, … , 𝒙𝑇)

=𝑝 𝒙1, 𝒙2, … , 𝒙𝑡 𝒛𝑡 𝑝(𝒙𝑡+1, … , 𝒙𝑇|𝒛𝑡)𝑝(𝒛𝑡)

𝑝(𝒙1, 𝒙2, … , 𝒙𝑇)

Filtering: 𝑝 𝒛𝑡 𝒙1, 𝒙2, … , 𝒙𝑡

=𝑝(𝒙1, 𝒙2, … , 𝒙𝑡 , 𝒛𝑡)𝑝(𝒙𝑡+1, … , 𝒙𝑻|𝒛𝑡)

𝑝(𝒙1, 𝒙2, … , 𝒙𝑇)

=𝛼 𝒛𝑡 𝛽(𝒛𝑡)

𝑝(𝒙1, 𝒙2, … , 𝒙𝑇)

Filtering and smoothing

𝛼(𝒛𝑡) = 𝑝(𝒙1, … , 𝒙𝑡 , 𝒛𝑡)

= 𝑝 𝒙1, … , 𝒙𝑡 𝒛𝑡 𝑝(𝒛𝑡)

= 𝑝 𝒙𝑡 𝒛𝑡 𝑝 𝒙1, … , 𝒙𝑡−1 𝒛𝑡 𝑝(𝒛𝑡)

= 𝑝 𝒙𝑡 𝒛𝑡 𝑝(𝒙1, … , 𝒙𝑡−1 , 𝒛𝑡)

= 𝑝 𝒙𝑡 𝒛𝑡 𝑝(𝒙1, … , 𝒙𝑡−1 , 𝒛𝑡 , 𝒛𝑡−1)

𝒛𝑡−1

= 𝑝 𝒙𝑡 𝒛𝑡 𝑝(𝒙1, … , 𝒙𝑡−1 , 𝒛𝑡|𝒛𝑡−1)p(𝒛𝑡−1) 𝒛𝑡−1

= 𝑝 𝒙𝑡 𝒛𝑡 𝑝(𝒙1, … , 𝒙𝑡−1 |𝒛𝑡−1, 𝒛𝑡)p(𝒛𝑡|𝒛𝑡−1)p(𝒛𝑡−1) 𝒛𝑡−1

Forward Probabilities

= 𝑝 𝒙𝑡 𝒛𝑡 𝑝 𝒙1, … , 𝒙𝑡−1 𝒛𝑡−1 p(𝒛𝑡−1)p(𝒛𝑡|𝒛𝑡−1) 𝒛𝑡−1

= 𝑝 𝒙𝑡 𝒛𝑡 𝑝(𝒙1, … , 𝒙𝑡−1 , 𝒛𝑡−1)p(𝒛𝑡|𝒛𝑡−1) 𝒛𝑡−1

= 𝑝 𝒙𝑡 𝒛𝑡 α(𝒛𝑡−1)p(𝒛𝑡|𝒛𝑡−1) 𝒛𝑡−1

Recursive formula and initial condition α(𝒛1)

𝛼 𝒛1 = 𝑝 𝒙1, 𝒛1 = 𝑝 𝒛1 𝑝(𝒙1|𝒛1) = 𝜋𝑘𝑝(𝒙1|𝑧1𝑘)𝑧1𝑘

𝑘=1

𝛼 𝑧𝑡1 = 𝑝(𝒙𝑡|𝑧𝑡1 = 1) 𝑎11𝑎 𝑧𝑡−1 1 + 𝑎21𝑎 𝑧𝑡−1 2

𝑎 𝑧𝑡 2 = 𝑝(𝒙𝑡|𝑧𝑡 2 = 1) 𝑎12𝑎 𝑧𝑡−1 1 + 𝑎22𝑎 𝑧𝑡−1 2

In total 𝑂(22) computation (in general O(𝐾2))

p(𝒙𝑡|𝑧𝑡1 = 1) 1 1

𝑎11

𝑎21

𝛼 𝑧𝑡−1 1 𝛼 𝑧𝑡 1

𝑎 𝑧𝑡−1 2

𝑝(𝒙𝑡|𝑧𝑡 2 = 1)

𝑎12

𝑎22

𝑎 𝑧𝑡−1 1

𝑎 𝑧𝑡−1 2

𝑝 𝒙1, … , 𝒙𝑡 = 𝑝 𝒙1, … , 𝒙𝑡 , 𝒛𝑡𝒛𝑡

= α 𝒛𝑡𝒛𝑡

= α(𝑧𝑡1)+α(𝑧𝑡2)!!

𝑝 𝒛𝑡|𝒙1, … , 𝒙𝑡 =𝑝(𝒙1, … , 𝒙𝑡 , 𝒛𝑡)

𝑝(𝒙1, … , 𝒙𝑡 )=

α(𝒛𝑡)

α(𝒛𝑡)𝒛𝑡

= 𝑎 (𝒛𝑡)

Evaluation: How can we compute 𝑝(𝒙1, … , 𝒙𝑇 )?

𝑝(𝒙1, … , 𝒙𝛵 ) = 𝑝(𝒙1, … , 𝒙𝛵 , 𝒛𝛵)

𝒛𝛵

= α(𝒛𝛵)

𝒛𝛵

Filtering

𝛽 𝒛𝑡 = 𝑝 𝒙𝑡+1, … , 𝒙𝑇 𝒛𝑡

= 𝑝 𝒙𝑡+1, … , 𝒙𝑇 , 𝒛𝑡+1 𝒛𝑡

𝒛𝑡+1

= 𝑝 𝒙𝑡+1, … , 𝒙𝑇 𝒛𝑡, 𝒛𝑡+1 𝑝(𝒛𝑡+1|𝒛𝑡)

𝒛𝑡+1

= 𝑝 𝒙𝑡+1, … , 𝒙𝑇 𝒛𝑡+1 𝑝(𝒛𝑡+1|𝒛𝑡)

𝒛𝑡+1

= 𝑝 𝒙𝑡+2, … , 𝒙𝑇 𝒛𝑡+1 𝑝(𝒙𝑡+1|𝒛𝑡+1)𝑝(𝒛𝑡+1|𝒛𝑡)

𝒛𝑡+1

= 𝛽(𝒛𝑡+1)𝑝(𝒙𝑡+1|𝒛𝑡+1)𝑝(𝒛𝑡+1|𝒛𝑡)

𝒛𝑡+1

Backward probabilities

𝑝(𝒙𝑡|𝑧𝑡+1 1 = 1)

𝑝(𝒙𝑡|𝑧𝑡+1 2 = 1)

𝑎11

𝑎21

𝛽 𝑧𝑡 1 𝛽 𝑧𝑡+1 1

β 𝑧𝑡 2 2

𝛽 𝑧𝑡+1 2

Computing Backward probabilities

𝛽 𝑧𝑡 1 = 𝛽(𝑧𝑡+1 1)𝑎11𝑝 𝒙𝑡 𝑧𝑡+1 1 = 1

+𝛽(𝑧𝑡+1 2)𝑎12𝑝 𝒙𝑡 𝑧𝑡+1 2 = 1

p(𝒙𝑡|𝑧𝑡+1 2 = 1)

𝛽 𝑧𝑡 2 = β(𝑧𝑡+1 1)𝑎21𝑝 𝒙𝑡 𝑧𝑡+1 1 = 1

+𝛽 𝑧𝑡+1 2 𝑎22𝑝 𝒙𝑡 𝑧𝑡+1 2 = 1

𝛽 𝒛𝑇 = 1 𝑝(𝒛𝑇 𝒙1, … , 𝒙𝛵 =𝑝(𝒙1, … , 𝒙𝛵 , 𝒛𝑻)

𝑝(𝒙1, … , 𝒙𝛵 )=

𝛼 𝒛𝑇 ∗ 1

𝑝(𝒙1, … , 𝒙𝛵 )

p(𝒙𝑡|𝑧𝑡+1 1 = 1) 1

𝑎21

𝑎22

𝛽 𝑧𝑡 1

β 𝑧𝑡 2

𝛽 𝑧𝑡+1 1

𝛽 𝑧𝑡+1 2

Computing Backward probabilities

𝑝(𝒛𝑡+2 𝒙1, … , 𝒙𝑡 = 𝑝(𝒛𝑡+2, 𝒛𝑡+1, 𝒛𝑡|𝒙1, … , 𝒙𝑡)

𝒛𝑡𝒛𝑡+1

= 𝑝 𝒛𝑡+2, 𝒛𝑡+1 𝒙1, … , 𝒙𝑡 , 𝒛𝑡 𝑝(𝒛𝑡|𝒙1, … , 𝒙𝑡)

𝒛𝑡𝒛𝑡+1

= 𝑝 𝒛𝑡+2, 𝒛𝑡+1 𝒛𝑡𝒛𝑡𝒛𝑡+1

𝑝(𝒛𝑡|𝒙1, … , 𝒙𝑡)𝑎 𝒛𝑡

= 𝑝(𝒛𝑡+2|𝒛𝑡+1)𝑝(𝒛𝑡+1|𝒛𝑡)𝑎 (𝒛𝑡)

𝒛𝑡𝒛𝑡+1

𝑨𝑇𝑨𝑇𝒂

Prediction

𝑝(𝒙𝑡+2 𝒙1, … , 𝒙𝑡 = 𝑝(𝒙𝑡+2|𝒛𝑡+2) 𝑝(𝒛𝑡+2|𝒙1, … , 𝒙𝑡)

𝒛𝑡+2

ξ 𝒛𝑡−1, 𝒛𝑡 = 𝑝(𝒛𝑡−1, 𝒛𝑡|𝒙1, … , 𝒙𝑇)

=𝑝(𝒙1, … , 𝒙𝑇|𝒛𝑡−1, 𝒛𝑡)𝑝(𝒛𝑡−1, 𝒛𝑡)

𝑝(𝒙1, … , 𝒙𝑇)

=𝑝 𝒙1, … , 𝒙𝑡−1 𝒛𝑡−1 𝑝(𝒙𝑡|𝒛𝑡)𝑝 𝒙𝑡+1, … , 𝒙𝑇 𝒛𝑡 𝑝(𝒛𝑡|𝒛𝑡−1)𝑝(𝒛𝑡−1)

𝑝(𝒙)

=α 𝒛𝑡−1 𝑝 𝒙𝑡 𝒛𝑡 𝑝 𝒛𝑡 𝒛𝑡−1 𝛽(𝒛𝑡)

𝑝(𝒙1, … , 𝒙𝛵)

Smoothed transition

We need to define an EM algorithm

Parameter Estimation (Baum-Welch algorithm)

𝑝(𝒛𝑡−1, 𝒛𝑡|𝒙1, … , 𝒙𝑇)

and we have all the necessary ingredients

𝑝 𝒛𝑡 𝒙1, 𝒙2, … , 𝒙𝑻

Summary

We saw how to perform

(a) Filtering

(b) Smoothing

(c) Evaluation

(d) Prediction

Next we will see how to perform (e) EM and (f) Decoding

Hidden Markov Models (HMM) · Stefanos Zafeiriou Adv. Statistical Machine Learning (course 495) 1 2...

Documents

Discrete Fundamental Theorem of Surfacesytong/DDG/DFF.pdf · Discrete Fundamental Theorems • Local discrete fundamental theorem of surfaces – Discrete integrability condition:

Workshop 2: Recurrence Relations - Computer Science · 2020. 9. 1. · Recurrence Relations . Where does it end? ... Let’s solve the recurrence! 𝑇 =5+𝑇( 2) 𝑇1=1 5+𝑇(

Discrete Maths - Imperial College Londonyg/Discrete/notes.pdf · Discrete Maths Philippa Gardner ... Discrete Mathematics and its Applications, McGraw Hill ... J.K. Truss. Discrete

Impact characteristics of liquid nitrogen droplets...ሶ / ሶ • ሶ ~𝐴 ß 𝑇 −𝑇 𝛼𝑙 ç • ሶ ~𝐿 ሶ ≈𝐿 0.2( − æ)[1] 𝑄ሶ 𝑄ሶ 𝑎 ~𝐴 𝐿 𝑙

Discrete Probabilities4-Discrete Probabilities - Student

1 Discrete Structures & Algorithms Discrete Probability

Discrete Fourier Series & Discrete Fourier Transform

Discrete adjoint for CFD - ONERA€¦ · Discrete gradient calculation method Outline 1 Discrete gradient calculation method 2 Discrete gradient methods, intuition and checks 3 Discrete

PowerPoint Presentationaada/courses/15251f15/www/slides/lec17.pdfSki rental ≥8. 1. 𝐺𝐼= t⋅ 𝑇𝐼 2. 𝐺𝐼= u⋅ 𝑇(𝐼) 3. 𝐺𝐼= 𝐵 2 ⋅ 𝑇(𝐼) 4. 𝐺𝐼=

Máquina de Atwood¡quina de Atwood 𝐓 𝐓 𝑚1𝐠 𝑚2𝐠 𝑚1 −𝑇=𝑚1𝑎1 𝑇−𝑚2 =𝑚2𝑎2 𝑎1=𝑎2=𝑎 𝑎= (𝑚1−𝑚2) (𝑚1+𝑚2) 𝑇= ... 3a

Automatic Construction Of Robust Spherical Harmonic Subspaces · Automatic Construction Of Robust Spherical Harmonic Subspaces Patrick Snape Yannis Panagakis Stefanos Zafeiriou Imperial

Course 495: Advanced Statistical Machine Learning/Pattern Recognition · 2015-11-12 · Stefanos Zafeiriou Adv. Statistical Machine Learning (course 495) • Pattern Recognition &

22C:19 Discrete Math Discrete Probability

Causality detection methods for time series analysis€¦ · 2 stochastic processes {X t}, {Y t} Transfer Entropy (from Y to X) 𝑇 → does not necessarily equal to 𝑇 → infer

Expectation Maximization (Intuition) Expectation ... Maximization (Intuition) Expectation Maximization (Maths) 1 Stefanos Zafeiriou Adv. Statistical Machine Learning (course 495) •

Face Flow - Imperial College Londonaroussos/...FaceFlow_ICCV15.pdf · Face Flow Patrick Snape Anastasios Roussos Yannis Panagakis Stefanos Zafeiriou Department of Computing, Imperial

MAGMA: Multi-level acceleratedgradientmirror descentalgorithm for large ... · 2016. 10. 5. · Vahan Hovhannisyan∗, Panos Parpas∗, and Stefanos Zafeiriou∗ Abstract. Composite

Virtual University of Pakistan Pakistan...1 𝑇 Energy of S.H.M If T is the kinetic energy, V the potential energy and = 𝑇 + the total energy of a simple harmonic oscillator, then

Introductory discrete math. Discrete math definition

Mfg WIP Advisor Webcast 2013 1030Understanding Discrete Job ClosureUnderstanding Discrete Job ClosureUnderstanding Discrete Job ClosureUnderstanding Discrete Job ClosureUnderstanding