
HMM by Zaheer Ahmad


Page 1: HMM by Zaheer Ahmad

Hidden Markov Model

Zaheer Ahmad

PhD Scholar

[email protected]

Department of Computer Science University of Peshawar

1/20/2011

Page 2: HMM by Zaheer Ahmad

AGENDA

• Markov Process/Model/Chain

• Orders of Markov Models

• Working of Markov Models

• Hidden Markov Model as a doubly stochastic process

• How HMMs are used in NLP and DIP

• MATLAB Toolbox for Markov Model

Page 3: HMM by Zaheer Ahmad

Markov processes

Introduction

• Markov processes are examples of stochastic processes—processes that generate random sequences of outcomes or states according to certain probabilities.

• Markov processes are distinguished by being memoryless—their next state depends only on their current state, not on the history that led them there.

Page 4: HMM by Zaheer Ahmad

Example

• A game of snakes and ladders or any other game whose moves are determined entirely by dice is a Markov chain.

• The next move depends only on the current square, not on how the game reached it.

Page 5: HMM by Zaheer Ahmad

Usage: Sequential Data

• Sequential data consists of samples that are not independent of each other.

• Markov chains and hidden Markov models are probably the simplest models that can be used to model sequential data.

Page 6: HMM by Zaheer Ahmad

Markov Model

• In probability theory, a Markov model is a stochastic model that assumes the Markov property.

• In other words, it is a mathematical abstraction of a Markov process.

Page 7: HMM by Zaheer Ahmad

Markov Chain

• Often, the term "Markov chain" is used to mean a Markov process which has a discrete (finite or countable) state-space.

Here a state space is a directed graph in which each possible state of a dynamical system is represented by a vertex, and there is a directed edge from a to b if and only if f(a) = b, where the function f defines the dynamical system.

• A Markov chain is thus a random process with the Markov property, described next.

Page 8: HMM by Zaheer Ahmad

Markov property

• The Markov property states that the conditional probability distribution for the system at the next step (and in fact at all future steps) given its current state depends only on the current state of the system, and not additionally on the state of the system at previous steps.

Page 9: HMM by Zaheer Ahmad

Discrete Time random process

• A "discrete-time" random process means a system which is in a certain state at each "step", with the state changing randomly between steps.

• The steps are often thought of as time, but they can equally well refer to physical distance or any other discrete measurement;

• formally, the steps are just the integers or natural numbers, and the random process is a mapping of these to states.

Page 10: HMM by Zaheer Ahmad

Markov Models / Chains

• Markov chains are mathematical descriptions of Markov models with a discrete set of states. Markov chains are characterized by:

• A set of states {1, 2, ..., M}

• An M-by-M transition matrix T whose i,j entry is the probability of a transition from state i to state j. The sum of the entries in each row of T must be 1, because this is the sum of the probabilities of making a transition from a given state to each of the other states.

• A set of possible outputs, or emissions, {s1, s2, ... , sN}. By default, the set of emissions is {1, 2, ... , N}, where N is the number of possible emissions, but you can choose a different set of numbers or symbols.

• An M-by-N emission matrix E whose i,k entry gives the probability of emitting symbol sk given that the model is in state i.

• Markov chains begin in an initial state i0 at step 0. The chain then transitions to state i1 with probability T(i0, i1) and emits an output s_k1 with probability E(i1, k1). Consequently, the probability of observing the state sequence i0, i1, ..., ir and the emission sequence s_k1, s_k2, ..., s_kr in the first r steps is T(i0, i1) E(i1, k1) T(i1, i2) E(i2, k2) ... T(i(r-1), ir) E(ir, kr).
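The generation procedure just described can be sketched in Python. This is a minimal illustration, not the MATLAB toolbox code; the model (two states, six emission symbols, e.g. a fair and a loaded die) and the helper names are assumptions for the sketch:

```python
import random

def generate(T, E, steps, start=0, seed=0):
    """Sample a state/emission sequence from an M-by-M transition
    matrix T and an M-by-N emission matrix E (rows sum to 1)."""
    rng = random.Random(seed)

    def sample(dist):
        # Draw an index according to the probability row `dist`.
        r, acc = rng.random(), 0.0
        for idx, p in enumerate(dist):
            acc += p
            if r < acc:
                return idx
        return len(dist) - 1

    states, emissions = [], []
    state = start
    for _ in range(steps):
        state = sample(T[state])            # transition, as in the text
        emissions.append(sample(E[state]))  # then emit from the new state
        states.append(state)
    return states, emissions

# Hypothetical 2-state, 6-symbol model (illustrative numbers only)
T = [[0.9, 0.1], [0.05, 0.95]]
E = [[1/6] * 6, [7/12] + [1/12] * 5]
states, seq = generate(T, E, 100)
```

Every row of T and E sums to 1, so `sample` always returns a valid index.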

Page 11: HMM by Zaheer Ahmad

Markov decision process

• A Markov decision process is a Markov chain in which state transitions depend on the current state and an action vector that is applied to the system. Typically, a Markov decision process is used to compute a policy of actions that will maximize some utility with respect to expected rewards. It is closely related to Reinforcement learning, and can be solved with value iteration and related methods.

Page 12: HMM by Zaheer Ahmad

Order 0 Markov Models

• The simplest Markov model is an order 0 (zeroth-order) process, where the choice of state is made independently of any previous state, according to a fixed probability distribution.

• Notice this is still not a deterministic system, since the choice is made probabilistically, not deterministically.

Page 13: HMM by Zaheer Ahmad

Order 1 Markov Models

• An order 1 (first-order) Markov model has a memory of size 1. It is defined by a table of probabilities pr(xt=Si | xt-1=Sj), for i = 1..k & j = 1..k. You can think of this as k order 0 Markov models, one for each Sj.

Page 14: HMM by Zaheer Ahmad

Order m Markov Models

• The order of a fixed-order Markov model is the length of the history, or context, upon which the probabilities of the possible values of the next state depend.

For example,

• the next state of an order 2 (second-order) Markov Model depends upon the two previous states.
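An order-m model can be estimated from data simply by counting contexts of length m. A minimal Python sketch; the function name and the toy sequence are made up for illustration:

```python
from collections import defaultdict

def fit_order_m(seq, m):
    """Estimate an order-m Markov model by counting length-m contexts."""
    counts = defaultdict(lambda: defaultdict(int))
    for t in range(m, len(seq)):
        ctx = tuple(seq[t - m:t])
        counts[ctx][seq[t]] += 1
    # Normalize counts into conditional probabilities P(next | context).
    model = {}
    for ctx, nxt in counts.items():
        total = sum(nxt.values())
        model[ctx] = {s: c / total for s, c in nxt.items()}
    return model

model = fit_order_m("abcabcabc", 2)
# In this toy sequence, "ab" is always followed by "c",
# so model[("a", "b")] == {"c": 1.0}
```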

Page 15: HMM by Zaheer Ahmad

How Markov Process Works

• Consider a town whose all-day weather can be sunny (S), cloudy (C), or rainy (R).

• From the history of the weather of the town under investigation we have the following table.

Page 16: HMM by Zaheer Ahmad

• Table-1 shows the probability of tomorrow's weather being in a certain state, given today's condition.

• The probabilities in each row of the table sum to 1.

Page 17: HMM by Zaheer Ahmad

• Assume that tomorrow's weather depends only on today's condition, consistent with a first-order Markov chain.

• We refer to the weather condition by a state q that is sampled at instant t.

• The problem is to find the probability of tomorrow's weather condition given today's: P(qt+1 | qt).

Page 18: HMM by Zaheer Ahmad

• An acceptable approximation to the full n-instant history is:

P(qt+1 | qt, qt-1, qt-2, ..., qt-n) ≈ P(qt+1 | qt)

Page 19: HMM by Zaheer Ahmad

• Given that today is sunny (S), what is the probability that the following five days are S, C, C, R, and S under the above model?

• The answer comes from the following formula, using a first-order Markov chain:

• P(q1=S, q2=S, q3=C, q4=C, q5=R, q6=S) = P(q1=S) · P(q2=S|q1=S) · P(q3=C|q2=S) · P(q4=C|q3=C) · P(q5=R|q4=C) · P(q6=S|q5=R) = 1 × 0.7 × 0.2 × 0.8 × 0.15 × 0.15 = 0.00252
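The chain of multiplications can be checked numerically. The factors below are taken directly from the slide's formula; nothing else is assumed:

```python
# Conditional probabilities from the slide's formula:
# P(S|S), P(C|S), P(C|C), P(R|C), P(S|R)
factors = [0.7, 0.2, 0.8, 0.15, 0.15]

prob = 1.0  # P(q1 = S) = 1, since today is known to be sunny
for f in factors:
    prob *= f
print(prob)  # ≈ 0.00252
```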

Page 20: HMM by Zaheer Ahmad

Finite state representation of weather forecast problem

Page 21: HMM by Zaheer Ahmad

Transition Matrix and its Calculations

With the 3-state transition matrix

T = | 0.50 0.25 0.25 |
    | 0.50 0.00 0.50 |
    | 0.25 0.25 0.50 |

the two-step transition probabilities are the entries of T². For example:

T²(1,1) = 0.5×0.5 + 0.25×0.5 + 0.25×0.25 = 0.4375
T²(1,2) = 0.5×0.25 + 0.25×0 + 0.25×0.25 = 0.1875
T²(1,3) = 0.5×0.25 + 0.25×0.5 + 0.25×0.5 = 0.375
T²(2,1) = 0.5×0.5 + 0×0.5 + 0.5×0.25 = 0.375

In general, if a Markov chain has r states and transition matrix T, the probability of moving from state i to state j in exactly n steps is the (i, j) entry of Tⁿ.
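The products on this slide are entries of T², where T is the 3-state transition matrix implied by that arithmetic (the full matrix is a reconstruction, i.e. an assumption). A short Python check using plain lists:

```python
def mat_mul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]

# Transition matrix reconstructed from the slide's arithmetic (assumed):
T = [[0.50, 0.25, 0.25],
     [0.50, 0.00, 0.50],
     [0.25, 0.25, 0.50]]

T2 = mat_mul(T, T)  # (i, j) entry = probability of going i -> j in 2 steps
# First row ≈ [0.4375, 0.1875, 0.375]; each row still sums to 1.
```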

Page 22: HMM by Zaheer Ahmad

Hidden Markov Model (HMM)

• In HMM we observe a sequence of emissions, but do not know the sequence of states the model went through to generate the emissions. Analyses of hidden Markov models seek to recover the sequence of states from the observed data.

Page 23: HMM by Zaheer Ahmad

State-emission HMM

Two kinds of parameters:

• Transition probability: P(sj | si)

• Output (emission) probability: P(wk | si)

[Diagram: a chain of states s1, s2, ..., sN, each emitting observations such as w1, w3, w4, w5]

Page 24: HMM by Zaheer Ahmad

Hidden Markov model (HMM)

• The notion of "hidden" implies a doubly stochastic process.

• More precisely, the HMM is a probabilistic pattern-matching technique in which the observations are considered to be the output of a stochastic process whose underlying (hidden) state sequence is a Markov chain.

• It has two components: a finite-state Markov chain and a finite set of output probability distributions.

Page 25: HMM by Zaheer Ahmad

Rabiner's Example (a simplification of the genie example)

• Assume that we have two persons: one doing an experiment, and an outside observer.

• Consider N urns (states), numbered S1 to SN.

• In each urn there are M coloured balls (observations), distributed in different proportions.

• Each urn also has a black bag containing 100 counters, each carrying one of three numbers.

• These numbers are the current urn number Si and the following two urn numbers Si+1 and Si+2, in proportions of 0.8, 0.15, and 0.05 respectively.

Page 26: HMM by Zaheer Ahmad

• The counters in the bag belonging to the urn just before the last carry only one of two numbers, SN-1 and SN, with probabilities 0.9 and 0.1 respectively.

• We assume that the starting urn (state) is always urn 1 (S1) and that we end up in urn N (SN).

• The last urn needs no bag, since we stay there once we reach it until the end of the experiment.

• We start the experiment at time t = 1 by drawing a ball from urn 1, registering its colour, and returning it to the urn.

• Then we draw a counter from the corresponding urn's bag.

• The possible numbers on the counters are: 1 (stay in urn 1), 2 (move to the next urn), or 3 (jump to the third urn).

Page 27: HMM by Zaheer Ahmad

• We continue with the same procedure, drawing a counter and then a ball from the corresponding urn and registering the ball's colour, until we reach state N and stay there until the end of the experiment at instant T.

• The outcome of this experiment is a series of coloured balls (observations), which can be considered a sequence of events governed by the probability distribution of the balls inside each urn and by the counters in each bag.

• The outside observer has no idea which urn any ball was drawn from (the hidden states); all he knows is the observation sequence of coloured balls (the observations).

Page 28: HMM by Zaheer Ahmad

Some Conclusions

• Several things could be concluded from this experiment :

1 – The starting urn is always urn1 (S1).

2 – The urn which has been left cannot be visited again (i.e. movement is from left to right only).

3 – Movements are either by one or two urns to the right.

4 – The last urn visited is always urnN (SN).

Page 29: HMM by Zaheer Ahmad

• A chain of 5 urns (states) is shown in Fig. 2.

Page 30: HMM by Zaheer Ahmad
Page 31: HMM by Zaheer Ahmad

The principal cases of HMM

• There are three main cases to be dealt with to formulate a successful HMM

Case 1: Evaluation

Case 2: Decoding

Case 3: Training

Page 32: HMM by Zaheer Ahmad

Case 1: Evaluation

Given:

• a model λ = (A, B, π) ready to be used;

• a testing observation sequence O = O1, O2, O3, ..., OT-1, OT.

Action:

• compute P(O | λ), the probability of the observation sequence given the model;

• i.e. account for all possible state paths and the probability of each.
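Summing over all state paths naively costs O(N^T); the standard forward recursion computes the same P(O | λ) in O(T·N²). A minimal Python sketch with a hypothetical 2-state model (all numbers are illustrative assumptions, not from the slides):

```python
def forward(pi, A, B, obs):
    """P(O | lambda): sum over all state paths via the forward recursion."""
    n = len(pi)
    # alpha[i] = P(o_1 .. o_t, q_t = i) after processing t observations
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    for o in obs[1:]:
        alpha = [B[j][o] * sum(alpha[i] * A[i][j] for i in range(n))
                 for j in range(n)]
    return sum(alpha)

pi = [0.6, 0.4]                # initial distribution (assumed)
A = [[0.7, 0.3], [0.4, 0.6]]   # transition probabilities (assumed)
B = [[0.5, 0.5], [0.1, 0.9]]   # emission probabilities (assumed)
p = forward(pi, A, B, [0, 1])  # ≈ 0.2156
```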

Page 33: HMM by Zaheer Ahmad

Case 2: Decoding

Given:

• a model λ = (A, B, π) ready to be used;

• a testing or training observation sequence O = O1, O2, O3, ..., OT-1, OT.

Action:

• track the optimum state sequence Q = q1, q2, q3, ..., qT-1, qT that most likely produced the given observations, using the given model;

• i.e. find the maximum probability along the most probable state-sequence path for the given observation sequence.
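Decoding is classically solved by the Viterbi algorithm, which replaces the forward recursion's sum with a max and keeps back-pointers. A minimal sketch with the same kind of made-up 2-state toy model:

```python
def viterbi(pi, A, B, obs):
    """Most likely state sequence for obs under model (pi, A, B)."""
    n = len(pi)
    delta = [pi[i] * B[i][obs[0]] for i in range(n)]
    back = []                       # back-pointers, one list per step
    for o in obs[1:]:
        step, new_delta = [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: delta[i] * A[i][j])
            step.append(best_i)
            new_delta.append(delta[best_i] * A[best_i][j] * B[j][o])
        back.append(step)
        delta = new_delta
    # Trace back from the best final state.
    state = max(range(n), key=lambda i: delta[i])
    path = [state]
    for step in reversed(back):
        state = step[state]
        path.insert(0, state)
    return path

pi = [0.6, 0.4]                # illustrative model parameters (assumed)
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.5], [0.1, 0.9]]
best = viterbi(pi, A, B, [0, 1])  # → [0, 0] for this toy model
```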

Page 34: HMM by Zaheer Ahmad

Case 3: Training

The training procedure optimizes the model parameters to obtain the model that best represents a given set of observations.

Page 35: HMM by Zaheer Ahmad

Baum-Welch (Forward–Backward) Algorithm

• It is an iterative method that climbs to a local maximum of the probability function P(O | λ).

• The procedure always converges, but convergence to the global maximum cannot be assured.
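One Baum-Welch re-estimation step can be written directly from the forward and backward variables. Below is a compact Python sketch for a single short observation sequence (no scaling, so it is only suitable for short sequences); the 2-state toy model is a made-up assumption:

```python
def forward_probs(pi, A, B, obs):
    """alpha[t][i] = P(o_1 .. o_t, q_t = i)."""
    n = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(n)]]
    for t in range(1, len(obs)):
        alpha.append([B[j][obs[t]] * sum(alpha[t - 1][i] * A[i][j]
                                         for i in range(n))
                      for j in range(n)])
    return alpha

def backward_probs(A, B, obs, n):
    """beta[t][i] = P(o_{t+1} .. o_T | q_t = i)."""
    beta = [[1.0] * n for _ in obs]
    for t in range(len(obs) - 2, -1, -1):
        for i in range(n):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                             for j in range(n))
    return beta

def baum_welch_step(pi, A, B, obs):
    """One EM re-estimation of (pi, A, B); likelihood never decreases."""
    n, T = len(pi), len(obs)
    alpha = forward_probs(pi, A, B, obs)
    beta = backward_probs(A, B, obs, n)
    p = sum(alpha[-1])  # P(O | lambda)
    gamma = [[alpha[t][i] * beta[t][i] / p for i in range(n)]
             for t in range(T)]
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / p
            for j in range(n)] for i in range(n)] for t in range(T - 1)]
    new_pi = gamma[0][:]
    new_A = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1))
              for j in range(n)] for i in range(n)]
    m = len(B[0])
    new_B = [[sum(g[j] for t, g in enumerate(gamma) if obs[t] == k) /
              sum(g[j] for g in gamma)
              for k in range(m)] for j in range(n)]
    return new_pi, new_A, new_B

# Toy 2-state, 2-symbol model (all numbers are illustrative assumptions)
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.5], [0.1, 0.9]]
obs = [0, 1, 1, 0, 1]
pi2, A2, B2 = baum_welch_step(pi, A, B, obs)
```

The EM property mentioned on the slide can be observed directly: the likelihood of the sequence under the re-estimated parameters is at least as high as under the original ones.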

Page 36: HMM by Zaheer Ahmad

Uses in NLP and DIP

• Sounds / phonemes (as a sequence) as states

– characters / words as observations

• Face recognition: top-to-bottom sequence of facial regions as states

– different shapes / structures as observations

Page 37: HMM by Zaheer Ahmad

MATLAB toolbox for HMM

• Generating a Test Sequence

• The following commands create the transition and emission matrices:

TRANS = [.9 .1; .05 .95];

EMIS = [1/6, 1/6, 1/6, 1/6, 1/6, 1/6; ...
        7/12, 1/12, 1/12, 1/12, 1/12, 1/12];

• To generate a random sequence of states and emissions from the model, use hmmgenerate:

• [seq,states] = hmmgenerate(1000,TRANS,EMIS);

Page 38: HMM by Zaheer Ahmad

• Estimating the State Sequence

• Given the transition and emission matrices TRANS and EMIS, the function hmmviterbi uses the Viterbi algorithm to compute the most likely sequence of states the model would go through to generate a given sequence seq of emissions:

• likelystates = hmmviterbi(seq, TRANS, EMIS);

Page 39: HMM by Zaheer Ahmad

• Using hmmtrain: if you do not know the sequence of states (the vector states above), but you have initial guesses for TRANS and EMIS, you can still estimate TRANS and EMIS using hmmtrain.

Page 40: HMM by Zaheer Ahmad

Conclusion

• Advantages

– Has contributed quite a bit to speech recognition

– With the algorithms we have described, computation is reasonable

– Complex processes can be modeled with low-dimensional data

– Works well for time-varying classification (other examples: gesture recognition, formant tracking)

• Limitations

– Assumption that successive observations are independent

– First-order assumption: the probability of the state at time t depends only on the state at time t-1

– Needs to be "tailor made" for each specific application

– Needs lots of training data, in order to see all observations

Page 41: HMM by Zaheer Ahmad

Thank You

Page 42: HMM by Zaheer Ahmad

References

• L. R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition".

• W. H. Abdulla and N. K. Kasabov, "The Concepts of Hidden Markov Model in Speech Recognition", Technical Report TR99/09, Knowledge Engineering Lab, Information Science Department, University of Otago, New Zealand.

• A. V. Nefian, "Face Detection and Recognition using HMM".

• http://www.mathworks.com/help/toolbox/stats/f8368.html

• Wikipedia