MCMC - Markov Chain Monte Carlo: One of the top ten algorithms of the 20th century

Markov Chain Monte Carlo (MCMC)

Presented by:

Monzur Morshed, Habibur Rahman

TigerHATS
www.tigerhats.org

The International Research group dedicated to Theories, Simulation and Modeling, New Approaches, Applications, Experiences, Development, Evaluations, Education, Human, Cultural and Industrial Technology

TigerHATS - Information is power

• Markov Chain Monte Carlo: Markov chain process + Monte Carlo integration

• MCMC: a method for random sampling

• The Markov Chain Monte Carlo (MCMC) method is considered to be one of the top ten algorithms of the 20th century

• The goal of MCMC is to sample x with probability proportional to the distribution function π(x)


Markov Chain Monte Carlo

Markov Chain Monte Carlo methods generate a Markov chain of points that converges to a distribution of interest.

“Monte Carlo” : The methods employ randomness.


The basic idea of MCMC is:

• Construct a Markov chain such that:
  • the parameters form the state space, and
  • the stationary distribution is the posterior probability distribution of the parameters

• Simulate the chain

• Treat the realization as a sample from the posterior probability distribution

MCMC = sampling + continue search


• What is a Markov chain?

• A Markov chain is a mathematical model for a stochastic system that generates random variables X1, X2, …, Xt.

• The distribution of the next random variable depends only on the current random variable:

$$p(x_t \mid x_1, x_2, \ldots, x_{t-1}) = p(x_t \mid x_{t-1})$$

• The entire chain represents a stationary probability distribution.

[Diagram: chain of states x_{t-1} → x_t → x_{t+1}]

MCMC is a general-purpose technique for generating fair samples from a probability distribution in a high-dimensional space, using random numbers (dice) drawn from a uniform probability distribution over a certain range.

[Diagram: (hidden) Markov chain states x_{t-1} → x_t → x_{t+1}, with x_t ~ p(x), each driven by independent trials of dice z_{t-1}, z_t, z_{t+1}, with z_t ~ unif[a, b]]

Monte Carlo Methods

• Stochastic techniques (non-deterministic behavior), based on the use of random numbers and probability statistics to investigate problems

• Large system → random configurations, data → describe the whole system

• "Hit and miss" integration is the simplest type
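As an illustration of "hit and miss" integration, the sketch below estimates π by throwing uniform random points into the unit square and counting how many land inside the quarter circle. The function name `hit_and_miss_pi`, the sample count, and the seed are illustrative choices, not part of the original slides.

```python
import random

def hit_and_miss_pi(n_samples, seed=0):
    """Estimate pi with hit-and-miss integration (illustrative sketch)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        # Draw a point uniformly in the unit square [0, 1) x [0, 1).
        x, y = rng.random(), rng.random()
        # A "hit" is a point inside the quarter circle of radius 1.
        if x * x + y * y <= 1.0:
            hits += 1
    # hits / n_samples approximates the quarter-circle area pi / 4.
    return 4.0 * hits / n_samples

estimate = hit_and_miss_pi(100_000)
print(estimate)
```

The error shrinks roughly like 1/sqrt(N), so each extra digit of accuracy costs about 100 times more samples.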

The Monte Carlo principle

p(x): a target density defined over a high-dimensional space (e.g. the space of all possible configurations of a system under study)

The idea of Monte Carlo techniques is to draw a set of (iid) samples {x^(1), …, x^(N)} from p in order to approximate p with the empirical distribution

$$p_N(x) = \frac{1}{N} \sum_{i=1}^{N} \delta_{x^{(i)}}(x)$$

Using these samples we can approximate integrals I(f) (or very large sums) with tractable sums I_N(f) that converge (as the number of samples grows) to I(f)

$$I_N(f) = \frac{1}{N} \sum_{i=1}^{N} f(x^{(i)}) \xrightarrow[N \to \infty]{} I(f) = \int f(x)\, p(x)\, dx$$

iid: Independent and identically distributed random variables
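The Monte Carlo principle can be sketched in a few lines: draw iid samples from p and average f over them. In this example p is assumed to be a standard normal and f(x) = x², so the exact value of I(f) is 1. The helper name `monte_carlo_expectation` and the choice of p and f are assumptions made for illustration only.

```python
import random

def monte_carlo_expectation(f, sampler, n_samples):
    """Approximate I(f) = integral of f(x) p(x) dx by a sample average.

    `sampler` is any zero-argument function returning one iid draw from p.
    """
    total = 0.0
    for _ in range(n_samples):
        total += f(sampler())
    return total / n_samples

rng = random.Random(1)

# Assumed target: p = standard normal, f(x) = x^2, so I(f) = Var(x) = 1.
estimate = monte_carlo_expectation(
    f=lambda x: x * x,
    sampler=lambda: rng.gauss(0.0, 1.0),
    n_samples=200_000,
)
print(estimate)
```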

Monte Carlo principle

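• Given a very large set X and a distribution p(x) over it

• We draw a set of N i.i.d. samples

• We can then approximate the distribution using these samples

$$p_N(x) = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}(x^{(i)} = x) \xrightarrow[N \to \infty]{} p(x)$$

[Figure: the set X with the distribution p(x) over it]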


How to build the Markov chain

Surprisingly, there are many ways to construct a Markov chain with stationary distribution π.

Perhaps the simplest is the Metropolis-Hastings algorithm.

Markov Chain Monte Carlo

• Draw random numbers from the posterior distribution

• Each number depends on the previous one

• Start from an arbitrary value

• Simulation “finds” the posterior distribution and provides random numbers from it

• Advantage: very complex models can be analyzed

• Disadvantage: the length of the searching phase is difficult to identify

How MCMC works

Key idea is to construct a discrete time Markov chain X1, X2, X3, … on state space S whose stationary distribution is π.

If P(y, dx) is the transition kernel of the chain, this means that

$$\pi(dx) = \int_S \pi(dy)\, P(y, dx)$$

How MCMC works (2)

Subject to some technical conditions,

Distribution of Xn → π as n → ∞

Thus, to obtain samples from π, we simulate the chain and sample from it after a “long time”.

Markov Process: Simple Example

Suppose that an orange juice company controls 20% of the OJ market.

Suppose they hire a market research company to predict the effect of an aggressive ad campaign.

Suppose they conclude:

• Someone using Brand A will stay with Brand A with 90% probability

• Someone NOT using Brand A will switch to Brand A with 70% probability

Customers buy orange juice (OJ) once a week.

• A = uses Brand A

• A′ = uses another brand

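Transition diagram: A → A with probability 0.9, A → A′ with 0.1, A′ → A with 0.7, A′ → A′ with 0.3

The transition matrix (rows: current state A, A′; columns: next state A, A′):

$$P = \begin{pmatrix} 0.9 & 0.1 \\ 0.7 & 0.3 \end{pmatrix}$$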

Markov Chain

Initial state distribution matrix S0:

S0 = [0.20 0.80]

What is the probability of using Brand A after 1 week?

$$S_1 = S_0 P = \begin{pmatrix} 0.20 & 0.80 \end{pmatrix} \begin{pmatrix} 0.9 & 0.1 \\ 0.7 & 0.3 \end{pmatrix} = \begin{pmatrix} 0.74 & 0.26 \end{pmatrix}$$

S1 is the first state matrix. After 1 week, Brand A is expected to control 74% of the OJ market.
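The orange juice example can be checked numerically. The sketch below multiplies the initial distribution S0 = [0.20 0.80] by the transition matrix P (rows 0.9, 0.1 and 0.7, 0.3) to get S1, then keeps stepping to show the chain settling on a stationary distribution. The helper name `step` and the 50-week horizon are illustrative choices.

```python
def step(state, P):
    """Advance the distribution one step: row vector times transition matrix."""
    n_next = len(P[0])
    return [
        sum(state[i] * P[i][j] for i in range(len(state)))
        for j in range(n_next)
    ]

# Transition matrix from the example: rows are the current state (A, A'),
# columns are the next state (A, A').
P = [[0.9, 0.1],
     [0.7, 0.3]]

S0 = [0.20, 0.80]   # initial market share of Brand A and other brands
S1 = step(S0, P)    # distribution after 1 week
print(S1)

# Iterating further, the distribution stops changing: it approaches the
# stationary distribution pi, which satisfies pi = pi P.
state = S0
for week in range(50):
    state = step(state, P)
print(state)
```

Solving π = πP by hand gives π = [0.875 0.125], which is the limit the loop approaches regardless of the starting distribution.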

MCMC algorithms

• Metropolis-Hastings algorithm

• Metropolis algorithm
  • Mixtures and blocks

• Gibbs sampling

• Rejection Sampling

• Random Sampling

• Sequential Monte Carlo

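Metropolis-Hastings

Metropolis-Hastings is an MCMC method that can sample from any distribution P, using a proposal distribution Q(x’; x).

• Initialize with a random x.

• Generate a new proposal position x’ according to Q(x’; x).

• Compute α = min( (P(x’) Q(x; x’)) / (P(x) Q(x’; x)), 1 ) and accept the change with probability α; otherwise keep the current x. For a symmetric proposal, Q(x’; x) = Q(x; x’), this reduces to α = min( P(x’) / P(x), 1 ), the Metropolis algorithm.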
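A minimal sketch of these steps, assuming a symmetric Gaussian random-walk proposal, for which the acceptance probability is α = min(P(x’) / P(x), 1). The target here is assumed to be an unnormalized standard normal, worked with in log space for numerical stability. The names `metropolis_hastings` and `log_target`, the step size, and the burn-in length are illustrative assumptions, not taken from the slides.

```python
import math
import random

def metropolis_hastings(log_p, x0, n_steps, step_size=1.0, seed=0):
    """Random-walk Metropolis sampler (symmetric proposal), as a sketch."""
    rng = random.Random(seed)
    x = x0
    samples = []
    for _ in range(n_steps):
        # Propose x' from Q(x'; x): a Gaussian centred on the current x.
        x_new = x + rng.gauss(0.0, step_size)
        # log of alpha = min(P(x') / P(x), 1); P only needs to be known
        # up to a constant because the constant cancels in the ratio.
        log_alpha = min(0.0, log_p(x_new) - log_p(x))
        # Accept with probability alpha, otherwise stay at x.
        if rng.random() < math.exp(log_alpha):
            x = x_new
        samples.append(x)
    return samples

def log_target(x):
    """Assumed target: unnormalized standard normal, log P(x) = -x^2 / 2."""
    return -0.5 * x * x

samples = metropolis_hastings(log_target, x0=5.0, n_steps=50_000)
kept = samples[5_000:]  # discard the early "searching phase" (burn-in)
mean = sum(kept) / len(kept)
var = sum((s - mean) ** 2 for s in kept) / len(kept)
print(mean, var)
```

Starting far from the mode (x0 = 5) shows why early samples are usually discarded: the chain needs some steps before it reflects the target distribution.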

Gibbs Sampling

Gibbs sampling is a special case of Metropolis-Hastings sampling in which the proposed move is always accepted.

In Gibbs sampling for multivariate distributions, only one parameter is changed at a time.

This makes Gibbs sampling particularly useful for multivariate distributions.

The Gibbs Sampler

Geman and Geman 1984, Gelfand and Smith 1990

Consider a random vector $X = (X_1, X_2, \ldots, X_k)$ with distribution $\pi(X)$.

Suppose that the full set of conditional distributions $\pi(x_i \mid x_{-i})$ is available, where $x_{-i} = (x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_k)$.

The Gibbs Sampler

Further suppose that these distributions can be sampled from.

Start at some value $x^0 = (x_1^0, x_2^0, \ldots, x_k^0)$.

The algorithm:

Sample $x_1^1$ from $\pi(x_1 \mid x_2^0, x_3^0, \ldots, x_k^0)$

Sample $x_2^1$ from $\pi(x_2 \mid x_1^1, x_3^0, \ldots, x_k^0)$

Sample $x_3^1$ from $\pi(x_3 \mid x_1^1, x_2^1, x_4^0, \ldots, x_k^0)$

The Gibbs Sampler

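Continue through the remaining components, up to:

Sample $x_k^1$ from $\pi(x_k \mid x_1^1, x_2^1, \ldots, x_{k-1}^1)$

Cycle through the components again: at time n, update the ith component by drawing a value $x_i^n$ from

$$\pi(x_i \mid x_1^n, x_2^n, \ldots, x_{i-1}^n, x_{i+1}^{n-1}, \ldots, x_k^{n-1})$$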
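The component-by-component updates can be sketched for a case where the full conditionals are known in closed form. This example assumes a bivariate standard normal target with correlation ρ, for which x1 given x2 is normal with mean ρ·x2 and variance 1 − ρ² (and symmetrically for x2 given x1). The target, the function name `gibbs_bivariate_normal`, and the burn-in length are assumptions chosen for illustration.

```python
import math
import random

def gibbs_bivariate_normal(rho, n_steps, x1=0.0, x2=0.0, seed=0):
    """Gibbs sampler for a bivariate standard normal with correlation rho."""
    rng = random.Random(seed)
    cond_sd = math.sqrt(1.0 - rho * rho)  # std. dev. of each full conditional
    samples = []
    for _ in range(n_steps):
        # Update component 1 given the current value of component 2.
        x1 = rng.gauss(rho * x2, cond_sd)
        # Update component 2 given the NEW value of component 1.
        x2 = rng.gauss(rho * x1, cond_sd)
        samples.append((x1, x2))
    return samples

samples = gibbs_bivariate_normal(rho=0.8, n_steps=50_000)[1_000:]

# Check the draws by estimating the correlation from the samples.
n = len(samples)
m1 = sum(s[0] for s in samples) / n
m2 = sum(s[1] for s in samples) / n
cov = sum((s[0] - m1) * (s[1] - m2) for s in samples) / n
v1 = sum((s[0] - m1) ** 2 for s in samples) / n
v2 = sum((s[1] - m2) ** 2 for s in samples) / n
corr = cov / math.sqrt(v1 * v2)
print(corr)
```

Every draw is kept, since there is no accept/reject step; the price is that strongly correlated components make the chain move slowly.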

Example: Random Walker (Sample)

A drunken walker walks in discrete steps. In each step, he walks to the right with probability ½ and to the left with probability ½. He doesn’t remember his previous steps.
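The random walker is a Markov chain: the next position depends only on the current one. A short simulation sketch follows; the function name `random_walk`, the starting position 0, and the step count are illustrative choices.

```python
import random

def random_walk(n_steps, seed=0):
    """Simulate a symmetric random walk on the integers, starting at 0."""
    rng = random.Random(seed)
    position = 0
    path = [position]
    for _ in range(n_steps):
        # Step right (+1) or left (-1) with probability 1/2 each; the
        # choice ignores everything except the current position.
        position += 1 if rng.random() < 0.5 else -1
        path.append(position)
    return path

path = random_walk(1000)
print(path[-1])
```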

Rejection Sampling Method

Bayes’ Theorem (Rule, Law)

Bayes’ Theorem: Let events A1,…,Ak form a partition of the space S such that Pr(Aj) > 0 for all j and let B be any event such that Pr(B) > 0. Then for i = 1,..,k:

$$\Pr(A_i \mid B) = \frac{\Pr(A_i)\,\Pr(B \mid A_i)}{\sum_k \Pr(A_k)\,\Pr(B \mid A_k)}$$

Proof:

$$\Pr(A_i \mid B) = \frac{\Pr(A_i \cap B)}{\Pr(B)} = \frac{\Pr(A_i)\,\Pr(B \mid A_i)}{\sum_k \Pr(A_k)\,\Pr(B \mid A_k)}$$

Bayes’ Theorem is just a simple rule for computing the conditional probability of events Ai given B from the conditional probability of B given each event Ai and the unconditional probability of each Ai

Interpretation of Bayes’ Theorem

$$\Pr(A_i \mid B) = \frac{\Pr(A_i)\,\Pr(B \mid A_i)}{\sum_k \Pr(A_k)\,\Pr(B \mid A_k)}$$

Pr(Ai) = Prior distribution for the Ai. It summarizes your beliefs about the probability of event Ai before Ai or B are observed.

Pr( B | Ai ) = The conditional probability of B given Ai. It summarizes the likelihood of event B given Ai.

∑k Pr( Ak ) Pr( B | Ak ) = The normalizing constant. This is equal to the sum of the quantities in the numerator for all events Ak. Thus, Pr( Ai | B ) represents the likelihood of event Ai relative to all other elements of the partition of the sample space.

Pr( Ai | B ) = The posterior distribution of Ai given B. It represents the probability of event Ai after B has been observed.

Example of Bayes’ Theorem

What is the probability in a survey that someone is black given that they respond that they are black when asked?

- Suppose that 10% of the population is black, so Pr(B) = 0.10

- Suppose that 95% of blacks respond Yes when asked if they are black, so Pr( Y1 | B ) = 0.95

- Suppose that 5% of non-blacks respond Yes when asked if they are black, so Pr( Y1 | BC ) = 0.05

$$\Pr(B \mid Y_1) = \frac{\Pr(B)\,\Pr(Y_1 \mid B)}{\Pr(B)\,\Pr(Y_1 \mid B) + \Pr(B^C)\,\Pr(Y_1 \mid B^C)} = \frac{(0.1)(0.95)}{(0.1)(0.95) + (0.9)(0.05)} = \frac{0.095}{0.14} \approx 0.68$$

We reach the surprising conclusion that even if 95% of black and non-black respondents correctly classify themselves according to race, the probability that someone is black given that they say they are black is less than 0.70
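The arithmetic of this example can be reproduced with a small helper for the two-event case of Bayes' Theorem. The function name `posterior` and its parameter names are hypothetical, introduced only for this sketch.

```python
def posterior(prior, likelihood, likelihood_complement):
    """Bayes' Theorem for an event and its complement.

    prior                 = Pr(B)
    likelihood            = Pr(Y1 | B)
    likelihood_complement = Pr(Y1 | B^C)
    Returns Pr(B | Y1).
    """
    numerator = prior * likelihood
    # Normalizing constant: sum of the numerator over both events.
    evidence = numerator + (1.0 - prior) * likelihood_complement
    return numerator / evidence

p = posterior(prior=0.10, likelihood=0.95, likelihood_complement=0.05)
print(round(p, 2))  # 0.68
```

The posterior stays well below 0.95 because the 5% of false Yes answers come from the much larger non-black group (90% of the population).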

Applications

• Computer vision
  • Object tracking demo [Blake&Isard]

• Speech & audio enhancement

• Web statistics estimation

• Regression & classification

• Bayesian networks

• Genetics & molecular biology

• Robotics, etc.

Conclusion

• MCMC

• The Markov Chain Monte Carlo methods cover a variety of different fields and applications.

• There are great opportunities for combining existing sub-optimal algorithms with MCMC in many machine learning problems.

• Some areas already benefiting from sampling methods include:
  • Tracking, restoration, segmentation
  • Probabilistic graphical models
  • Classification
  • Data association for localization
  • Classical mixture models


Thank you

Bangladeshi Scientists and Researchers Network

https://www.facebook.com/groups/BDSRNet
