Intro to Comp Genomics
Lecture 5: Learning models using EM
Mixtures of Gaussians
$P(x \mid \theta) = N(x; \mu, \sigma)$ (a single normal) vs. $P(x \mid \theta) = \sum_i p_i\, N(x; \mu_i, \sigma_i)$ (a mixture of normals)
We have experimental results of some measured value. We want to describe the behavior of the experimental values: essentially one behavior? Two behaviors? More? In one dimension it may look easy: just looking at the distribution will give us a good idea.
We can formulate the model probabilistically as a mixture of normal distributions.
As a generative model: to generate data from the model, we first select the sub-model by sampling from the mixture variable. We then generate a value using the selected normal distribution.
If the data is multi-dimensional, the problem becomes non-trivial.
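As an illustration of the generative view, here is a minimal sketch (assuming a one-dimensional, two-component mixture with made-up parameters) that first samples the mixture component and then samples a value from the selected normal:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical mixture parameters (not from the lecture data)
p = np.array([0.2, 0.8])       # mixture coefficients
mu = np.array([0.0, 1.0])      # component means
sigma = np.array([1.0, 0.2])   # component standard deviations

def sample_mixture(n):
    # Step 1: choose a sub-model for each sample from the mixture variable
    s = rng.choice(len(p), size=n, p=p)
    # Step 2: draw the value from the selected normal distribution
    return rng.normal(mu[s], sigma[s]), s

x, s = sample_mixture(1000)
```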
Inference
Let’s represent the model as:
$P(x \mid \theta) = \sum_s \Pr(s)\,\Pr(x \mid s) = \sum_s p_s\, N(x; \mu_s, \sigma_s)$
What is the inference problem in our model?
Inference: computing the posterior probability of a hidden variable given the data and the model parameters:
$\Pr(s \mid x) = \dfrac{\Pr(s)\,\Pr(x \mid s)}{\sum_{s'} \Pr(s')\,\Pr(x \mid s')} = \dfrac{p_s\, N(x; \mu_s, \sigma_s)}{\sum_i p_i\, N(x; \mu_i, \sigma_i)}$
For $p_0 = 0.2,\; p_1 = 0.8,\; \mu_0 = 0,\; \mu_1 = 1,\; \sigma_0 = 1,\; \sigma_1 = 0.2$: what is $\Pr(s = 0 \mid x = 0.8)$?
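A quick way to check the answer numerically (a small sketch using scipy's normal density; the parameter values are the ones given above):

```python
from scipy.stats import norm

p = [0.2, 0.8]
mu = [0.0, 1.0]
sigma = [1.0, 0.2]
x = 0.8

# Unnormalized posteriors: Pr(s) * Pr(x | s)
num = [p[s] * norm.pdf(x, loc=mu[s], scale=sigma[s]) for s in range(2)]
posterior_s0 = num[0] / sum(num)   # Pr(s = 0 | x = 0.8)
print(posterior_s0)
```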
Estimation/parameter learning
Given data, how can we estimate the model parameters?
$L(\theta \mid x_1, \ldots, x_n) = \prod_i \Pr(x_i \mid \theta) = \prod_i \sum_j p_j\, N(x_i; \mu_j, \sigma_j)$
Transform it into an optimization problem!
Likelihood: a function of the parameters. Defined given the data.
Find parameters that maximize the likelihood: the ML problem
Can be approached heuristically: using any optimization technique.
But it is a nonlinear problem which may be very difficult.
Generic optimization techniques:
Gradient ascent: find $a^{k+1} = \arg\max_a L(a \mid x_1, \ldots, x_n)$, searching from $a^k$ along the gradient direction
Simulated annealing
Genetic algorithms
And more...
The EM algorithm for mixtures
We start by guessing parameters:
We now go over the samples and compute their posteriors (i.e., inference):
$\Pr(s \mid x_i, \theta^0) = \dfrac{p^0_s\, N(x_i; \mu^0_s, \sigma^0_s)}{\sum_j p^0_j\, N(x_i; \mu^0_j, \sigma^0_j)}$
We use the posteriors to compute new estimates for the expected sufficient statistics of each distribution, and for the mixture coefficients:
$\mu^1_s = E_s[x] = \dfrac{\sum_i \Pr(s \mid x_i, \theta^0)\, x_i}{\sum_i \Pr(s \mid x_i, \theta^0)}$
$(\sigma^1_s)^2 = V_s[x] = \dfrac{\sum_i \Pr(s \mid x_i, \theta^0)\, (x_i - \mu^1_s)^2}{\sum_i \Pr(s \mid x_i, \theta^0)}$
$p^1_s = \dfrac{1}{N} \sum_i \Pr(s \mid x_i, \theta^0)$
Continue iterating until convergence.
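A minimal sketch of these updates for a one-dimensional mixture (the fixed iteration count stands in for a real convergence test, and the starting parameters are whatever initial guess you supply):

```python
import numpy as np
from scipy.stats import norm

def em_mixture(x, p, mu, sigma, n_iter=100):
    """One-dimensional Gaussian-mixture EM following the update rules above."""
    p, mu, sigma = np.asarray(p, float), np.asarray(mu, float), np.asarray(sigma, float)
    for _ in range(n_iter):
        # E-step: posterior Pr(s | x_i, theta) for each sample and component
        dens = np.array([p[s] * norm.pdf(x, mu[s], sigma[s]) for s in range(len(p))])
        post = dens / dens.sum(axis=0)
        # M-step: weighted sufficient statistics give the new parameters
        w = post.sum(axis=1)
        mu = (post @ x) / w
        sigma = np.sqrt((post * (x - mu[:, None]) ** 2).sum(axis=1) / w)
        p = w / len(x)
    return p, mu, sigma
```

For example, `em_mixture(x, [0.5, 0.5], [0.0, 2.0], [1.0, 1.0])` refines a rough initial guess on the sampled data from the sketch above.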
The EM theorem: the algorithm will converge and will improve the likelihood monotonically.
But:
No guarantee of finding the optimum, or of finding anything meaningful.
The initial conditions are critical:
Think of starting from $\mu_0 = 0,\; \mu_1 = 10,\; \sigma_{1,2} = 1$
Solutions: start from "reasonable" solutions; try many starting points.
Hidden Markov Models
We observe only the emissions of states into some probability space E. Each state is equipped with an emission distribution ($x$ a state, $e$ an emission): $\Pr(e \mid x)$.
$\Pr(s_i, e_i \mid s_{i-1}) = \Pr(s_i \mid s_{i-1})\, \Pr(e_i \mid s_i)$
[Figure: state-transition diagram over the emission space]
Caution! This is NOT the HMM Bayes net:
1. It has cycles
2. The states are NOT random variables!
Simple example: Mixture with “memory”
$P(x \mid \theta) = \sum_h P(x, h \mid \theta) = \sum_h \Pr(h_0)\, P(x_0 \mid h_0) \prod_i \Pr(h_i \mid h_{i-1})\, P(x_i \mid h_i)$
We sample a sequence of dependent values. At each step, we decide if we continue to sample from the same distribution or switch, with probability p.
We can compute the probability directly only given the hidden variables.
P(x) is derived by summing over all possible combinations of hidden variables. This is another form of the inference problem (why?)
There is an exponential number of h assignments; can we still solve the problem efficiently?
[Figure: two-state chain with states A and B, transition probabilities $\Pr(B \mid A)$ and $\Pr(A \mid B)$, and emission distributions $P(x \mid A)$, $P(x \mid B)$]
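A small sketch of this generative process (two hidden states, a switch probability p, and illustrative emission parameters chosen only for the example):

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_memory_mixture(n, p_switch=0.1, mu=(0.0, 1.0), sigma=(1.0, 0.2)):
    """Sample n dependent values: stay in the current state, or switch with probability p_switch."""
    h = np.zeros(n, dtype=int)
    x = np.zeros(n)
    h[0] = rng.integers(2)
    x[0] = rng.normal(mu[h[0]], sigma[h[0]])
    for i in range(1, n):
        switch = rng.random() < p_switch
        h[i] = 1 - h[i - 1] if switch else h[i - 1]
        x[i] = rng.normal(mu[h[i]], sigma[h[i]])
    return x, h
```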
Inference in HMM
Forward formula:
$f^0_s = \begin{cases} 1 & s = \text{start} \\ 0 & \text{otherwise} \end{cases}$
$f^i_s = \Pr(e_i \mid s) \sum_{s'} \Pr(s \mid s')\, f^{i-1}_{s'}$
$L = \sum_s f^N_s\, \Pr(\text{finish} \mid s)$
Backward formula:
$b^N_s = \Pr(\text{finish} \mid s)$
$b^i_s = \sum_{s'} \Pr(s' \mid s)\, \Pr(e_{i+1} \mid s')\, b^{i+1}_{s'}$
$L = \sum_s \Pr(s \mid \text{start})\, \Pr(e_1 \mid s)\, b^1_s$
[Figure: the forward variable $f^i_s$ and the backward variable $b^i_s$ on the States x Emissions lattice, from Start to Finish]
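A compact sketch of these recursions (dense matrices, discrete emissions, and no log/scaling tricks, so it only suits short sequences; `trans[s', s]` is Pr(s | s'), `emit[s, e]` is Pr(e | s), and `start`/`finish` are the entry/exit probability vectors, all assumed as inputs):

```python
import numpy as np

def forward_backward(obs, start, trans, emit, finish):
    """Plain forward/backward recursions; returns f, b and the total likelihood from each side."""
    n, K = len(obs), len(start)
    f = np.zeros((n, K))
    b = np.zeros((n, K))
    # Forward: f[i, s] = Pr(e_i | s) * sum_{s'} Pr(s | s') f[i-1, s']
    f[0] = start * emit[:, obs[0]]
    for i in range(1, n):
        f[i] = emit[:, obs[i]] * (f[i - 1] @ trans)
    # Backward: b[i, s] = sum_{s'} Pr(s' | s) Pr(e_{i+1} | s') b[i+1, s']
    b[-1] = finish
    for i in range(n - 2, -1, -1):
        b[i] = trans @ (emit[:, obs[i + 1]] * b[i + 1])
    like_fwd = f[-1] @ finish                       # sum_s f[N, s] Pr(finish | s)
    like_bwd = start @ (emit[:, obs[0]] * b[0])     # termination from the backward side
    return f, b, like_fwd, like_bwd
```

Comparing `like_fwd` and `like_bwd` is the standard sanity check that the two recursions agree.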
EM for HMMs
[Figure: States x Emissions lattice from Start to Finish]
The posterior probability for a transition from s' to s after character i?
$\Pr(s' \to s \mid e, \theta^k) \sim \dfrac{1}{L(\theta^k)} \sum_i f^i_{s'}\, \Pr(s \mid s', \theta^k)\, \Pr(e_{i+1} \mid s, \theta^k)\, b^{i+1}_s$
With multiple sequences, assume independence (accumulate statistics).
Claim: HMM EM is monotonically improving the likelihood
The posterior probability for emitting the i'th character from state s?
$\Pr(e \mid s, \theta^k) \sim \dfrac{1}{L(\theta^k)} \sum_{i:\, e_i = e} f^i_s\, b^i_s$
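A sketch of how these posteriors turn into expected counts and re-estimated parameters, reusing the `forward_backward` sketch above (discrete emissions; re-estimation of the start/finish probabilities, smoothing, and multiple sequences are omitted):

```python
import numpy as np

def em_step(obs, start, trans, emit, finish):
    """One EM iteration for a discrete-emission HMM: accumulate expected counts, then normalize."""
    f, b, like, _ = forward_backward(obs, start, trans, emit, finish)
    n, K = len(obs), len(start)
    exp_trans = np.zeros_like(trans)
    exp_emit = np.zeros_like(emit)
    for i in range(n - 1):
        # Expected transition counts: f[i, s'] Pr(s|s') Pr(e_{i+1}|s) b[i+1, s] / L
        exp_trans += np.outer(f[i], emit[:, obs[i + 1]] * b[i + 1]) * trans / like
    for i in range(n):
        # Expected emission counts: f[i, s] b[i, s] / L, attributed to the observed symbol e_i
        exp_emit[:, obs[i]] += f[i] * b[i] / like
    new_trans = exp_trans / exp_trans.sum(axis=1, keepdims=True)
    new_emit = exp_emit / exp_emit.sum(axis=1, keepdims=True)
    return new_trans, new_emit, like
```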
The EM theorem for mixtures simplified
Assume that we know which distribution generated each sample (samples Si generated from distribution i)
We want to maximize the model’s likelihood, given this extra information:
$L(x \mid \theta) = \prod_i p_i^{|S_i|} \prod_{j \in S_i} N(x_j; \mu_i, \sigma_i)$
$\log L(x \mid \theta) = \sum_i |S_i| \log(p_i) + \sum_i \sum_{j \in S_i} \log N(x_j; \mu_i, \sigma_i)$
This decomposes into independent maximization problems:
$\max_{p} \sum_i |S_i| \log(p_i)$
$\max_{\mu_1, \sigma_1} \sum_{j \in S_1} \log N(x_j; \mu_1, \sigma_1)$
$\max_{\mu_2, \sigma_2} \sum_{j \in S_2} \log N(x_j; \mu_2, \sigma_2)$
$\ldots$
Solve separately:
"Multinomial estimator": solve using Lagrange multipliers:
$\max \sum_i |S_i| \log(p_i) \quad \text{s.t.} \quad \sum_i p_i = 1$
$\mathcal{L} = \sum_i |S_i| \log(p_i) + \lambda \big(1 - \sum_i p_i\big)$
$\dfrac{\partial \mathcal{L}}{\partial p_i} = \dfrac{|S_i|}{p_i} - \lambda = 0, \qquad \sum_i p_i = 1$
$\Rightarrow \arg\max_{p} \sum_i |S_i| \log(p_i)$ is attained at $p_i = \dfrac{|S_i|}{N}$
Normal distribution estimator: using observed sufficient statistics (an exponential family)
$\arg\max_{\mu_1, \sigma_1} \sum_{j \in S_1} \log N(x_j; \mu_1, \sigma_1) \;\Rightarrow\; \mu_i = E_{S_i}[x], \quad \sigma_i^2 = V_{S_i}[x]$
We found the global optimum of the likelihood in the case of full data.
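A sketch of this full-data estimator (assuming `x` holds the samples and `labels` holds the known generating component of each sample):

```python
import numpy as np

def full_data_mle(x, labels, K):
    """ML estimates when the generating component of every sample is known."""
    x, labels = np.asarray(x), np.asarray(labels)
    p = np.array([(labels == i).mean() for i in range(K)])        # p_i = |S_i| / N
    mu = np.array([x[labels == i].mean() for i in range(K)])      # mu_i = E_{S_i}[x]
    sigma = np.array([x[labels == i].std() for i in range(K)])    # sigma_i^2 = V_{S_i}[x]
    return p, mu, sigma
```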
The EM theorem for mixtures simplified
Assume now that each sample i is known to be from distribution j with probability Pij. We can write down:
$Q(\theta) = E_{P_{ij}}\Big[\log \prod_i p_{j_i}\, N(x_i; \mu_{j_i}, \sigma_{j_i})\Big] = \sum_i \sum_j P_{ij} \log\big(p_j\, N(x_i; \mu_j, \sigma_j)\big)$
$Q(x \mid \theta) = \sum_i \sum_j P_{ij} \log(p_j) + \sum_i \sum_j P_{ij} \log N(x_i; \mu_j, \sigma_j)$
Again this decomposes into independent maximization problems:
$\max_{p} \sum_i \sum_j P_{ij} \log(p_j)$
$\max_{\mu_1, \sigma_1} \sum_i P_{i1} \log N(x_i; \mu_1, \sigma_1)$
$\max_{\mu_2, \sigma_2} \sum_i P_{i2} \log N(x_i; \mu_2, \sigma_2)$
$\ldots$
Solve separately:
Same maximization holds.
In the EM algorithm we used: $P_{ij} = \Pr(s_i = j \mid x_i, \theta^k)$
In this case Q depends on the current parameters, so we denote it $Q(\theta \mid \theta^k)$.
What is missing? Q is not L!
Deriving the EM formula:
Expectation-Maximization
$P(h, x \mid \theta) = P(h \mid x, \theta)\, P(x \mid \theta)$
$\log P(x \mid \theta) = \log P(h, x \mid \theta) - \log P(h \mid x, \theta) \quad \text{(for any } h\text{)}$
$\log P(x \mid \theta) = \sum_h P(h \mid x, \theta^k) \log P(h, x \mid \theta) - \sum_h P(h \mid x, \theta^k) \log P(h \mid x, \theta)$
$Q(\theta \mid \theta^k) = \sum_h P(h \mid x, \theta^k) \log P(h, x \mid \theta)$
$\log P(x \mid \theta) - \log P(x \mid \theta^k) = Q(\theta \mid \theta^k) - Q(\theta^k \mid \theta^k) + \sum_h P(h \mid x, \theta^k) \log \dfrac{P(h \mid x, \theta^k)}{P(h \mid x, \theta)}$
The last term is a relative entropy, hence $\ge 0$, so the EM maximization
$\theta^{k+1} = \arg\max_\theta Q(\theta \mid \theta^k)$
can only increase the likelihood.
(Dempster et al., 1977)
KL-divergence
Entropy (Shannon): $H(P) = -\sum_i P(x_i) \log P(x_i)$
$\min H(P) = 0$; $\max H(P) = \log K$, attained at $P_k = 1/K$ for all $k$
Kullback-Leibler divergence: $D(P \| Q) = \sum_i P(x_i) \log \dfrac{P(x_i)}{Q(x_i)}$
$-D(P \| Q) = \sum_i P(x_i) \log \dfrac{Q(x_i)}{P(x_i)} \le \sum_i P(x_i)\Big(\dfrac{Q(x_i)}{P(x_i)} - 1\Big) = \sum_i Q(x_i) - \sum_i P(x_i) = 0$, using $\log(u) \le u - 1$
$D(P \| Q) \ne D(Q \| P)$: not a metric!!
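These quantities are easy to check numerically (a small sketch with made-up distributions):

```python
import numpy as np

P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.4, 0.4, 0.2])

entropy = -np.sum(P * np.log(P))     # H(P)
kl_pq = np.sum(P * np.log(P / Q))    # D(P || Q) >= 0
kl_qp = np.sum(Q * np.log(Q / P))    # D(Q || P), generally != D(P || Q)
```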
Bayesian learning vs. Maximum likelihood
Maximum likelihood estimator: $\theta_{ML} = \arg\max_\theta L(\theta \mid D)$
Introducing prior beliefs on the process (alternatively: think of virtual evidence) and computing posterior probabilities on the parameters:
MAP: $\theta_{MAP} = \arg\max_\theta \Pr(D \mid \theta)\, \Pr(\theta)$
PME: $\theta_{PME} = \int \theta\, \Pr(\theta \mid D)\, d\theta$
[Figure: parameter space, comparing the MLE (no prior beliefs) with the MAP and PME estimates (with beliefs)]
Your Task
Preparations:
• Get your hands on the ChIP-seq profiles of CTCF and PolII in hg chr17 (bin size = 50bp)
• Cut the data into segments of 50,000 data points
Modeling:
• Use EM to build a probabilistic model for the peak signals and the background
• Use heuristics for peak finding to initialize the EM
Analysis:
• Test if your model for a single peak structure is as good as the model for two peak structures
• Compute the distribution of peaks relative to transcription start sites
Your Task: Preparations
• Background on ChIP-seq
• CTCF and PolII
• Modeling ChIP-seq, binning
[Figure: background state B and peak states P1, P2, P3, ... over the binned signal]
Your Task: Modeling
$P(x \mid P_1) = N(x; \mu_1, \sigma_1)$
$P(x \mid P_2) = N(x; \mu_2, \sigma_2)$
$P(x \mid P_3) = N(x; \mu_3, \sigma_3)$
$P(x \mid P_4) = N(x; \mu_4, \sigma_4)$
$\ldots$
$P(x \mid B) = N(x; \mu, \sigma)$
The model uses K states for the peak and one state for the background. Use K = 40.
[Figure: HMM topology with start state S and finish state F]
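One plausible way to encode this topology (a sketch only; the exact transition structure, background to P1 through PK and back, and the peak-entry probability are illustrative assumptions, not a specification from the exercise):

```python
import numpy as np

def peak_hmm_transitions(K=40, p_enter=1e-4):
    """Transition matrix for a sketch topology: state 0 is the background B, states 1..K are P1..PK."""
    T = np.zeros((K + 1, K + 1))
    T[0, 0] = 1.0 - p_enter      # stay in the background
    T[0, 1] = p_enter            # enter a peak at P1
    for i in range(1, K):
        T[i, i + 1] = 1.0        # walk deterministically through the peak states
    T[K, 0] = 1.0                # after PK, return to the background
    return T
```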
Your Task: Modeling
Implement HMM inference: forward-backward - let’s write them together
Make sure the total probability you get from the forward and the backward algorithm is equal!
Implement the EM update rules - let’s write them together
Run EM from multiple random points and record the likelihoods you derive
Implement smarter initialization: take the average values around all probes with value over a threshold.
Compute posterior peak probabilities: report all loci with P(Peak)>0.8
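A sketch of the posterior peak call, reusing the `forward_backward` sketch above (here state 0 is assumed to be the background and all other states are peak states; with continuous emissions the `emit[:, obs[i]]` lookups would be replaced by per-state normal densities):

```python
import numpy as np

def call_peaks(f, b, like, threshold=0.8):
    """Posterior state probabilities from the forward/backward matrices; report loci with P(peak) > threshold."""
    post = f * b / like                   # post[i, s] = Pr(state_i = s | data)
    p_peak = post[:, 1:].sum(axis=1)      # everything except the background state (assumed index 0)
    return np.where(p_peak > threshold)[0], p_peak
```

The forward/backward consistency check from the list above is then simply `np.isclose(like_fwd, like_bwd)`.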
Your Task: Analysis
Compare the two peak structures you get (from CTCF and PolII).
Retrain a single model jointly on the two datasets.
Compute the log-likelihood of the unified model and compare it to the sum of the log-likelihoods of the two separate models.
Optional: test if the difference is significant by:
- sampling data from the unified model
- training two models on the synthetic data and computing the likelihood delta as for the real data
- Use a set of known TSSs to compute the distribution of peaks relative to genes.
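A sketch of the optional significance test (a parametric bootstrap; `sample_from_model`, `train_em`, and `loglik` are hypothetical helpers standing in for the sampling, training, and scoring code you write in the steps above):

```python
import numpy as np

# sample_from_model, train_em and loglik are hypothetical placeholders for your own
# sampling, EM-training and log-likelihood routines from the modeling steps.
def bootstrap_delta(unified_model, n_boot=100):
    """Null distribution of the likelihood delta, by re-fitting on data sampled from the unified model."""
    deltas = []
    for _ in range(n_boot):
        x1 = sample_from_model(unified_model)     # synthetic stand-in for the first dataset
        x2 = sample_from_model(unified_model)     # synthetic stand-in for the second dataset
        both = np.concatenate([x1, x2])
        unified = train_em(both)
        m1, m2 = train_em(x1), train_em(x2)
        deltas.append(loglik(m1, x1) + loglik(m2, x2) - loglik(unified, both))
    return np.array(deltas)
```

The observed delta on the real data can then be compared against this null distribution.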