Intro to Comp Genomics Lecture 5: Learning models using EM


Page 1: Intro to Comp Genomics Lecture 5: Learning models using EM

Intro to Comp Genomics

Lecture 5: Learning models using EM

Page 2: Intro to Comp Genomics Lecture 5: Learning models using EM

Mixtures of Gaussians

$P(x\mid\theta) = N(x;\mu,\sigma)$

$P(x\mid\theta) = \sum_i p_i\,N(x;\mu_i,\sigma_i)$

We have experimental results of some value, and we want to describe the behavior of the experimental values: essentially one behavior? Two behaviors? More? In one dimension it may look very easy: just looking at the distribution will give us a good idea.

We can formulate the model probabilistically as a mixture of normal distributions.

As a generative model: to generate data from the model, we first select the sub-model by sampling from the mixture variable. We then generate a value using the selected normal distribution.

If the data are multi-dimensional, the problem becomes non-trivial.
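As a quick illustration (not from the slides, with arbitrary parameter values), here is a minimal sketch of this generative process for a two-component mixture in one dimension:

```python
# Generate data from the mixture: pick a component, then sample from its normal.
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.3, 0.7])        # mixture coefficients (arbitrary illustration)
mu = np.array([0.0, 3.0])       # component means
sigma = np.array([1.0, 0.5])    # component standard deviations

def sample_mixture(n):
    s = rng.choice(len(p), size=n, p=p)       # hidden component for each sample
    return rng.normal(mu[s], sigma[s]), s     # observed value, hidden label

x, s = sample_mixture(1000)
```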

Page 3: Intro to Comp Genomics Lecture 5: Learning models using EM

Inference

Let’s represent the model as:

$P(x\mid\theta) = \sum_i p_i\,N(x;\mu_i,\sigma_i)$

What is the inference problem in our model?

$\Pr(x) = \sum_s \Pr(s)\,\Pr(x\mid s)$

$P(s\mid x) = \frac{\Pr(s)\,\Pr(x\mid s)}{\sum_{s'}\Pr(s')\,\Pr(x\mid s')} = \frac{p_s\,N(x;\mu_s,\sigma_s)}{\sum_{s'} p_{s'}\,N(x;\mu_{s'},\sigma_{s'})}$

Inference: computing the posterior probability of a hidden variable given the data and the model parameters.

For $p_0=0.2$, $p_1=0.8$, $\mu_0=0$, $\mu_1=1$, $\sigma_0=1$, $\sigma_1=0.2$, what is $\Pr(s=0\mid x=0.8)$?
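A small worked sketch of this inference step with the numbers above (the helper names are mine):

```python
# Pr(s=0 | x=0.8) for the parameters in the question, via Bayes' rule.
from math import exp, pi, sqrt

def normal_pdf(x, mu, sigma):
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2 * pi))

p, mu, sigma = [0.2, 0.8], [0.0, 1.0], [1.0, 0.2]
x = 0.8
joint = [p[s] * normal_pdf(x, mu[s], sigma[s]) for s in range(2)]
print(joint[0] / sum(joint))   # Pr(s)Pr(x|s) / sum_s' Pr(s')Pr(x|s'), roughly 0.056
```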

Page 4: Intro to Comp Genomics Lecture 5: Learning models using EM

Estimation/parameter learning

$P(x\mid\theta) = \sum_i p_i\,N(x;\mu_i,\sigma_i)$

Given data, how can we estimate the model parameters?

$L(\theta\mid x_1,\ldots,x_n) = \prod_i \Pr(x_i\mid\theta) = \prod_i \sum_j p_j\,N(x_i;\mu_j,\sigma_j)$

Transform it into an optimization problem!

Likelihood: a function of the parameters. Defined given the data.

Find parameters that maximize the likelihood: the ML problem

Can be approached heuristically: using any optimization technique.

But it is a non-linear problem, which may be very difficult.

Generic optimization techniques:

Gradient ascent: find $\hat\theta = \arg\max_{\theta} L(\theta\mid x_1,\ldots,x_n)$ by following the likelihood gradient

Simulated annealing

Genetic algorithms

And more..
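To make the objective concrete, here is a minimal sketch (my own, not from the slides) of the mixture log-likelihood as a function of the parameters; any of the generic optimizers above could then be pointed at it:

```python
# The mixture log-likelihood as a function of the parameters, given the data x.
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def mixture_log_likelihood(p, mu, sigma, x):
    # L(theta | x_1..x_n) = prod_i sum_j p_j N(x_i; mu_j, sigma_j), computed in log space
    per_component = np.array([p[j] * normal_pdf(x, mu[j], sigma[j]) for j in range(len(p))])
    return float(np.sum(np.log(per_component.sum(axis=0))))
```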

Page 5: Intro to Comp Genomics Lecture 5: Learning models using EM

The EM algorithm for mixtures

$P(x\mid\theta) = \sum_i p_i\,N(x;\mu_i,\sigma_i)$

We start by guessing parameters $\theta^0$:

We now go over the samples and compute their posteriors (i.e., inference):

$P(s\mid x_i,\theta^0) = \frac{p^0_s\,N(x_i;\mu^0_s,\sigma^0_s)}{\sum_{s'} p^0_{s'}\,N(x_i;\mu^0_{s'},\sigma^0_{s'})}$

We use the posteriors to compute new estimates for the expected sufficient statistics of each distribution, and for the mixture coefficients:

$\mu^1_s = E[x]_s = \frac{\sum_i P(s\mid x_i,\theta^0)\,x_i}{\sum_i P(s\mid x_i,\theta^0)}$

$(\sigma^1_s)^2 = V[x]_s = \frac{\sum_i P(s\mid x_i,\theta^0)\,(x_i-\mu^1_s)^2}{\sum_i P(s\mid x_i,\theta^0)}$

$p^1_s = \frac{1}{N}\sum_i P(s\mid x_i,\theta^0)$

Continue iterating until convergence.

The EM theorem: the algorithm will converge and will improve the likelihood monotonically.

But: there is no guarantee of finding the optimum, or of finding anything meaningful.

The initial conditions are critical:

Think of starting from $\mu_0=0$, $\mu_1=10$, $\sigma_0=\sigma_1=1$

Solutions: start from “reasonable” parameter values; try many starting points.

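A minimal sketch of one EM iteration implementing the update rules above for a one-dimensional mixture (the array shapes and names are my own choices):

```python
# One EM iteration for a 1D Gaussian mixture, following the updates above.
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def em_step(x, p, mu, sigma):
    # E-step: posteriors P(s | x_i, theta^0) for every sample and component
    joint = np.array([p[s] * normal_pdf(x, mu[s], sigma[s]) for s in range(len(p))])
    post = joint / joint.sum(axis=0)
    # M-step: expected sufficient statistics weighted by the posteriors
    weight = post.sum(axis=1)                                            # sum_i P(s | x_i)
    new_mu = (post * x).sum(axis=1) / weight                             # E[x]_s
    new_var = (post * (x - new_mu[:, None]) ** 2).sum(axis=1) / weight   # V[x]_s
    new_p = weight / len(x)                                              # p_s^1
    return new_p, new_mu, np.sqrt(new_var)

# Iterate em_step until the parameters (or the likelihood) stop changing appreciably.
```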

Page 6: Intro to Comp Genomics Lecture 5: Learning models using EM

Hidden Markov Models

We observe only the emissions of the states into some probability space E. Each state is equipped with an emission distribution $\Pr(e\mid x)$ ($x$ a state, $e$ an emission).

$\Pr(s,e) = \prod_i \Pr(s_i\mid s_{i-1})\,\Pr(e_i\mid s_i)$

Emission space

Caution! This is NOT the HMM Bayes Net

1. Cycles
2. States are NOT random variables!

Page 7: Intro to Comp Genomics Lecture 5: Learning models using EM

Simple example: Mixture with “memory”

$P(x\mid\theta) = \sum_h P(x,h\mid\theta) = \sum_h \Pr(h_0)\,P(x_0\mid h_0)\prod_{i\ge 1}\Pr(h_i\mid h_{i-1})\,P(x_i\mid h_i)$

We sample a sequence of dependent values. At each step, we decide whether to continue sampling from the same distribution or to switch, with probability p.

We can compute the probability directly only given the hidden variables.

P(x) is derived by summing over all possible combinations of hidden variables. This is another form of the inference problem (why?)

There is an exponential number of h assignments; can we still solve the problem efficiently?

[Diagram: two states A and B with transition probabilities $P(B\mid A)$ and $P(A\mid B)$, and emission distributions $P(x\mid A)$ and $P(x\mid B)$]

Page 8: Intro to Comp Genomics Lecture 5: Learning models using EM

Inference in HMM

Forward formula:

$f^0_s = 1$ if $s=\text{start}$, $0$ otherwise

$f^i_s = \Pr(e_i\mid s)\sum_{s'} f^{i-1}_{s'}\,\Pr(s\mid s')$

Backward formula:

$b^L_s = \Pr(\text{finish}\mid s)$

$b^i_s = \sum_{s'}\Pr(s'\mid s)\,\Pr(e_{i+1}\mid s')\,b^{i+1}_{s'}$

Total probability (the same value must come out of both recursions):

$\Pr(e_1,\ldots,e_L) = \sum_s f^L_s\,\Pr(\text{finish}\mid s) = \sum_s \Pr(s\mid\text{start})\,\Pr(e_1\mid s)\,b^1_s$

[Diagram: the forward variable $f^i_s$ and the backward variable $b^i_s$ on the Start → States/Emissions → Finish trellis]
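A minimal sketch of these recursions (my own formulation: `A[s_prev, s]` holds Pr(s|s'), `E[s, c]` holds Pr(c|s), `start[s]` = Pr(s|start), `finish[s]` = Pr(finish|s); no scaling or log-space arithmetic is used here, which long sequences require):

```python
# Forward-backward recursions exactly as written above (no numerical scaling).
import numpy as np

def forward_backward(A, E, start, finish, e):
    # A[s_prev, s] = Pr(s | s'); E[s, c] = Pr(c | s)
    # start[s] = Pr(s | start); finish[s] = Pr(finish | s); e = observed symbol indices
    L, K = len(e), A.shape[0]
    f = np.zeros((L + 1, K))                      # index 0 is unused; positions are 1..L
    b = np.zeros((L + 1, K))
    f[1] = E[:, e[0]] * start                     # f_s^1 = Pr(e_1|s) Pr(s|start)
    for i in range(2, L + 1):
        f[i] = E[:, e[i - 1]] * (f[i - 1] @ A)    # f_s^i = Pr(e_i|s) sum_s' f_s'^{i-1} Pr(s|s')
    b[L] = finish                                 # b_s^L = Pr(finish|s)
    for i in range(L - 1, 0, -1):
        b[i] = A @ (E[:, e[i]] * b[i + 1])        # b_s^i = sum_s' Pr(s'|s) Pr(e_{i+1}|s') b_s'^{i+1}
    total_f = f[L] @ finish                       # sum_s f_s^L Pr(finish|s)
    total_b = (start * E[:, e[0]]) @ b[1]         # sum_s Pr(s|start) Pr(e_1|s) b_s^1
    return f, b, total_f, total_b
```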


Page 10: Intro to Comp Genomics Lecture 5: Learning models using EM

EM for HMMs

$\Pr(s\mid s',\theta^{k+1}) \sim \frac{1}{L(\theta^k)}\sum_i f^i_{s'}\,\Pr(s\mid s',\theta^k)\,\Pr(e_{i+1}\mid s,\theta^k)\,b^{i+1}_s$

The posterior probability for transition from s’ to s after character i?

With multiple sequences, assume independence (accumulate statistics).

Claim: HMM EM monotonically improves the likelihood.

The posterior probability for emitting the i’th character from state s?

$\Pr(e\mid s,\theta^{k+1}) \sim \frac{1}{L(\theta^k)}\sum_{i:\,e_i=e} f^i_s\,b^i_s$
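A minimal sketch of how these posteriors accumulate into expected transition and emission counts for the M-step (it reuses the `forward_backward` sketch shown after the inference slide; normalizing the counts into new parameters is omitted):

```python
# Expected transition and emission counts from the posteriors above (one sequence).
import numpy as np

def expected_counts(A, E, start, finish, e):
    f, b, total, _ = forward_backward(A, E, start, finish, e)
    K, C = E.shape
    trans = np.zeros((K, K))                       # expected number of s' -> s transitions
    emit = np.zeros((K, C))                        # expected number of emissions of c from s
    L = len(e)
    for i in range(1, L):
        # posterior of a transition s' -> s between positions i and i+1:
        # f_{s'}^i Pr(s|s') Pr(e_{i+1}|s) b_s^{i+1} / Pr(data)
        trans += np.outer(f[i], E[:, e[i]] * b[i + 1]) * A / total
    for i in range(1, L + 1):
        emit[:, e[i - 1]] += f[i] * b[i] / total   # posterior of being in s at position i
    # (start/finish transitions are ignored here; with multiple sequences, sum the counts)
    return trans, emit
```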

Page 11: Intro to Comp Genomics Lecture 5: Learning models using EM

The EM theorem for mixtures simplified

Assume that we know which distribution generated each sample (samples Si generated from distribution i)

We want to maximize the model’s likelihood, given this extra information:

$L(x\mid\theta) = \prod_i p_i^{|S_i|}\prod_{j\in S_i} N(x_j;\mu_i,\sigma_i)$


$LL(x\mid\theta) = \sum_i |S_i|\log(p_i) + \sum_i\sum_{j\in S_i}\log N(x_j;\mu_i,\sigma_i)$

$\max_{p}\;\sum_i |S_i|\log(p_i)$

$\max_{\mu_1,\sigma_1}\;\sum_{j\in S_1}\log N(x_j;\mu_1,\sigma_1)$

$\max_{\mu_2,\sigma_2}\;\sum_{j\in S_2}\log N(x_j;\mu_2,\sigma_2)$

$\ldots$

Solve separately:

$\arg\max_{p}\sum_i|S_i|\log(p_i)\;\Rightarrow\;p_i = \frac{|S_i|}{N}$

The “multinomial estimator”: solve using Lagrange multipliers:

$\max_p \sum_i|S_i|\log(p_i)\quad\text{s.t. }\sum_i p_i = 1$

$\mathcal{L} = \sum_i|S_i|\log(p_i) + \lambda\Bigl(1-\sum_i p_i\Bigr)$

$\frac{\partial\mathcal{L}}{\partial p_i} = \frac{|S_i|}{p_i}-\lambda = 0,\qquad \sum_i p_i = 1\;\Rightarrow\; p_i=\frac{|S_i|}{\sum_j|S_j|}=\frac{|S_i|}{N}$

Page 12: Intro to Comp Genomics Lecture 5: Learning models using EM

The EM theorem for mixtures simplified


Solve separately:

Normal distribution estimator: using observed sufficient statistics (an exponential family)

$\arg\max_{\mu_1,\sigma_1}\sum_{j\in S_1}\log N(x_j;\mu_1,\sigma_1)\;\Rightarrow\;\mu_i = E_{S_i}[x],\quad\sigma_i^2 = V_{S_i}[x]$

We found the global optimum of the likelihood in the case of full data.
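A minimal sketch of these full-data estimators (assuming the samples are in `x` and their known component labels in `labels`; the names are illustrative):

```python
# Full-data ML estimates: mixture weights, per-component mean and variance.
import numpy as np

def full_data_mle(x, labels, n_components):
    p, mu, var = [], [], []
    for i in range(n_components):
        xi = x[labels == i]            # the samples S_i known to come from component i
        p.append(len(xi) / len(x))     # p_i = |S_i| / N   (multinomial estimator)
        mu.append(xi.mean())           # mu_i = E_{S_i}[x]
        var.append(xi.var())           # sigma_i^2 = V_{S_i}[x]
    return np.array(p), np.array(mu), np.array(var)
```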

Page 13: Intro to Comp Genomics Lecture 5: Learning models using EM

The EM theorem for mixtures simplified

Assume now that each sample i is known to be from distribution j with probability Pij. We can write down:

$Q(\theta) = E_{P}\Bigl[\log\prod_i p_{j_i}\,N(x_i;\mu_{j_i},\sigma_{j_i})\Bigr]$

(the expectation is over the assignments: sample $i$ is assigned to distribution $j_i = j$ with probability $P_{ij}$)


$Q(x\mid\theta) = \sum_i\sum_j P_{ij}\log(p_j) + \sum_i\sum_j P_{ij}\log N(x_i;\mu_j,\sigma_j)$

$\max_{p}\;\sum_i\sum_j P_{ij}\log(p_j)$

$\max_{\mu_1,\sigma_1}\;\sum_i P_{i1}\log N(x_i;\mu_1,\sigma_1)$

$\max_{\mu_2,\sigma_2}\;\sum_i P_{i2}\log N(x_i;\mu_2,\sigma_2)$

$\ldots$

Solve separately:

Same maximization holds.

In the EM algorithm we used:

$P_{ij} = P(s_i = j\mid x_i,\theta^k)$

Deriving the EM formula.

In this case Q is dependent on the current parameters, so we call it $Q(\theta\mid\theta^k)$.

What is missing? Q is not L!

Page 14: Intro to Comp Genomics Lecture 5: Learning models using EM

Expectation-Maximization

$P(h,x\mid\theta) = P(h\mid x,\theta)\,P(x\mid\theta)$

$\log P(x\mid\theta) = \log\sum_h P(h,x\mid\theta)$

$\log P(x\mid\theta) = \log P(h,x\mid\theta) - \log P(h\mid x,\theta)$

$\log P(x\mid\theta) = \sum_h P(h\mid x,\theta^k)\log P(h,x\mid\theta) - \sum_h P(h\mid x,\theta^k)\log P(h\mid x,\theta)$

$Q(\theta\mid\theta^k) = \sum_h P(h\mid x,\theta^k)\log P(h,x\mid\theta)$

$\log P(x\mid\theta) - \log P(x\mid\theta^k) = Q(\theta\mid\theta^k) - Q(\theta^k\mid\theta^k) + \sum_h P(h\mid x,\theta^k)\log\frac{P(h\mid x,\theta^k)}{P(h\mid x,\theta)}$

Relative entropy ≥ 0; EM maximization:

$\theta^{k+1} = \arg\max_\theta Q(\theta\mid\theta^k)$

Since the relative-entropy term is non-negative and $\theta^{k+1}$ maximizes $Q(\cdot\mid\theta^k)$, each EM iteration can only increase $\log P(x\mid\theta)$.

Dempster

Page 15: Intro to Comp Genomics Lecture 5: Learning models using EM

KL-divergence

$H(P) = -\sum_i P(x_i)\log P(x_i)$

$\max_P H(P) = H\bigl(\tfrac{1}{K},\ldots,\tfrac{1}{K}\bigr) = \log K,\qquad \min_P H(P) = 0$

Entropy (Shannon)

$D(P\,\|\,Q) = \sum_i P(x_i)\log\frac{P(x_i)}{Q(x_i)}$   Kullback–Leibler divergence

$-D(P\,\|\,Q) = \sum_i P(x_i)\log\frac{Q(x_i)}{P(x_i)} \le \sum_i P(x_i)\Bigl(\frac{Q(x_i)}{P(x_i)}-1\Bigr) = \sum_i Q(x_i) - \sum_i P(x_i) = 1 - 1 = 0$

using $\log(u)\le u-1$.

$D(P\,\|\,Q)\ne D(Q\,\|\,P)$


Not a metric!!
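A small sketch computing both directions of the divergence for an arbitrary pair of distributions, to see the asymmetry (the numbers are arbitrary):

```python
# The Kullback-Leibler divergence is not symmetric: D(P||Q) != D(Q||P) in general.
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))   # D(P||Q) = sum_i P(x_i) log(P(x_i)/Q(x_i))

P = np.array([0.8, 0.15, 0.05])
Q = np.array([1 / 3, 1 / 3, 1 / 3])
print(kl(P, Q), kl(Q, P))                     # the two directions give different values
```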

Page 16: Intro to Comp Genomics Lecture 5: Learning models using EM

Bayesian learning vs. Maximum likelihood

$\theta_{ML} = \arg\max_\theta L(D\mid\theta)$ (maximum likelihood estimator)

Introducing prior beliefs on the process (alternatively: think of virtual evidence). Computing posterior probabilities on the parameters:

$\theta_{MAP} = \arg\max_\theta \Pr(D\mid\theta)\,\Pr(\theta)$

$\theta_{PME} = \int \theta\,\Pr(\theta\mid D)\,d\theta$

[Figure: the likelihood over the parameter space with the MLE (no prior beliefs), and the posterior over the parameter space with the MAP and PME estimators (with prior beliefs)]
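As an illustration of the “virtual evidence” view (my own example, not from the slides): for a multinomial, a symmetric Dirichlet prior acts like virtual counts added to the observed counts, so the Bayesian estimate is a smoothed version of the ML estimate:

```python
# ML vs. Bayesian estimates for a multinomial: the prior behaves like virtual counts.
import numpy as np

counts = np.array([12, 3, 0, 1])          # observed counts for each category (made up)
alpha = 1.0                               # symmetric Dirichlet prior = virtual evidence

p_ml = counts / counts.sum()              # maximum likelihood: zero for unseen categories
p_pme = (counts + alpha) / (counts + alpha).sum()   # posterior mean under Dirichlet(alpha)
```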

Page 17: Intro to Comp Genomics Lecture 5: Learning models using EM

Preparations:
• Get your hands on the ChIP-seq profiles of CTCF and PolII in hg chr17 (bin size = 50bp)
• Cut the data into segments of 50,000 data points

Modeling:
• Use EM to build a probabilistic model for the peak signals and the background
• Use heuristics for peak finding to initialize the EM

Analysis:
• Test if your model for a single peak structure is as good as the model for two peak structures
• Compute the distribution of peaks relative to transcription start sites

Your Task

Preparations:

Background on ChIP-seq

CTCF and PolII

Modeling ChIP-seq, binning

Page 18: Intro to Comp Genomics Lecture 5: Learning models using EM

Your Task

[Diagram: HMM with a background state B, a chain of peak states P1, P2, P3, ..., and start (S) and finish (F) states]


Your Task: Modeling

$P(x\mid P_1) = N(x;\mu_1,\sigma_1)$

$P(x\mid P_2) = N(x;\mu_2,\sigma_2)$

$P(x\mid P_3) = N(x;\mu_3,\sigma_3)$

$P(x\mid P_4) = N(x;\mu_4,\sigma_4)$

$P(x\mid B) = N(x;\mu_B,\sigma_B)$

The model uses K states for the peak and one state for the background. Use K = 40.

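A minimal sketch of one way to lay out the transition matrix for such a model (the topology, a self-looping background state feeding a left-to-right chain of K peak states that returns to the background, is my reading of the diagram; adapt it to the structure you actually choose):

```python
# Transition matrix for one background state + K peak states (assumed topology).
import numpy as np

def build_transitions(K, p_enter):
    A = np.zeros((K + 1, K + 1))       # state 0 = background B, states 1..K = P1..PK
    A[0, 0] = 1.0 - p_enter            # B -> B
    A[0, 1] = p_enter                  # B -> P1 (a peak starts)
    for k in range(1, K):
        A[k, k + 1] = 1.0              # Pk -> Pk+1 (march through the peak)
    A[K, 0] = 1.0                      # PK -> B (the peak ends)
    return A

A = build_transitions(K=40, p_enter=1e-3)   # K = 40 as in the assignment
```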

Page 19: Intro to Comp Genomics Lecture 5: Learning models using EM

Your Task


Your Task: Modeling

Implement HMM inference: forward-backward - let’s write them together

Make sure your total probability is equal for the forward and backward algorithms!

Implement the EM update rules - let’s write them together

Run EM from multiple random points and record the likelihoods you derive

Implement smarter initialization: take the average values around all probes with value over a threshold.

Compute posterior peak probabilities: report all loci with P(Peak)>0.8
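A minimal sketch of the last two items, reusing the `forward_backward` sketch from the inference slide (in practice you will want scaled or log-space recursions for 50,000-point segments, which this omits):

```python
# Posterior peak probabilities plus the forward/backward consistency check.
import numpy as np

def call_peaks(A, E, start, finish, e, peak_states, cutoff=0.8):
    f, b, total_f, total_b = forward_backward(A, E, start, finish, e)
    assert np.isclose(total_f, total_b), "forward and backward totals must agree"
    post = f[1:] * b[1:] / total_f                # P(state s at position i | data)
    p_peak = post[:, peak_states].sum(axis=1)     # P(any peak state) at each position
    return np.where(p_peak > cutoff)[0]           # loci with P(Peak) > 0.8
```

With the state layout sketched earlier, `peak_states` would be `range(1, K + 1)`.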

Page 20: Intro to Comp Genomics Lecture 5: Learning models using EM

Your Task


Your Task: Analysis

Compare the two peak structures you get (from CTCF and PolII)

Retrain a single model jointly on the two datasets

Compute the log-likelihood of the unified model and compare it to the sum of the log-likelihoods of the two separate models

Optional: test if the difference is significant by:
- sampling data from the unified model
- training two models on the synthetic data and computing the likelihood delta as for the real data

- Use a set of known TSSs to compute the distribution of peaks relative to genes