Intro to Comp Genomics
Lecture 5: Learning models using EM
Mixtures of Gaussians
$P(x \mid \theta) = N(x; \mu, \sigma)$ (a single normal) vs. $P(x \mid \theta) = \sum_i p_i\, N(x; \mu_i, \sigma_i)$ (a mixture of normals)
We have experimental results of some measured value. We want to describe the behavior of the experimental values: essentially one behavior? Two behaviors? More? In one dimension it may look easy: just looking at the distribution will give us a good idea.
We can formulate the model probabilistically as a mixture of normal distributions.
As a generative model: to generate data from the model, we first select the sub-model by sampling from the mixture variable. We then generate a value using the selected normal distribution.
If the data is multi-dimensional, the problem becomes non-trivial.
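As an illustration of the generative view, here is a minimal sketch (assuming a one-dimensional, two-component mixture with made-up parameters) that first samples the mixture component and then samples a value from the selected normal:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical mixture parameters (not from the lecture data)
p = np.array([0.2, 0.8])       # mixture coefficients
mu = np.array([0.0, 1.0])      # component means
sigma = np.array([1.0, 0.2])   # component standard deviations

def sample_mixture(n):
    # Step 1: choose a sub-model for each sample from the mixture variable
    s = rng.choice(len(p), size=n, p=p)
    # Step 2: draw the value from the selected normal distribution
    return rng.normal(mu[s], sigma[s]), s

x, s = sample_mixture(1000)
```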
Inference
Let’s represent the model as:
$P(x \mid \theta) = \sum_s \Pr(s)\,\Pr(x \mid s) = \sum_s p_s\, N(x; \mu_s, \sigma_s)$
What is the inference problem in our model?
Inference: computing the posterior probability of a hidden variable given the data and the model parameters:
$\Pr(s \mid x) = \dfrac{\Pr(s)\,\Pr(x \mid s)}{\sum_{s'} \Pr(s')\,\Pr(x \mid s')} = \dfrac{p_s\, N(x; \mu_s, \sigma_s)}{\sum_i p_i\, N(x; \mu_i, \sigma_i)}$
For $p_0 = 0.2,\; p_1 = 0.8,\; \mu_0 = 0,\; \mu_1 = 1,\; \sigma_0 = 1,\; \sigma_1 = 0.2$: what is $\Pr(s = 0 \mid x = 0.8)$?
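A quick way to check the answer numerically (a small sketch using scipy's normal density; the parameter values are the ones given above):

```python
from scipy.stats import norm

p = [0.2, 0.8]
mu = [0.0, 1.0]
sigma = [1.0, 0.2]
x = 0.8

# Unnormalized posteriors: Pr(s) * Pr(x | s)
num = [p[s] * norm.pdf(x, loc=mu[s], scale=sigma[s]) for s in range(2)]
posterior_s0 = num[0] / sum(num)   # Pr(s = 0 | x = 0.8)
print(posterior_s0)
```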
Estimation/parameter learning
Given data, how can we estimate the model parameters?
$L(\theta \mid x_1, \ldots, x_n) = \prod_i \Pr(x_i \mid \theta) = \prod_i \sum_j p_j\, N(x_i; \mu_j, \sigma_j)$
Transform it into an optimization problem!
Likelihood: a function of the parameters. Defined given the data.
Find parameters that maximize the likelihood: the ML problem
Can be approached heuristically: using any optimization technique.
But it is a nonlinear problem which may be very difficult.
Generic optimization techniques:
Gradient ascent: find $a^{k+1} = \arg\max_a L(a \mid x_1, \ldots, x_n)$, searching from $a^k$ along the gradient direction
Simulated annealing
Genetic algorithms
And more...
The EM algorithm for mixtures
We start by guessing parameters:
We now go over the samples and compute their posteriors (i.e., inference):
$\Pr(s \mid x_i, \theta^0) = \dfrac{p^0_s\, N(x_i; \mu^0_s, \sigma^0_s)}{\sum_j p^0_j\, N(x_i; \mu^0_j, \sigma^0_j)}$
We use the posteriors to compute new estimates for the expected sufficient statistics of each distribution, and for the mixture coefficients:
$\mu^1_s = E_s[x] = \dfrac{\sum_i \Pr(s \mid x_i, \theta^0)\, x_i}{\sum_i \Pr(s \mid x_i, \theta^0)}$
$(\sigma^1_s)^2 = V_s[x] = \dfrac{\sum_i \Pr(s \mid x_i, \theta^0)\, (x_i - \mu^1_s)^2}{\sum_i \Pr(s \mid x_i, \theta^0)}$
$p^1_s = \dfrac{1}{N} \sum_i \Pr(s \mid x_i, \theta^0)$
Continue iterating until convergence.
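A minimal sketch of these updates for a one-dimensional mixture (the fixed iteration count stands in for a real convergence test, and the starting parameters are whatever initial guess you supply):

```python
import numpy as np
from scipy.stats import norm

def em_mixture(x, p, mu, sigma, n_iter=100):
    """One-dimensional Gaussian-mixture EM following the update rules above."""
    p, mu, sigma = np.asarray(p, float), np.asarray(mu, float), np.asarray(sigma, float)
    for _ in range(n_iter):
        # E-step: posterior Pr(s | x_i, theta) for each sample and component
        dens = np.array([p[s] * norm.pdf(x, mu[s], sigma[s]) for s in range(len(p))])
        post = dens / dens.sum(axis=0)
        # M-step: weighted sufficient statistics give the new parameters
        w = post.sum(axis=1)
        mu = (post @ x) / w
        sigma = np.sqrt((post * (x - mu[:, None]) ** 2).sum(axis=1) / w)
        p = w / len(x)
    return p, mu, sigma
```

For example, `em_mixture(x, [0.5, 0.5], [0.0, 2.0], [1.0, 1.0])` refines a rough initial guess on the sampled data from the sketch above.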
The EM theorem: the algorithm will converge and will improve the likelihood monotonically.
But:
No guarantee of finding the optimum, or of finding anything meaningful.
The initial conditions are critical:
Think of starting from $\mu_0 = 0,\; \mu_1 = 10,\; \sigma_{1,2} = 1$
Solutions: start from "reasonable" solutions; try many starting points.
Hidden Markov Models
We observe only the emissions of states into some probability space E. Each state is equipped with an emission distribution ($x$ a state, $e$ an emission): $\Pr(e \mid x)$.
$\Pr(s_i, e_i \mid s_{i-1}) = \Pr(s_i \mid s_{i-1})\, \Pr(e_i \mid s_i)$
[Figure: state-transition diagram over the emission space]
Caution! This is NOT the HMM Bayes net:
1. It has cycles
2. The states are NOT random variables!
Simple example: Mixture with “memory”
$P(x \mid \theta) = \sum_h P(x, h \mid \theta) = \sum_h \Pr(h_0)\, P(x_0 \mid h_0) \prod_i \Pr(h_i \mid h_{i-1})\, P(x_i \mid h_i)$
We sample a sequence of dependent values. At each step, we decide if we continue to sample from the same distribution or switch, with probability p.
We can compute the probability directly only given the hidden variables.
P(x) is derived by summing over all possible combinations of hidden variables. This is another form of the inference problem (why?)
There is an exponential number of h assignments; can we still solve the problem efficiently?
[Figure: two-state chain with states A and B, transition probabilities $\Pr(B \mid A)$ and $\Pr(A \mid B)$, and emission distributions $P(x \mid A)$, $P(x \mid B)$]
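A small sketch of this generative process (two hidden states, a switch probability p, and illustrative emission parameters chosen only for the example):

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_memory_mixture(n, p_switch=0.1, mu=(0.0, 1.0), sigma=(1.0, 0.2)):
    """Sample n dependent values: stay in the current state, or switch with probability p_switch."""
    h = np.zeros(n, dtype=int)
    x = np.zeros(n)
    h[0] = rng.integers(2)
    x[0] = rng.normal(mu[h[0]], sigma[h[0]])
    for i in range(1, n):
        switch = rng.random() < p_switch
        h[i] = 1 - h[i - 1] if switch else h[i - 1]
        x[i] = rng.normal(mu[h[i]], sigma[h[i]])
    return x, h
```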
Inference in HMM
Forward formula:
$f^0_s = \begin{cases} 1 & s = \text{start} \\ 0 & \text{otherwise} \end{cases}$
$f^i_s = \Pr(e_i \mid s) \sum_{s'} \Pr(s \mid s')\, f^{i-1}_{s'}$
$L = \sum_s f^N_s\, \Pr(\text{finish} \mid s)$
Backward formula:
$b^N_s = \Pr(\text{finish} \mid s)$
$b^i_s = \sum_{s'} \Pr(s' \mid s)\, \Pr(e_{i+1} \mid s')\, b^{i+1}_{s'}$
$L = \sum_s \Pr(s \mid \text{start})\, \Pr(e_1 \mid s)\, b^1_s$
[Figure: the forward variable $f^i_s$ and the backward variable $b^i_s$ on the States x Emissions lattice, from Start to Finish]
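A compact sketch of these recursions (dense matrices, discrete emissions, and no log/scaling tricks, so it only suits short sequences; `trans[s', s]` is Pr(s | s'), `emit[s, e]` is Pr(e | s), and `start`/`finish` are the entry/exit probability vectors, all assumed as inputs):

```python
import numpy as np

def forward_backward(obs, start, trans, emit, finish):
    """Plain forward/backward recursions; returns f, b and the total likelihood from each side."""
    n, K = len(obs), len(start)
    f = np.zeros((n, K))
    b = np.zeros((n, K))
    # Forward: f[i, s] = Pr(e_i | s) * sum_{s'} Pr(s | s') f[i-1, s']
    f[0] = start * emit[:, obs[0]]
    for i in range(1, n):
        f[i] = emit[:, obs[i]] * (f[i - 1] @ trans)
    # Backward: b[i, s] = sum_{s'} Pr(s' | s) Pr(e_{i+1} | s') b[i+1, s']
    b[-1] = finish
    for i in range(n - 2, -1, -1):
        b[i] = trans @ (emit[:, obs[i + 1]] * b[i + 1])
    like_fwd = f[-1] @ finish                       # sum_s f[N, s] Pr(finish | s)
    like_bwd = start @ (emit[:, obs[0]] * b[0])     # termination from the backward side
    return f, b, like_fwd, like_bwd
```

Comparing `like_fwd` and `like_bwd` is the standard sanity check that the two recursions agree.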
EM for HMMs
[Figure: States x Emissions lattice from Start to Finish]
The posterior probability for a transition from s' to s after character i?
$\Pr(s' \to s \mid e, \theta^k) \sim \dfrac{1}{L(\theta^k)} \sum_i f^i_{s'}\, \Pr(s \mid s', \theta^k)\, \Pr(e_{i+1} \mid s, \theta^k)\, b^{i+1}_s$
With multiple sequences, assume independence (accumulate statistics).
Claim: HMM EM is monotonically improving the likelihood
The posterior probability for emitting the i'th character from state s?
$\Pr(e \mid s, \theta^k) \sim \dfrac{1}{L(\theta^k)} \sum_{i:\, e_i = e} f^i_s\, b^i_s$
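A sketch of how these posteriors turn into expected counts and re-estimated parameters, reusing the `forward_backward` sketch above (discrete emissions; re-estimation of the start/finish probabilities, smoothing, and multiple sequences are omitted):

```python
import numpy as np

def em_step(obs, start, trans, emit, finish):
    """One EM iteration for a discrete-emission HMM: accumulate expected counts, then normalize."""
    f, b, like, _ = forward_backward(obs, start, trans, emit, finish)
    n, K = len(obs), len(start)
    exp_trans = np.zeros_like(trans)
    exp_emit = np.zeros_like(emit)
    for i in range(n - 1):
        # Expected transition counts: f[i, s'] Pr(s|s') Pr(e_{i+1}|s) b[i+1, s] / L
        exp_trans += np.outer(f[i], emit[:, obs[i + 1]] * b[i + 1]) * trans / like
    for i in range(n):
        # Expected emission counts: f[i, s] b[i, s] / L, attributed to the observed symbol e_i
        exp_emit[:, obs[i]] += f[i] * b[i] / like
    new_trans = exp_trans / exp_trans.sum(axis=1, keepdims=True)
    new_emit = exp_emit / exp_emit.sum(axis=1, keepdims=True)
    return new_trans, new_emit, like
```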
The EM theorem for mixtures simplified
Assume that we know which distribution generated each sample (samples Si generated from distribution i)
We want to maximize the model’s likelihood, given this extra information:
$L(x \mid \theta) = \prod_i p_i^{|S_i|} \prod_{j \in S_i} N(x_j; \mu_i, \sigma_i)$
$\log L(x \mid \theta) = \sum_i |S_i| \log(p_i) + \sum_i \sum_{j \in S_i} \log N(x_j; \mu_i, \sigma_i)$
This decomposes into independent maximization problems:
$\max_{p} \sum_i |S_i| \log(p_i)$
$\max_{\mu_1, \sigma_1} \sum_{j \in S_1} \log N(x_j; \mu_1, \sigma_1)$
$\max_{\mu_2, \sigma_2} \sum_{j \in S_2} \log N(x_j; \mu_2, \sigma_2)$
$\ldots$
Solve separately:
"Multinomial estimator": solve using Lagrange multipliers:
$\max \sum_i |S_i| \log(p_i) \quad \text{s.t.} \quad \sum_i p_i = 1$
$\mathcal{L} = \sum_i |S_i| \log(p_i) + \lambda \big(1 - \sum_i p_i\big)$
$\dfrac{\partial \mathcal{L}}{\partial p_i} = \dfrac{|S_i|}{p_i} - \lambda = 0, \qquad \sum_i p_i = 1$
$\Rightarrow \arg\max_{p} \sum_i |S_i| \log(p_i)$ is attained at $p_i = \dfrac{|S_i|}{N}$
Normal distribution estimator: using observed sufficient statistics (an exponential family)
$\arg\max_{\mu_1, \sigma_1} \sum_{j \in S_1} \log N(x_j; \mu_1, \sigma_1) \;\Rightarrow\; \mu_i = E_{S_i}[x], \quad \sigma_i^2 = V_{S_i}[x]$
We found the global optimum of the likelihood in the case of full data.
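A sketch of this full-data estimator (assuming `x` holds the samples and `labels` holds the known generating component of each sample):

```python
import numpy as np

def full_data_mle(x, labels, K):
    """ML estimates when the generating component of every sample is known."""
    x, labels = np.asarray(x), np.asarray(labels)
    p = np.array([(labels == i).mean() for i in range(K)])        # p_i = |S_i| / N
    mu = np.array([x[labels == i].mean() for i in range(K)])      # mu_i = E_{S_i}[x]
    sigma = np.array([x[labels == i].std() for i in range(K)])    # sigma_i^2 = V_{S_i}[x]
    return p, mu, sigma
```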
The EM theorem for mixtures simplified
Assume now that each sample i is known to be from distribution j with probability Pij. We can write down:
$Q(\theta) = E_{P_{ij}}\Big[\log \prod_i p_{j_i}\, N(x_i; \mu_{j_i}, \sigma_{j_i})\Big] = \sum_i \sum_j P_{ij} \log\big(p_j\, N(x_i; \mu_j, \sigma_j)\big)$
$Q(x \mid \theta) = \sum_i \sum_j P_{ij} \log(p_j) + \sum_i \sum_j P_{ij} \log N(x_i; \mu_j, \sigma_j)$
Again this decomposes into independent maximization problems:
$\max_{p} \sum_i \sum_j P_{ij} \log(p_j)$
$\max_{\mu_1, \sigma_1} \sum_i P_{i1} \log N(x_i; \mu_1, \sigma_1)$
$\max_{\mu_2, \sigma_2} \sum_i P_{i2} \log N(x_i; \mu_2, \sigma_2)$
$\ldots$
Solve separately:
Same maximization holds.
In the EM algorithm we used: $P_{ij} = \Pr(s_i = j \mid x_i, \theta^k)$
In this case Q depends on the current parameters, so we denote it $Q(\theta \mid \theta^k)$.
What is missing? Q is not L!
Deriving the EM formula:
Expectation-Maximization
$P(h, x \mid \theta) = P(h \mid x, \theta)\, P(x \mid \theta)$
$\log P(x \mid \theta) = \log P(h, x \mid \theta) - \log P(h \mid x, \theta) \quad \text{(for any } h\text{)}$
$\log P(x \mid \theta) = \sum_h P(h \mid x, \theta^k) \log P(h, x \mid \theta) - \sum_h P(h \mid x, \theta^k) \log P(h \mid x, \theta)$
$Q(\theta \mid \theta^k) = \sum_h P(h \mid x, \theta^k) \log P(h, x \mid \theta)$
$\log P(x \mid \theta) - \log P(x \mid \theta^k) = Q(\theta \mid \theta^k) - Q(\theta^k \mid \theta^k) + \sum_h P(h \mid x, \theta^k) \log \dfrac{P(h \mid x, \theta^k)}{P(h \mid x, \theta)}$
The last term is a relative entropy, hence $\ge 0$, so the EM maximization
$\theta^{k+1} = \arg\max_\theta Q(\theta \mid \theta^k)$
can only increase the likelihood.
(Dempster et al., 1977)
KL-divergence
Entropy (Shannon): $H(P) = -\sum_i P(x_i) \log P(x_i)$
$\min H(P) = 0$; $\max H(P) = \log K$, attained at $P_k = 1/K$ for all $k$
Kullback-Leibler divergence: $D(P \| Q) = \sum_i P(x_i) \log \dfrac{P(x_i)}{Q(x_i)}$
$-D(P \| Q) = \sum_i P(x_i) \log \dfrac{Q(x_i)}{P(x_i)} \le \sum_i P(x_i)\Big(\dfrac{Q(x_i)}{P(x_i)} - 1\Big) = \sum_i Q(x_i) - \sum_i P(x_i) = 0$, using $\log(u) \le u - 1$
$D(P \| Q) \ne D(Q \| P)$: not a metric!!
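These quantities are easy to check numerically (a small sketch with made-up distributions):

```python
import numpy as np

P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.4, 0.4, 0.2])

entropy = -np.sum(P * np.log(P))     # H(P)
kl_pq = np.sum(P * np.log(P / Q))    # D(P || Q) >= 0
kl_qp = np.sum(Q * np.log(Q / P))    # D(Q || P), generally != D(P || Q)
```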
Bayesian learning vs. Maximum likelihood
Maximum likelihood estimator: $\theta_{ML} = \arg\max_\theta L(\theta \mid D)$
Introducing prior beliefs on the process (alternatively: think of virtual evidence) and computing posterior probabilities on the parameters:
MAP: $\theta_{MAP} = \arg\max_\theta \Pr(D \mid \theta)\, \Pr(\theta)$
PME: $\theta_{PME} = \int \theta\, \Pr(\theta \mid D)\, d\theta$
[Figure: parameter space, comparing the MLE (no prior beliefs) with the MAP and PME estimates (with beliefs)]
Your Task
Preparations:
• Get your hands on the ChIP-seq profiles of CTCF and PolII in hg chr17 (bin size = 50bp)
• Cut the data into segments of 50,000 data points
Modeling:
• Use EM to build a probabilistic model for the peak signals and the background
• Use heuristics for peak finding to initialize the EM
Analysis:
• Test if your model for a single peak structure is as good as the model for two peak structures
• Compute the distribution of peaks relative to transcription start sites
Your Task: Preparations
• Background on ChIP-seq
• CTCF and PolII
• Modeling ChIP-seq, binning
[Figure: background state B and peak states P1, P2, P3, ... over the binned signal]
Your Task: Modeling
$P(x \mid P_1) = N(x; \mu_1, \sigma_1)$
$P(x \mid P_2) = N(x; \mu_2, \sigma_2)$
$P(x \mid P_3) = N(x; \mu_3, \sigma_3)$
$P(x \mid P_4) = N(x; \mu_4, \sigma_4)$
$\ldots$
$P(x \mid B) = N(x; \mu, \sigma)$
The model uses K states for the peak and one state for the background. Use K = 40.
[Figure: HMM topology with start state S and finish state F]
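One plausible way to encode this topology (a sketch only; the exact transition structure, background to P1 through PK and back, and the peak-entry probability are illustrative assumptions, not a specification from the exercise):

```python
import numpy as np

def peak_hmm_transitions(K=40, p_enter=1e-4):
    """Transition matrix for a sketch topology: state 0 is the background B, states 1..K are P1..PK."""
    T = np.zeros((K + 1, K + 1))
    T[0, 0] = 1.0 - p_enter      # stay in the background
    T[0, 1] = p_enter            # enter a peak at P1
    for i in range(1, K):
        T[i, i + 1] = 1.0        # walk deterministically through the peak states
    T[K, 0] = 1.0                # after PK, return to the background
    return T
```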
Your Task: Modeling
Implement HMM inference: forward-backward - let’s write them together
Make sure the total probability you get from the forward and the backward algorithm is equal!
Implement the EM update rules - let’s write them together
Run EM from multiple random points and record the likelihoods you derive
Implement smarter initialization: take the average values around all probes with value over a threshold.
Compute posterior peak probabilities: report all loci with P(Peak)>0.8
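A sketch of the posterior peak call, reusing the `forward_backward` sketch above (here state 0 is assumed to be the background and all other states are peak states; with continuous emissions the `emit[:, obs[i]]` lookups would be replaced by per-state normal densities):

```python
import numpy as np

def call_peaks(f, b, like, threshold=0.8):
    """Posterior state probabilities from the forward/backward matrices; report loci with P(peak) > threshold."""
    post = f * b / like                   # post[i, s] = Pr(state_i = s | data)
    p_peak = post[:, 1:].sum(axis=1)      # everything except the background state (assumed index 0)
    return np.where(p_peak > threshold)[0], p_peak
```

The forward/backward consistency check from the list above is then simply `np.isclose(like_fwd, like_bwd)`.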
Your Task: Analysis
Compare the two peak structures you get (from CTCF and PolII).
Retrain a single model jointly on the two datasets.
Compute the log-likelihood of the unified model and compare it to the sum of the log-likelihoods of the two separate models.
Optional: test if the difference is significant by:
- sampling data from the unified model
- training two models on the synthetic data and computing the likelihood delta as for the real data
- Use a set of known TSSs to compute the distribution of peaks relative to genes.
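A sketch of the optional significance test (a parametric bootstrap; `sample_from_model`, `train_em`, and `loglik` are hypothetical helpers standing in for the sampling, training, and scoring code you write in the steps above):

```python
import numpy as np

# sample_from_model, train_em and loglik are hypothetical placeholders for your own
# sampling, EM-training and log-likelihood routines from the modeling steps.
def bootstrap_delta(unified_model, n_boot=100):
    """Null distribution of the likelihood delta, by re-fitting on data sampled from the unified model."""
    deltas = []
    for _ in range(n_boot):
        x1 = sample_from_model(unified_model)     # synthetic stand-in for the first dataset
        x2 = sample_from_model(unified_model)     # synthetic stand-in for the second dataset
        both = np.concatenate([x1, x2])
        unified = train_em(both)
        m1, m2 = train_em(x1), train_em(x2)
        deltas.append(loglik(m1, x1) + loglik(m2, x2) - loglik(unified, both))
    return np.array(deltas)
```

The observed delta on the real data can then be compared against this null distribution.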