Hidden Markov Model
Special case of a Dynamic Bayesian network
  Single (hidden) state variable
  Single (observed) observation variable
Transition probability P(S’|S) assumed to be sparse
  Usually encoded by a state transition graph
[Figure: network G (nodes S, S’, O’), initial network G0 (nodes S0, O0), and the unrolled network (states S0 to S3, observations O1 to O3)]
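[Figure: state transition graph over states S1, S2, S3, S4]

State transition representation: P(S’|S), where rows are the current state S and columns are the next state S’

        s1    s2    s3    s4
  s1    0.2   0.8   0     0
  s2    0     0     1     0
  s3    0.4   0     0     0.6
  s4    0     0.5   0     0.5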
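As a small illustration of why a sparse P(S’|S) is naturally read as a graph, the sketch below stores the matrix from the slide and converts it to an edge list that keeps only the nonzero entries. The names `STATES`, `TRANSITION`, and `to_edges` are illustrative and not part of the original slides.

```python
# Transition matrix from the slide: rows are the current state S,
# columns are the next state S'. Each row is a distribution over S'.
STATES = ["s1", "s2", "s3", "s4"]
TRANSITION = [
    [0.2, 0.8, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.4, 0.0, 0.0, 0.6],
    [0.0, 0.5, 0.0, 0.5],
]


def to_edges(states, matrix):
    """Return the graph view {source: {target: probability}}, nonzero entries only."""
    return {
        src: {dst: p for dst, p in zip(states, row) if p > 0.0}
        for src, row in zip(states, matrix)
    }


EDGES = to_edges(STATES, TRANSITION)
for src, targets in EDGES.items():
    print(src, "->", targets)
```

Only 7 of the 16 possible transitions carry probability mass here, which is what the state transition graph draws as arrows.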
Joint Probability Distribution
[Figure: unrolled network with states S0, S1, S2, S3 and observations O1, O2, O3]

$$P(S, O) = P(S_0) \prod_{i=1}^{n} P(S_i \mid S_{i-1})\, P(O_i \mid S_i)$$
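The factorization can be evaluated directly for one fully specified state and observation sequence. The following is a minimal sketch, assuming states and observations are encoded as integer indices; the two-state model (`PRIOR`, `TRANSITION`, `EMISSION`) is made up for illustration and does not come from the slides.

```python
import math


def joint_log_prob(states, observations, prior, transition, emission):
    """log P(S_0..S_n, O_1..O_n) = log P(S_0) + sum_i [log P(S_i|S_{i-1}) + log P(O_i|S_i)].

    states has length n+1 (S_0..S_n); observations has length n (O_1..O_n).
    Assumes every probability that is looked up is nonzero.
    """
    if len(states) != len(observations) + 1:
        raise ValueError("need exactly one more state than observations")
    logp = math.log(prior[states[0]])
    for i, obs in enumerate(observations, start=1):
        logp += math.log(transition[states[i - 1]][states[i]])
        logp += math.log(emission[states[i]][obs])
    return logp


# Hypothetical two-state, two-symbol model (illustration only).
PRIOR = [0.5, 0.5]
TRANSITION = [[0.9, 0.1], [0.2, 0.8]]
EMISSION = [[0.7, 0.3], [0.1, 0.9]]

# S_0=0, S_1=0, S_2=1 with O_1=0, O_2=1:
# 0.5 * (0.9 * 0.7) * (0.1 * 0.9) = 0.02835
logp = joint_log_prob([0, 0, 1], [0, 1], PRIOR, TRANSITION, EMISSION)
print(math.exp(logp))
```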
Exact Inference: Variable Elimination
Inference in a simple chain
[Figure: chain X1 → X2 → X3]
Computing P(X2)

$$P(X_2) = \sum_{x_1} P(x_1, X_2) = \sum_{x_1} P(x_1)\, P(X_2 \mid x_1)$$

All the numbers for this computation are in the CPDs of the original Bayesian network
O(|X1||X2|) operations
Exact Inference: Variable Elimination
Inference in a simple chain
Computing P(X3)

$$P(X_3) = \sum_{x_2} P(x_2, X_3) = \sum_{x_2} P(x_2)\, P(X_3 \mid x_2)$$
P(X3|X2) is a given CPD
P(X2) was computed above
O(|X1||X2|+|X2||X3|) operations
Exact Inference: Variable Elimination
Inference in a general chain
[Figure: chain X1 → X2 → X3 → ... → Xn]
Computing P(Xn)
  Compute each P(Xi+1) from P(Xi)
  k^2 operations for each computation (assuming |Xi|=k)
  O(nk^2) operations for the inference
  Compare to k^n operations required in summing over all possible entries in the joint distribution over X1,...,Xn
Inference in a general chain can be done in linear time!
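The linear-time claim can be checked on a toy chain. The sketch below computes P(Xn) by propagating one marginal at a time, about k^2 multiplications per step, and compares the result with a brute-force sum over all joint assignments. The function names and the numbers in `PRIOR` and `CPDS` are illustrative assumptions.

```python
from itertools import product


def chain_marginal(prior, cpds):
    """P(X_n) by forward propagation.

    prior is P(X_1); cpds[t][a][b] = P(X_{t+2} = b | X_{t+1} = a).
    """
    dist = list(prior)
    for cpd in cpds:
        k_next = len(cpd[0])
        dist = [
            sum(dist[a] * cpd[a][b] for a in range(len(dist)))
            for b in range(k_next)
        ]
    return dist


def brute_force_marginal(prior, cpds):
    """Same quantity by summing the full joint (exponential cost, for checking only)."""
    sizes = [len(prior)] + [len(cpd[0]) for cpd in cpds]
    result = [0.0] * sizes[-1]
    for assignment in product(*(range(s) for s in sizes)):
        p = prior[assignment[0]]
        for t, cpd in enumerate(cpds):
            p *= cpd[assignment[t]][assignment[t + 1]]
        result[assignment[-1]] += p
    return result


# Hypothetical 4-variable chain with binary variables.
PRIOR = [0.6, 0.4]
CPDS = [[[0.7, 0.3], [0.2, 0.8]]] * 3

fast = chain_marginal(PRIOR, CPDS)
slow = brute_force_marginal(PRIOR, CPDS)
print(fast, slow)
```

Both routes give the same distribution; only the propagation version stays cheap as the chain grows.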
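Exact Inference: Variable Elimination
[Figure: chain X1 → X2 → X3 → X4]

$$
\begin{aligned}
P(X_4) &= \sum_{X_1} \sum_{X_2} \sum_{X_3} P(X_1, X_2, X_3, X_4) \\
&= \sum_{X_1} \sum_{X_2} \sum_{X_3} P(X_1)\, P(X_2 \mid X_1)\, P(X_3 \mid X_2)\, P(X_4 \mid X_3) \\
&= \sum_{X_3} P(X_4 \mid X_3) \sum_{X_2} P(X_3 \mid X_2) \sum_{X_1} P(X_1)\, P(X_2 \mid X_1) \\
&= \sum_{X_3} P(X_4 \mid X_3) \sum_{X_2} P(X_3 \mid X_2)\, \phi(X_2) \\
&= \sum_{X_3} P(X_4 \mid X_3)\, \phi(X_3) \\
&= \phi(X_4)
\end{aligned}
$$

Here $\phi(X_2) = \sum_{X_1} P(X_1)\, P(X_2 \mid X_1)$ is the factor left after eliminating $X_1$, and $\phi(X_3)$, $\phi(X_4)$ are defined in the same way.
Pushing summations = Dynamic programming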
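Inference
[Figure: unrolled network with states S0, S1, S2, S3 and observations O1, O2, O3]
Computing P(Si)

$$
\begin{aligned}
P(S_i) &= \sum_{S_0,\ldots,S_{i-1},\,O_1,\ldots,O_i} P(S_0) \prod_{j=1}^{i} P(S_j \mid S_{j-1})\, P(O_j \mid S_j) \\
&= \sum_{S_1,\ldots,S_{i-1},\,O_1,\ldots,O_i} P(O_1 \mid S_1) \prod_{j=2}^{i} P(S_j \mid S_{j-1})\, P(O_j \mid S_j) \sum_{S_0} P(S_0)\, P(S_1 \mid S_0) \\
&= \sum_{S_1,\ldots,S_{i-1},\,O_1,\ldots,O_i} P(O_1 \mid S_1) \prod_{j=2}^{i} P(S_j \mid S_{j-1})\, P(O_j \mid S_j)\, F_{S_0}(S_1)
\end{aligned}
$$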
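Inference
Computing P(Si)

Eliminating $S_0$ leaves the factor $F_{S_0}(S_1) = \sum_{S_0} P(S_0)\, P(S_1 \mid S_0)$. The same step is then applied to $S_1$ and to $O_1$:

$$
\begin{aligned}
P(S_i) &= \sum_{S_1,\ldots,S_{i-1},\,O_1,\ldots,O_i} P(O_1 \mid S_1) \prod_{j=2}^{i} P(S_j \mid S_{j-1})\, P(O_j \mid S_j)\, F_{S_0}(S_1) \\
&= \sum_{S_2,\ldots,S_{i-1},\,O_1,\ldots,O_i} P(O_2 \mid S_2) \prod_{j=3}^{i} P(S_j \mid S_{j-1})\, P(O_j \mid S_j) \sum_{S_1} P(S_2 \mid S_1)\, P(O_1 \mid S_1)\, F_{S_0}(S_1) \\
&= \sum_{S_2,\ldots,S_{i-1},\,O_1,\ldots,O_i} P(O_2 \mid S_2) \prod_{j=3}^{i} P(S_j \mid S_{j-1})\, P(O_j \mid S_j)\, F_{S_1}(O_1, S_2) \\
&= \sum_{S_2,\ldots,S_{i-1},\,O_2,\ldots,O_i} P(O_2 \mid S_2) \prod_{j=3}^{i} P(S_j \mid S_{j-1})\, P(O_j \mid S_j) \sum_{O_1} F_{S_1}(O_1, S_2) \\
&= \sum_{S_2,\ldots,S_{i-1},\,O_2,\ldots,O_i} P(O_2 \mid S_2) \prod_{j=3}^{i} P(S_j \mid S_{j-1})\, P(O_j \mid S_j)\, F_{O_1}(S_2)
\end{aligned}
$$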
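Inference: Forward-Backward Algorithm
Computing P(Si|O1,...,On)

$$
\begin{aligned}
P(S_i \mid O_1,\ldots,O_n) &= \frac{P(S_i, O_1,\ldots,O_n)}{P(O_1,\ldots,O_n)} \\
&= \frac{P(O_1,\ldots,O_{i-1}, S_i)\; P(O_i,\ldots,O_n \mid S_i, O_1,\ldots,O_{i-1})}{P(O_1,\ldots,O_n)} \\
&= \frac{P(O_1,\ldots,O_{i-1}, S_i)\; P(O_i,\ldots,O_n \mid S_i)}{P(O_1,\ldots,O_n)}
\end{aligned}
$$

Forward: $P(O_1,\ldots,O_{i-1}, S_i)$
Backward: $P(O_i,\ldots,O_n \mid S_i)$
Normalization factor: $P(O_1,\ldots,O_n)$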
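Computing the Forward Step
Define: $\alpha_j(i) = P(O_1,\ldots,O_{i-1}, S_i = j)$
Initialization: $\alpha_j(0) = P(S_0 = j)$
Induction step:

$$
\begin{aligned}
\alpha_j(i+1) &= P(O_1,\ldots,O_i, S_{i+1} = j) \\
&= \sum_x P(O_1,\ldots,O_i, S_i = x, S_{i+1} = j) \\
&= \sum_x P(O_1,\ldots,O_{i-1}, S_i = x)\; P(O_i, S_{i+1} = j \mid O_1,\ldots,O_{i-1}, S_i = x) \\
&= \sum_x P(O_1,\ldots,O_{i-1}, S_i = x)\; P(O_i, S_{i+1} = j \mid S_i = x) \\
&= \sum_x \alpha_x(i)\; P(O_i \mid S_i = x)\; P(S_{i+1} = j \mid S_i = x)
\end{aligned}
$$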
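The forward recursion can be sketched as follows, under the convention α_j(i) = P(O_1,…,O_{i-1}, S_i = j) with α_j(0) = P(S_0 = j). Two assumptions are made here that the slides do not spell out: S_0 emits nothing, so its emission factor is taken as 1, and the recursion is run one step past the last observation so that α(n+1) sums to the evidence. The toy model and all function names are illustrative.

```python
from itertools import product


def forward(prior, transition, emission, observations):
    """Return alpha with alpha[i][j] = P(O_1..O_{i-1}, S_i = j) for i = 0..n+1.

    observations[t] holds O_{t+1}. Assumption: S_0 emits nothing.
    """
    k, n = len(prior), len(observations)
    alpha = [list(prior)]  # alpha[0][j] = P(S_0 = j)
    for i in range(n + 1):
        prev = alpha[i]
        # P(O_i | S_i = x); no observation is attached to S_0
        emit = [1.0 if i == 0 else emission[x][observations[i - 1]] for x in range(k)]
        alpha.append([
            sum(prev[x] * emit[x] * transition[x][j] for x in range(k))
            for j in range(k)
        ])
    return alpha


def brute_force_evidence(prior, transition, emission, observations):
    """P(O_1..O_n) by enumerating every state path S_0..S_n (for checking only)."""
    k, n = len(prior), len(observations)
    total = 0.0
    for path in product(range(k), repeat=n + 1):
        p = prior[path[0]]
        for i in range(1, n + 1):
            p *= transition[path[i - 1]][path[i]] * emission[path[i]][observations[i - 1]]
        total += p
    return total


# Hypothetical two-state model and observation sequence.
PRIOR = [0.6, 0.4]
TRANSITION = [[0.7, 0.3], [0.4, 0.6]]
EMISSION = [[0.9, 0.1], [0.2, 0.8]]
OBS = [0, 1, 1]

alpha = forward(PRIOR, TRANSITION, EMISSION, OBS)
print(sum(alpha[-1]), brute_force_evidence(PRIOR, TRANSITION, EMISSION, OBS))
```

Each step costs k^2 multiplications, so the whole pass is O(nk^2), matching the chain-inference cost.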
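Computing the Backward Step
Define: $\beta_j(i) = P(O_i,\ldots,O_n \mid S_i = j)$
Initialization: $\beta_j(n+1) = 1$
Induction step:

$$
\begin{aligned}
\beta_j(i) &= P(O_i,\ldots,O_n \mid S_i = j) \\
&= \sum_x P(O_i,\ldots,O_n, S_{i+1} = x \mid S_i = j) \\
&= \sum_x P(O_i, S_{i+1} = x \mid S_i = j)\; P(O_{i+1},\ldots,O_n \mid O_i, S_{i+1} = x, S_i = j) \\
&= \sum_x P(O_i \mid S_i = j)\; P(S_{i+1} = x \mid O_i, S_i = j)\; P(O_{i+1},\ldots,O_n \mid S_{i+1} = x) \\
&= \sum_x P(O_i \mid S_i = j)\; P(S_{i+1} = x \mid S_i = j)\; \beta_x(i+1)
\end{aligned}
$$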
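Computing Evidence Probability
Since $\alpha_j(i) = P(O_1,\ldots,O_{i-1}, S_i = j)$, then:

$$P(O) = \sum_x P(O_1,\ldots,O_n, S_n = x) = \sum_x \alpha_x(n)\, P(O_n \mid S_n = x)$$

Since $\beta_j(i) = P(O_i,\ldots,O_n \mid S_i = j)$, then:

$$P(O) = \sum_x P(S_1 = x)\, P(O_1,\ldots,O_n \mid S_1 = x) = \sum_x P(S_1 = x)\, \beta_x(1)$$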
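As a consistency check, the evidence can be computed from both directions and the posteriors P(S_i = j | O) assembled from their product. The sketch below assumes the conventions α_j(i) = P(O_1,…,O_{i-1}, S_i = j) and β_j(i) = P(O_i,…,O_n | S_i = j) with β_j(n+1) = 1, and treats S_0 as emitting nothing. The toy model and function names are illustrative, not taken from the slides.

```python
def forward(prior, transition, emission, observations):
    """alpha[i][j] = P(O_1..O_{i-1}, S_i = j) for i = 0..n (S_0 emits nothing)."""
    k, n = len(prior), len(observations)
    alpha = {0: list(prior)}
    for i in range(n):
        emit = [1.0 if i == 0 else emission[x][observations[i - 1]] for x in range(k)]
        alpha[i + 1] = [
            sum(alpha[i][x] * emit[x] * transition[x][j] for x in range(k))
            for j in range(k)
        ]
    return alpha


def backward(transition, emission, observations, k):
    """beta[i][j] = P(O_i..O_n | S_i = j) for i = 1..n+1, with beta[n+1][j] = 1."""
    n = len(observations)
    beta = {n + 1: [1.0] * k}
    for i in range(n, 0, -1):
        beta[i] = [
            emission[j][observations[i - 1]]
            * sum(transition[j][x] * beta[i + 1][x] for x in range(k))
            for j in range(k)
        ]
    return beta


# Hypothetical two-state model and observation sequence.
PRIOR = [0.6, 0.4]
TRANSITION = [[0.7, 0.3], [0.4, 0.6]]
EMISSION = [[0.9, 0.1], [0.2, 0.8]]
OBS = [0, 1, 1]
K, N = len(PRIOR), len(OBS)

alpha = forward(PRIOR, TRANSITION, EMISSION, OBS)
beta = backward(TRANSITION, EMISSION, OBS, K)

# Evidence from the forward side: sum_x alpha_x(n) P(O_n | S_n = x)
evidence_fwd = sum(alpha[N][x] * EMISSION[x][OBS[N - 1]] for x in range(K))
# Evidence from the backward side: sum_x P(S_1 = x) beta_x(1); alpha[1] is P(S_1)
evidence_bwd = sum(alpha[1][x] * beta[1][x] for x in range(K))

# Posterior P(S_i = j | O) = alpha_j(i) * beta_j(i) / P(O) for i = 1..n
posterior = {
    i: [alpha[i][j] * beta[i][j] / evidence_fwd for j in range(K)]
    for i in range(1, N + 1)
}
print(evidence_fwd, evidence_bwd, posterior)
```

Under these definitions α_j(i)·β_j(i) = P(O_1,…,O_n, S_i = j), so each posterior row sums to 1 after dividing by the evidence.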
Assignment 3
Part 1: Constructing and evaluating a nucleosome probability model
  Model 1: zero order Markov model
  Model 2: first order Markov model
Both models have two components:
  PN: Position-dependent distribution over nucleotides
  PL: Position-independent distribution over nucleotides
  P=PN/PL
Assignment 3
PN:
Markov order 0:

$$P_N(S) = \prod_{i=1}^{147} P_{N,i}(S_i)$$

Markov order 1:

$$P_N(S) = P_{N,1}(S_1) \prod_{i=2}^{147} P_{N,i}(S_i \mid S_{i-1})$$

Estimating PN
  Create an alignment from all nucleosome reads and the reverse complement of each read
  Estimate PN,i from counts in the data. Example for Markov order 1:

$$P_{N,k}(S_k = i \mid S_{k-1} = j) = \frac{\#(S_k = i \mid S_{k-1} = j)}{\sum_x \#(S_k = x \mid S_{k-1} = j)}$$

  where #(Sk=i|Sk-1=j) is the number of times that the nucleotide at position k in the alignment is i, AND the nucleotide at position k-1 in the alignment is j
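The counting scheme for the order-1 position-dependent model can be sketched as below. This is an illustrative sketch, not the assignment's reference solution: reads are assumed to be equal-length strings over A, C, G, T (147 bp in the assignment, 5 bp in the toy example), and the `pseudocount` parameter is an added assumption to avoid dividing by zero for contexts that never occur. Setting it to 0 gives the plain count ratio whenever every context is observed.

```python
from collections import Counter

ALPHABET = "ACGT"
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Reverse the sequence and complement each base."""
    return "".join(COMPLEMENT[c] for c in reversed(seq))


def estimate_pn_order1(reads, pseudocount=1.0):
    """Estimate P_{N,1}(S_1) and P_{N,k}(S_k = i | S_{k-1} = j) from aligned reads.

    Returns (first, cond): first[c] is the distribution at the first position,
    and cond[k - 1][j][i] is the conditional probability at 0-based position k.
    pseudocount is a hypothetical smoothing term, not part of the slides.
    """
    aligned = list(reads) + [reverse_complement(r) for r in reads]
    length = len(aligned[0])
    if any(len(r) != length for r in aligned):
        raise ValueError("all reads must have the same length")

    first_counts = Counter(r[0] for r in aligned)
    total_first = sum(first_counts[c] + pseudocount for c in ALPHABET)
    first = {c: (first_counts[c] + pseudocount) / total_first for c in ALPHABET}

    cond = []
    for k in range(1, length):
        counts = Counter((r[k - 1], r[k]) for r in aligned)
        table = {}
        for j in ALPHABET:
            denom = sum(counts[(j, x)] + pseudocount for x in ALPHABET)
            table[j] = {i: (counts[(j, i)] + pseudocount) / denom for i in ALPHABET}
        cond.append(table)
    return first, cond


# Toy example with two 5 bp reads.
first, cond = estimate_pn_order1(["ACGTA", "ACGTT"])
print(cond[0]["A"])
```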
Assignment 3
PL:
Markov order 0:

$$P_L(S) = \prod_{i=1}^{147} P_L(S_i)$$

Markov order 1:

$$P_L(S) = P_L(S_1) \prod_{i=2}^{147} P_L(S_i \mid S_{i-1})$$

Estimating PL
  For Markov order 0: compute the average number of reads that cover each of the possible 4 basepairs in the genome
  For Markov order 1: compute the average number of reads that cover each of the possible 16 dinucleotides in the genome
  Estimate PL from counts in the data. Example for Markov order 1:

$$P_L(S_k = i \mid S_{k-1} = j) = \frac{A(S_k = i, S_{k-1} = j)}{\sum_x A(S_k = x, S_{k-1} = j)}$$

  where A(Sk=i, Sk-1=j) is the average coverage of the dinucleotide with j at position k-1 and i at position k, computed as explained above
Assignment 3
Evaluating the model
Construct the model in a cross validation scheme, i.e., create it only using the data of chromosomes 1-8
Test the model (order 0 & 1) on the held-out chromosomes
Compute the log-likelihood of all held-out nucleosome reads (work in log-space!)
Compare to the log-likelihood of a random selection of sequences from the genome
Compare to the log-likelihood of permutations of the sequences
$$\log P(S) = \log P_N(S) - \log P_L(S)$$
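Working in log-space turns the ratio P = PN/PL into a sum of per-position log terms. The sketch below shows this for the order-0 model only; the table layout (`pn` as a list of per-position dictionaries, `pl` as one dictionary) and the toy 3 bp values are assumptions made for illustration.

```python
import math


def log_score_order0(seq, pn, pl):
    """log P(S) = sum_k [log P_{N,k}(S_k) - log P_L(S_k)] for the order-0 model.

    pn[k][c] is the position-dependent probability of base c at position k;
    pl[c] is the position-independent probability of base c.
    Assumes all looked-up probabilities are nonzero.
    """
    if len(seq) != len(pn):
        raise ValueError("sequence length must match the model length")
    return sum(math.log(pn[k][c]) - math.log(pl[c]) for k, c in enumerate(seq))


# Toy 3 bp model (the assignment uses 147 positions).
UNIFORM = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
PN = [{"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}, UNIFORM, UNIFORM]
PL = UNIFORM

score = log_score_order0("AAA", PN, PL)
print(score)  # log(0.4 / 0.25), since the other two positions contribute 0
```

A positive score means the sequence is more likely under the nucleosome model than under the background model.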
Assignment 3
Evaluating the model (cont.)
Test the model (order 0 & 1) on the held-out chromosomes
Create an ROC evaluation
  Select a threshold t, equal to the average number of reads per basepair in the genome
  Define ‘positive’ regions as maximal contiguous regions in which every basepair is above t. Remove regions whose size is <50bp
  Define ‘negative’ regions as maximal contiguous regions in which every basepair is below t. Remove regions whose size is <50bp
  Use the model to score each region as the average score of the basepairs it contains, where the score of each basepair is the average of the 147 model scores that cover that basepair
  Create an ROC curve using these positive and negative regions. This is done by ranking all regions according to the model scores (above), and plotting, at each rank, the false positive rate (x-axis) vs. true positive rate (y-axis)
  Compute the AUC (area under the curve)
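The ranking procedure can be sketched as follows: regions are sorted by model score from highest to lowest, and after each rank the running false positive and true positive rates give one point on the curve. Grouping tied scores into a single step and using the trapezoid rule for the area are choices made for this sketch, not requirements stated in the assignment. The scores are made up.

```python
def roc_curve(pos_scores, neg_scores):
    """Return (FPR, TPR) points obtained by ranking all regions by descending score."""
    labeled = sorted(
        [(s, 1) for s in pos_scores] + [(s, 0) for s in neg_scores],
        key=lambda item: -item[0],
    )
    n_pos, n_neg = len(pos_scores), len(neg_scores)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(labeled):
        j = i
        # Consume all regions tied at this score before emitting a point.
        while j < len(labeled) and labeled[j][0] == labeled[i][0]:
            tp += labeled[j][1]
            fp += 1 - labeled[j][1]
            j += 1
        points.append((fp / n_neg, tp / n_pos))
        i = j
    return points


def auc(points):
    """Area under the curve by the trapezoid rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Hypothetical region scores.
POS = [0.9, 0.8, 0.4]
NEG = [0.7, 0.3, 0.2]

points = roc_curve(POS, NEG)
area = auc(points)
print(points, area)
```

In this toy case 8 of the 9 positive/negative pairs are ranked correctly, which matches an AUC of 8/9.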
Assignment 3
Use the model in an HMM framework and compute the average nucleosome occupancy at each basepair
Easiest to view as a generalized HMM with two states
  Si=0: no nucleosome starts at position i
  Si=1: nucleosome starts at position i
Notes
  Emission probability given S=1 is taken from the nucleosome model
  Emission probability given S=0 is uniform over all basepairs
  Placing a nucleosome ‘emits’ 147 basepairs
  Implement a uniform non-normalized transition probability between the two states, i.e., W(S=0)=1, W(S=1)=1
Compute P(Si=0|O) and P(Si=1|O) for every basepair
Compute the average occupancy at each basepair as

$$P(i) = \sum_{j=i-146}^{i} P(S_j = 1 \mid O)$$
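Given the per-position start posteriors P(S_j = 1 | O), the occupancy at basepair i is a sliding-window sum over the 147 start positions whose nucleosome would cover i. A minimal sketch using prefix sums is shown below; the function name, the truncation of the window at the start of the sequence, and the toy posteriors are assumptions.

```python
def average_occupancy(start_posterior, footprint=147):
    """occupancy[i] = sum of start_posterior[j] for j in [i - footprint + 1, i].

    start_posterior[j] is P(S_j = 1 | O). The window is truncated at the
    start of the sequence, where fewer than `footprint` positions exist.
    """
    prefix = [0.0]
    for p in start_posterior:
        prefix.append(prefix[-1] + p)
    return [
        prefix[i + 1] - prefix[max(0, i - footprint + 1)]
        for i in range(len(start_posterior))
    ]


# Toy example with a 3 bp footprint instead of 147.
occupancy = average_occupancy([0.1, 0.0, 0.2, 0.0, 0.0], footprint=3)
print(occupancy)
```

The prefix sums keep the cost linear in the sequence length, which matters for a genome-wide computation.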
Assignment 3
Evaluating the HMM model
Generate a plot of average occupancy of the real data and the model predictions at a 2000bp region of your choice
Perform the same ROC analysis as with the previous model, except that scores of the positive and negative regions are now computed as the average nucleosome occupancy of those regions according to your genome-wide computation