[State diagram: CpG-island states A+, C+, G+, T+ and background states A-, C-, G-, T-]
A modeling Example
CpG islands in DNA sequences
Methylation & Silencing
• One way cells differentiate is methylation: addition of CH3 to C-nucleotides, which silences genes in the region
• CG (denoted CpG) often mutates to TG, when methylated
• In each cell, one copy of the X chromosome is silenced; methylation plays a role
• Methylation is inherited during cell division
Example: CpG Islands
CpG nucleotides in the genome are frequently methylated
(Written CpG, with “p” for the phosphodiester bond, to avoid confusion with the C·G base pair)
C → methyl-C → T
Methylation is often suppressed around genes and promoters → CpG islands
Example: CpG Islands
• In CpG islands, CG is more frequent
• Other pairs (AA, AG, AT, …) have different frequencies
Question: Detect CpG islands computationally
A model of CpG Islands – (1) Architecture
[Architecture diagram: states A+, C+, G+, T+ model positions inside a CpG island; states A-, C-, G-, T- model positions outside]
A model of CpG Islands – (2) Transitions
How do we estimate the parameters of the model?
Emission probabilities: 1/0 (each state emits only its own nucleotide)
1. Transition probabilities within CpG islands
Established from known CpG islands (Training Set)
2. Transition probabilities within other regions
Established from known non-CpG islands (Training Set)
Note: the transitions out of each state sum to one, leaving no probability for transitions between (+) and (-) states
Transition probabilities within CpG islands (+):

      A     C     G     T
A  .180  .274  .426  .120
C  .171  .368  .274  .188
G  .161  .339  .375  .125
T  .079  .355  .384  .182

Transition probabilities within other regions (-):

      A     C     G     T
A  .300  .205  .285  .210
C  .233  .298  .078  .302
G  .248  .246  .298  .208
T  .177  .239  .292  .292

(Each row sums to 1.)
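A minimal sketch of step 1, counting dinucleotide transitions in a training set (our own Python, not from the slides; the toy sequences below stand in for real annotated CpG islands):

```python
from collections import Counter

def transition_table(seqs):
    """Estimate a(u, v) = P(next = v | current = u) from training sequences."""
    pairs = Counter()
    for s in seqs:
        for u, v in zip(s, s[1:]):   # count every adjacent dinucleotide
            pairs[u, v] += 1
    table = {}
    for u in "ACGT":
        row_total = sum(pairs[u, v] for v in "ACGT")
        table[u] = {v: pairs[u, v] / row_total for v in "ACGT"}  # rows sum to 1
    return table

# Toy stand-ins for known CpG islands (the '+' training set)
plus_table = transition_table(["GCGCGACGTCGAC", "ACGCGTCGCGCA"])
```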
Log Likelihoods: Telling “CpG Island” from “Non-CpG Island”
      A       C       G       T
A  -0.740  +0.419  +0.580  -0.803
C  -0.913  +0.302  +1.812  -0.685
G  -0.624  +0.461  +0.331  -0.730
T  -1.169  +0.573  +0.393  -0.679
Another way to see effects of transitions:
Log likelihoods
L(u, v) = log2[ P(uv | + ) / P(uv | -) ]
Given a region x = x1…xN
A quick-&-dirty way to decide whether entire x is CpG
P(x is CpG) > P(x is not CpG)
⇔ Σi L(xi, xi+1) > 0
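The same test in code: a small sketch (ours) that derives L(u, v) from the two transition tables with base-2 logs, which reproduces the table above, and applies the quick-&-dirty rule:

```python
import math

plus = {  # transition probabilities within CpG islands (+)
    'A': {'A': .180, 'C': .274, 'G': .426, 'T': .120},
    'C': {'A': .171, 'C': .368, 'G': .274, 'T': .188},
    'G': {'A': .161, 'C': .339, 'G': .375, 'T': .125},
    'T': {'A': .079, 'C': .355, 'G': .384, 'T': .182},
}
minus = {  # transition probabilities within other regions (-)
    'A': {'A': .300, 'C': .205, 'G': .285, 'T': .210},
    'C': {'A': .233, 'C': .298, 'G': .078, 'T': .302},
    'G': {'A': .248, 'C': .246, 'G': .298, 'T': .208},
    'T': {'A': .177, 'C': .239, 'G': .292, 'T': .292},
}

# L(u, v) = log2[ P(uv | +) / P(uv | -) ], matching the table above
L = {u: {v: math.log2(plus[u][v] / minus[u][v]) for v in "ACGT"} for u in "ACGT"}

def cpg_score(x):
    """Sum of L(xi, xi+1); > 0 means x looks more like a CpG island."""
    return sum(L[u][v] for u, v in zip(x, x[1:]))

print(cpg_score("GCGCGCGCGC") > 0)  # True: CG-rich region scores positive
print(cpg_score("ATATATATAT") > 0)  # False
```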
A model of CpG Islands – (2) Transitions
• What about transitions between (+) and (-) states?
• They affect:
Avg. length of CpG island
Avg. separation between two CpG islands
[Two-state diagram: X stays in X with probability p and moves to Y with probability 1-p; Y stays in Y with probability q and moves to X with probability 1-q]
Length distribution of region X:
P[lX = 1] = 1 - p
P[lX = 2] = p(1 - p)
…
P[lX = k] = p^(k-1) (1 - p)

Geometric distribution, with mean E[lX] = 1/(1 - p)
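A quick numeric check of the geometric mean length (our own sketch; p = 0.9 is an arbitrary choice for illustration):

```python
import random

p = 0.9            # X -> X self-transition probability (assumed value)
trials = 100_000
total_length = 0
for _ in range(trials):
    length = 1
    while random.random() < p:   # stay in X with probability p
        length += 1
    total_length += length

print(total_length / trials)     # ~ 10.0, i.e. 1 / (1 - p)
```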
What if a new genome comes?
• We just sequenced the porcupine genome
• We know CpG islands play the same role in this genome
• However, we have no known CpG islands for porcupines
• We suspect the frequency and characteristics of CpG islands are quite different in porcupines
How do we adjust the parameters in our model?
LEARNING
Learning
Re-estimate the parameters of the model based on training data
Two learning scenarios
1. Estimation when the “right answer” is known
Examples:
GIVEN: a genomic region x = x1…x1,000,000 where we have good (experimental) annotations of the CpG islands
GIVEN: the casino player allows us to observe him one evening, as he changes dice and produces 10,000 rolls
2. Estimation when the “right answer” is unknown
Examples:
GIVEN: the porcupine genome; we don’t know how frequent the CpG islands are there, nor do we know their composition
GIVEN: 10,000 rolls of the casino player, but we don’t see when he changes dice
QUESTION: Update the parameters θ of the model to maximize P(x|θ)
1. When the right answer is known
Given x = x1…xN
for which the true π = π1…πN is known,

Define:
Akl = # times k→l transition occurs in π
Ek(b) = # times state k in π emits b in x

We can show that the maximum likelihood parameters θ (maximizing P(x|θ)) are:

akl = Akl / Σi Aki        ek(b) = Ek(b) / Σc Ek(c)
1. When the right answer is known
Intuition: When we know the underlying states, the best estimate is the normalized frequency of the transitions & emissions that occur in the training data
Drawback: Given little data, there may be overfitting:
P(x|θ) is maximized, but θ is unreasonable
0 probabilities – BAD
Example: Given 10 casino rolls, we observe
x = 2, 1, 5, 6, 1, 2, 3, 6, 2, 3
π = F, F, F, F, F, F, F, F, F, F
Then: aFF = 1; aFL = 0
eF(1) = eF(3) = eF(6) = .2; eF(2) = .3; eF(4) = 0; eF(5) = .1
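The same counting in code (our own sketch, not from the slides; the `or 1` guard is only there so states that never appear in π yield 0 rather than a division error):

```python
from collections import defaultdict

def ml_estimate(x, pi, states, alphabet):
    """Count transitions/emissions along the known path pi, then normalize."""
    A = defaultdict(float)   # A[k, l] = # times k -> l occurs in pi
    E = defaultdict(float)   # E[k, b] = # times state k emits b in x
    for i, (s, b) in enumerate(zip(pi, x)):
        E[s, b] += 1
        if i + 1 < len(pi):
            A[s, pi[i + 1]] += 1
    # a_kl = A_kl / sum_i A_ki ; e_k(b) = E_k(b) / sum_c E_k(c)
    a = {(k, l): A[k, l] / (sum(A[k, m] for m in states) or 1)
         for k in states for l in states}
    e = {(k, b): E[k, b] / (sum(E[k, c] for c in alphabet) or 1)
         for k in states for b in alphabet}
    return a, e

# The 10 rolls above: aFF = 1, aFL = 0, eF(4) = 0 -- overfitting in action
a, e = ml_estimate("2156123623", "FFFFFFFFFF", "FL", "123456")
print(a[("F", "F")], a[("F", "L")], e[("F", "4")])
```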
Pseudocounts
Solution for small training sets:
Add pseudocounts
Akl = # times k→l transition occurs in π + rkl
Ek(b) = # times state k in π emits b in x + rk(b)
rkl, rk(b) are pseudocounts representing our prior belief
Larger pseudocounts → strong prior belief
Small pseudocounts (< 1): just to avoid 0 probabilities
Pseudocounts
Example: dishonest casino
We will observe player for one day, 600 rolls
Reasonable pseudocounts:
r0F = r0L = rF0 = rL0 = 1;
rFL = rLF = rFF = rLL = 1;
rF(1) = rF(2) = … = rF(6) = 20 (strong belief fair is fair)
rL(1) = rL(2) = … = rL(6) = 5 (wait and see for loaded)
Above #s are arbitrary – assigning priors is an art
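Applying the fair-die pseudocounts to the 10 rolls from the earlier example (our own sketch):

```python
# e_F(b) = (E_F(b) + r_F(b)) / sum_c (E_F(c) + r_F(c))
counts = {1: 2, 2: 3, 3: 2, 4: 0, 5: 1, 6: 2}   # the 10 rolls above
r_F = 20                                        # strong prior belief: fair is fair
total = sum(counts.values()) + 6 * r_F          # 10 + 120
e_F = {b: (c + r_F) / total for b, c in counts.items()}
print(e_F)   # every value pulled close to 1/6; eF(4) is no longer 0
```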
2. When the right answer is unknown
We don’t know the true Akl, Ek(b)
Idea:
• We estimate our “best guess” of what Akl, Ek(b) are (or we start with random / uniform values)
• We update the parameters of the model, based on our guess
• We repeat
2. When the right answer is unknown
Starting with our best guess of a model M, parameters θ:
Given x = x1…xN
for which the true π = π1…πN is unknown,
We can get to a provably more likely parameter set θ, i.e., one that increases the probability P(x | θ)
Principle: EXPECTATION MAXIMIZATION
1. Estimate Akl, Ek(b) in the training data
2. Update θ according to Akl, Ek(b)
3. Repeat 1 & 2, until convergence
Estimating new parameters
To estimate Akl (assume “| θCURRENT” in all formulas below):
At each position i of sequence x, find the probability that transition k→l is used:
P(πi = k, πi+1 = l | x) = [1/P(x)] P(πi = k, πi+1 = l, x1…xN) = Q/P(x)

where Q = P(x1…xi, πi = k, πi+1 = l, xi+1…xN)
= P(πi+1 = l, xi+1…xN | πi = k) P(x1…xi, πi = k)
= P(πi+1 = l, xi+1 xi+2…xN | πi = k) fk(i)
= P(xi+2…xN | πi+1 = l) P(xi+1 | πi+1 = l) P(πi+1 = l | πi = k) fk(i)
= bl(i+1) el(xi+1) akl fk(i)

So: P(πi = k, πi+1 = l | x, θ) = fk(i) akl el(xi+1) bl(i+1) / P(x | θCURRENT)
Estimating new parameters
• So, Akl is E[# times transition k→l is used, given current θ]
Akl = Σi P(πi = k, πi+1 = l | x, θ) = Σi fk(i) akl el(xi+1) bl(i+1) / P(x | θ)
• Similarly,
Ek(b) = [1/P(x | θ)] Σ{i | xi = b} fk(i) bk(i)
[Diagram: position i in state k and position i+1 in state l; fk(i) covers x1…xi, the transition contributes akl and el(xi+1), and bl(i+1) covers xi+2…xN]
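These two expected-count formulas in code (our own NumPy sketch; f and b are assumed to come from the forward and backward algorithms, with states and symbols encoded as integer indices):

```python
import numpy as np

def expected_counts(f, b, a, e, x, px):
    """E-step: expected transition counts A[k, l] and emission counts E[k, b].

    f, b : forward / backward matrices, shape (K, N)
    a    : transition matrix, shape (K, K)
    e    : emission matrix, shape (K, S)
    x    : observations as symbol indices, length N
    px   : P(x | theta), e.g. f[:, -1].sum()
    """
    K, N = f.shape
    A = np.zeros((K, K))
    E = np.zeros((K, e.shape[1]))
    for i in range(N - 1):
        # A_kl += f_k(i) * a_kl * e_l(x_{i+1}) * b_l(i+1) / P(x | theta)
        A += np.outer(f[:, i], e[:, x[i + 1]] * b[:, i + 1]) * a / px
    for i in range(N):
        # E_k(b) += f_k(i) * b_k(i) / P(x | theta), for b = x_i
        E[:, x[i]] += f[:, i] * b[:, i] / px
    return A, E
```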
The Baum-Welch Algorithm
Initialization:
Pick the best-guess for model parameters θ (or arbitrary)

Iteration:
1. Forward
2. Backward
3. Calculate Akl, Ek(b), given θCURRENT
4. Calculate new model parameters θNEW: akl, ek(b)
5. Calculate new log-likelihood P(x | θNEW)
GUARANTEED TO BE HIGHER BY EXPECTATION-MAXIMIZATION

Until P(x | θ) does not change much
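A self-contained sketch of the whole loop (ours, not from the slides), folding in the expected-count formulas above. It assumes a single training sequence, a uniform start distribution, and no rescaling, so f and b underflow on genome-length inputs; production code works in log space or rescales each column:

```python
import numpy as np

def forward(a, e, x, start):
    """f[k, i] = P(x1..xi, pi_i = k)."""
    K, N = len(start), len(x)
    f = np.zeros((K, N))
    f[:, 0] = start * e[:, x[0]]
    for i in range(1, N):
        f[:, i] = e[:, x[i]] * (f[:, i - 1] @ a)
    return f

def backward(a, e, x):
    """b[k, i] = P(x_{i+1}..xN | pi_i = k)."""
    K, N = a.shape[0], len(x)
    b = np.ones((K, N))
    for i in range(N - 2, -1, -1):
        b[:, i] = a @ (e[:, x[i + 1]] * b[:, i + 1])
    return b

def baum_welch(x, K, S, n_iter=200, tol=1e-6, seed=0):
    """Fit transitions a and emissions e to one sequence of symbol indices."""
    rng = np.random.default_rng(seed)
    a = rng.random((K, K)); a /= a.sum(axis=1, keepdims=True)
    e = rng.random((K, S)); e /= e.sum(axis=1, keepdims=True)
    start = np.full(K, 1.0 / K)          # uniform start distribution (assumption)
    old_ll = -np.inf
    for _ in range(n_iter):
        f = forward(a, e, x, start)                 # 1. Forward
        b = backward(a, e, x)                       # 2. Backward
        px = f[:, -1].sum()                         #    P(x | theta_CURRENT)
        A = np.zeros((K, K)); E = np.zeros((K, S))
        for i in range(len(x) - 1):                 # 3. expected counts Akl, Ek(b)
            A += np.outer(f[:, i], e[:, x[i + 1]] * b[:, i + 1]) * a / px
        for i in range(len(x)):
            E[:, x[i]] += f[:, i] * b[:, i] / px
        a = A / A.sum(axis=1, keepdims=True)        # 4. new parameters akl, ek(b)
        e = E / E.sum(axis=1, keepdims=True)        #    (real code adds pseudocounts)
        ll = np.log(px)                             # 5. log-likelihood under current theta
        if ll - old_ll < tol:                       # until P(x | theta) stops changing
            break
        old_ll = ll
    return a, e

# Toy run: 2 states, 6 symbols (the dishonest casino, with rolls coded 0..5)
x = np.array(([5] * 8 + [0, 1, 2, 3, 4, 5]) * 10)
a_hat, e_hat = baum_welch(x, K=2, S=6)
```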
The Baum-Welch Algorithm
Time Complexity:
# iterations × O(K²N)
• Guaranteed to increase the log likelihood P(x | θ)
• Not guaranteed to find globally best parameters
Converges to local optimum, depending on initial conditions
• Too many parameters / too large model: Overtraining
Alternative: Viterbi Training
Initialization: Same
Iteration:
1. Perform Viterbi, to find π*
2. Calculate Akl, Ek(b) according to π* + pseudocounts
3. Calculate the new parameters akl, ek(b)
Until convergence
Notes:
• Not guaranteed to increase P(x | θ)
• Guaranteed to increase P(x | θ, π*)
• In general, worse performance than Baum-Welch
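For comparison, a compact sketch of Viterbi training (ours, not from the slides): log-space Viterbi decoding plus the labeled-data counting from earlier, with a pseudocount r added to every count:

```python
import numpy as np

def viterbi(a, e, x, start):
    """Most probable state path pi* (log-space Viterbi with traceback)."""
    K, N = a.shape[0], len(x)
    la, le = np.log(a), np.log(e)
    V = np.full((K, N), -np.inf)
    ptr = np.zeros((K, N), dtype=int)
    V[:, 0] = np.log(start) + le[:, x[0]]
    for i in range(1, N):
        scores = V[:, i - 1, None] + la         # scores[k, l] = V_k(i-1) + log a_kl
        ptr[:, i] = scores.argmax(axis=0)       # best predecessor for each state l
        V[:, i] = scores.max(axis=0) + le[:, x[i]]
    path = [int(V[:, -1].argmax())]
    for i in range(N - 1, 0, -1):
        path.append(int(ptr[path[-1], i]))
    return path[::-1]

def viterbi_training(x, K, S, n_iter=20, r=1.0, seed=0):
    """Iterate: Viterbi decode, then re-estimate from pi* with pseudocounts r."""
    rng = np.random.default_rng(seed)
    a = rng.random((K, K)); a /= a.sum(axis=1, keepdims=True)
    e = rng.random((K, S)); e /= e.sum(axis=1, keepdims=True)
    start = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        pi = viterbi(a, e, x, start)            # 1. best path under current params
        A = np.full((K, K), r)                  # 2. counts, seeded with pseudocounts
        E = np.full((K, S), r)
        for i in range(len(x)):
            E[pi[i], x[i]] += 1
            if i + 1 < len(x):
                A[pi[i], pi[i + 1]] += 1
        a = A / A.sum(axis=1, keepdims=True)    # 3. new parameters akl, ek(b)
        e = E / E.sum(axis=1, keepdims=True)
    return a, e

# Toy run on the same casino-style data as the Baum-Welch sketch
x = np.array(([5] * 8 + [0, 1, 2, 3, 4, 5]) * 10)
a_v, e_v = viterbi_training(x, K=2, S=6)
```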