Cis-regulatory module
10/24/07
TFs often work synergistically
(Harbison 2004)
Combinatorial control
[Figure: λ phage infecting E. coli, leading to lysogenic or lytic growth (source: Gary Kaiser)]
[Figure: the λ operon, with the operator OR (sites OR1, OR2, OR3) between cI and cro. cI on, cro off: lysogenic growth. cI off, cro on: lytic growth.]
Cis-regulatory module (CRM)
• “A CRM is a DNA segment, typically a few hundred base pairs in length containing multiple binding sites, that recruits several cooperating factors to a particular genomic location” – Ji and Wong (2006)
Statistical Methods
• Predict modules when the motifs are known (simpler).
– LRA, by Wasserman and Fickett (1998)
• Predict modules when the motifs also need to be discovered (more difficult).
– CisModule, by Zhou and Wong (2004)
– EMCModule, by Gupta and Liu (2005)
LRA (logistic regression analysis)
Cooperative motifs:
Basic idea: True regulatory regions are likely to have multiple motif sites.
[Figure; axis label: probability of being regulatory]
LRA
• Training data contain a subset of known regulatory and control regions.
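$$\operatorname{logit}(p) = \log\frac{p}{1-p}$$
$$\operatorname{logit}(p) = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n$$
– $p$: probability of being a regulatory region
– $x_i$: highest motif matching score within a given sequence
– $\beta_i$: regression coefficient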
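A minimal sketch of how the two quantities on this slide could be computed: the covariate x_i as the highest log-odds match of a motif matrix within a sequence, and p by inverting the logit. The function names, the toy matrix `TOY_PWM`, the uniform background, and the coefficient values are illustrative assumptions, not taken from Wasserman and Fickett (1998).

```python
import math

# Hypothetical 3-column motif matrix favouring "ACG" (toy values).
TOY_PWM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]

def max_motif_score(seq, pwm, background=0.25):
    """Highest log-odds match score of pwm over all windows of seq.

    Assumes a uniform background; returns -inf if seq is shorter than the motif.
    """
    w = len(pwm)
    best = -math.inf
    for start in range(len(seq) - w + 1):
        score = sum(math.log(pwm[j][seq[start + j]] / background) for j in range(w))
        best = max(best, score)
    return best

def regulatory_probability(x, beta0, betas):
    """Invert logit(p) = beta0 + sum(beta_i * x_i) to obtain p."""
    eta = beta0 + sum(b * xi for b, xi in zip(betas, x))
    return 1.0 / (1.0 + math.exp(-eta))

# Usage with made-up coefficients: a sequence containing the motif gets a higher p.
x_hit = max_motif_score("CCACGCC", TOY_PWM)
x_miss = max_motif_score("CCCCCCC", TOY_PWM)
p_hit = regulatory_probability([x_hit], beta0=-1.0, betas=[1.0])
p_miss = regulatory_probability([x_miss], beta0=-1.0, betas=[1.0])
```

In a real analysis the coefficients would be fitted from the training regions rather than set by hand.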
Application: skeletal-muscle gene regulation
• 5 muscle-specific TFs are known:
– Mef-2, Myf, SRF, Tef, Sp-1
• 29 regulatory regions are known.
• Can we predict the regulatory regions just from sequence motif information?
Computational Procedure
• Motif matrices are identified by Gibbs sampling using sequence information from the 29 regulatory regions.
• For some TFs, motifs cannot be found by the de novo approach; literature motifs are used instead.
• Top two matching scores for each TF are included as covariates.
• Apply LRA model. Use leave-one-out cross-validation to evaluate model performance.
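The last step of the procedure can be sketched as follows: fit a logistic regression on all but one sequence, predict the held-out sequence, and repeat. This is a toy illustration only. The data in `TOY_X` and `TOY_Y` are made up (two covariates per sequence instead of the top two scores for each of five TFs), and the plain gradient-descent fit with a small ridge penalty is an assumption chosen to keep the example dependency-free, not the fitting method of the original study.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict(beta, x):
    """Predicted probability for covariates x; beta[0] is the intercept."""
    return sigmoid(beta[0] + sum(b * v for b, v in zip(beta[1:], x)))

def fit_logistic(X, y, lr=0.1, n_iter=2000, l2=0.01):
    """Fit coefficients by batch gradient descent (ridge penalty on slopes only)."""
    n, n_feat = len(X), len(X[0])
    beta = [0.0] * (n_feat + 1)
    for _ in range(n_iter):
        grad = [0.0] * (n_feat + 1)
        for xi, yi in zip(X, y):
            err = predict(beta, xi) - yi
            grad[0] += err
            for j, v in enumerate(xi):
                grad[j + 1] += err * v
        beta[0] -= lr * grad[0] / n
        for j in range(1, n_feat + 1):
            beta[j] -= lr * (grad[j] / n + l2 * beta[j])
    return beta

def leave_one_out(X, y, threshold=0.5):
    """Return the 0/1 prediction for each sequence when it is held out."""
    preds = []
    for i in range(len(X)):
        beta = fit_logistic(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        preds.append(1 if predict(beta, X[i]) >= threshold else 0)
    return preds

# Made-up motif scores: label 1 = regulatory region, 0 = control region.
TOY_X = [[3.0, 2.5], [2.8, 3.1], [3.2, 2.9], [2.9, 2.7],
         [0.2, 0.1], [0.5, -0.3], [-0.1, 0.4], [0.3, 0.2]]
TOY_Y = [1, 1, 1, 1, 0, 0, 0, 0]

preds = leave_one_out(TOY_X, TOY_Y)
accuracy = sum(p == t for p, t in zip(preds, TOY_Y)) / len(TOY_Y)
```

Because each prediction comes from a model that never saw the held-out sequence, the resulting accuracy is a less optimistic estimate than the fit on the full training set.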
Results
• Single motifs are highly non-specific.
• Simple multi-site analysis improves specificity at the cost of reducing sensitivity.
• Logistic regression further improves specificity at a smaller cost in sensitivity.
Limitations of LRA
• Motifs must be known in advance.
• When known regulatory sequences are few, it is difficult to identify motifs by using traditional methods.
Objective:
• Integrate motif discovery and module finding in a single statistical model.
De novo module identification
Two tasks
• Identify TF motifs
• Identify CRMs.
Why a module approach can help motif discovery
• Because motif specificity is poor, a short sequence can be enriched simply by chance.
• The probability of random matches is much smaller for motif co-occurrence.
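A back-of-the-envelope calculation illustrates the two points above. It assumes a uniform random background, an exact 6-bp consensus for each motif, independence between positions, and independence between the two motifs; the window length of 200 bp is likewise an arbitrary choice for illustration.

```python
def p_at_least_one(q, n_positions):
    """Probability of at least one chance match among n independent positions."""
    return 1.0 - (1.0 - q) ** n_positions

WINDOW = 200          # hypothetical window length in bp
MOTIF_WIDTH = 6       # hypothetical exact consensus length
q = 1.0 / 4 ** MOTIF_WIDTH            # chance of a match at one position
n = WINDOW - MOTIF_WIDTH + 1          # number of possible start positions

p_single = p_at_least_one(q, n)       # one motif matches somewhere in the window
p_pair = p_single ** 2                # two different motifs both match by chance
```

Under these assumptions a single chance match occurs in a few percent of windows, while a chance co-occurrence of two motifs in the same window is more than an order of magnitude rarer.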
cisModule
Basic idea:
• A two-level hierarchical mixture model (HMx).
– Level 1: sequences are a mixture of modules and background.
– Level 2: modules are a mixture of motifs and background.
(Zhou and Wong 2004)
HMx Model as a Stochastic Process
• Treat the HMx model as a stochastic mechanism that generates sequences.
– Starting from the first sequence position, make a series of random decisions: either initiate a module of length l or generate a letter from the background model.
– Inside a module, if a site for the kth motif is initiated at position n, generate wk letters from its PWM and place them at [n, n+wk-1]; otherwise generate a letter from the background.
– After reaching the end of the current module, decide whether to sample from the background or initiate a new module.
(Zhou and Wong 2004)
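The generative process described on this slide can be sketched as a small simulator. This is a simplified illustration, not the cisModule implementation: the motif for a candidate site is chosen uniformly at random, the same background model is used inside and outside modules, and all names and parameter values (`p_module`, `p_site`, `TOY_PWMS`, `BACKGROUND`) are hypothetical.

```python
import random

BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
# Two hypothetical motifs given as lists of per-column letter probabilities.
TOY_PWMS = [
    [{"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05}] * 5,
    [{"A": 0.05, "C": 0.05, "G": 0.85, "T": 0.05}] * 4,
]

def sample_letter(probs, rng):
    """Draw one nucleotide from a dict of letter probabilities."""
    letters = list(probs)
    return rng.choices(letters, weights=[probs[c] for c in letters], k=1)[0]

def generate_sequence(length, module_len, pwms, background, p_module, p_site, rng):
    """Generate one sequence; return (sequence, module starts, (motif, start) sites)."""
    seq, modules, sites = [], [], []
    pos = 0
    while pos < length:
        # Level 1: initiate a module here, or emit one background letter.
        if pos + module_len <= length and rng.random() < p_module:
            modules.append(pos)
            end = pos + module_len
            while pos < end:
                # Level 2: initiate a site for motif k, or emit a background letter.
                k = rng.randrange(len(pwms))
                w = len(pwms[k])
                if pos + w <= end and rng.random() < p_site:
                    sites.append((k, pos))
                    seq.extend(sample_letter(col, rng) for col in pwms[k])
                    pos += w
                else:
                    seq.append(sample_letter(background, rng))
                    pos += 1
        else:
            seq.append(sample_letter(background, rng))
            pos += 1
    return "".join(seq), modules, sites

rng = random.Random(1)
seq, modules, sites = generate_sequence(300, 40, TOY_PWMS, BACKGROUND, 0.03, 0.2, rng)
```

Inference runs this process in reverse: given only the sequences, it must recover the module positions, the site positions, and the motif matrices.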
Model inference: Gibbs sampling
• Given the alignment, update the model parameters.
• Given the model parameters, update the module/motif locations.
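The alternation between the two updates can be illustrated with a much simpler sampler than the one cisModule needs. The sketch below is a basic single-motif site sampler that assumes exactly one site per sequence, a fixed motif width, and a uniform background; it has no module level at all. Function names and default values are illustrative assumptions.

```python
import math
import random

ALPHABET = "ACGT"

def estimate_pwm(seqs, starts, w, skip=None, pseudo=0.5):
    """Given the alignment (site starts), update the motif matrix."""
    counts = [{c: pseudo for c in ALPHABET} for _ in range(w)]
    for i, (s, a) in enumerate(zip(seqs, starts)):
        if i == skip:
            continue  # leave out the sequence whose site is being resampled
        for j in range(w):
            counts[j][s[a + j]] += 1
    pwm = []
    for col in counts:
        total = sum(col.values())
        pwm.append({c: col[c] / total for c in ALPHABET})
    return pwm

def sample_start(seq, pwm, rng, background=0.25):
    """Given the motif matrix, draw a site start with weight equal to its likelihood ratio."""
    w = len(pwm)
    weights = []
    for a in range(len(seq) - w + 1):
        log_ratio = sum(math.log(pwm[j][seq[a + j]] / background) for j in range(w))
        weights.append(math.exp(log_ratio))
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]

def gibbs_site_sampler(seqs, w, n_sweeps=200, seed=0):
    """Alternate the two updates; sequences must use only A/C/G/T and be at least w long."""
    rng = random.Random(seed)
    starts = [rng.randrange(len(s) - w + 1) for s in seqs]
    for _ in range(n_sweeps):
        for i, s in enumerate(seqs):
            pwm = estimate_pwm(seqs, starts, w, skip=i)
            starts[i] = sample_start(s, pwm, rng)
    return starts, estimate_pwm(seqs, starts, w)

# Toy input with the word "TTGACA" planted in each sequence (made-up data).
toy_seqs = ["ACGTTGACAGGC", "CCTTGACATACG", "GATTGACACCGA", "TTGACAGCGCAT"]
starts, pwm = gibbs_site_sampler(toy_seqs, w=6, n_sweeps=50, seed=0)
```

Because the sampler is stochastic, different seeds can settle on different alignments, which is one reason several runs from different starting points are usually compared.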
A numerical experiment
• Merge the 29 regulatory regions with a set of sequences randomly selected from ENSEMBL promoters.
• Test the ability of cisModule to identify motifs in a “noisy” environment.
Results
Limitations of CisModule
• The module length and the number of motifs must be provided externally.
• Convergence can be slow. Multiple runs are needed, each starting from a new seed.
• It assumes that combinations of different motifs are independent.
EMCModule
• Gupta and Liu (2005) developed a similar approach called EMCModule.
• Main differences:
– They use the collection of literature motifs as initial “seeds” for motif discovery.
– Their method improves the convergence speed.
– Their definition of CRMs is a little different: the number of motifs is fixed within one module, but the order of and distance between different motifs can vary.
Further issues
• A comparative genomic approach can also be incorporated into module discovery (Zhou and Wong 2007).
• The modules identified by these methods can be viewed as belonging to one “type”. New methods need to be developed to discover multiple module types.
• While a module-based approach is helpful for finding cooperative motifs, it may hurt the discovery of single motifs.
(Yuh et al. 1998)
Reading List
• Wasserman and Fickett (1998)
– LRA. One of the first works on cis-regulatory modules.
• Zhou and Wong (2004)
– cisModule. A statistical method to identify cis-regulatory modules without prior knowledge of the motifs.
• Yuh et al. (1998)
– An influential biological paper on how information from different modules can be integrated to regulate gene expression.