Cis-regulatory module 10/24/07. TFs often work synergistically (Harbison 2004)

Combinatorial control

Lysogenic growth vs. lytic growth of λ-phage in E. coli (source: Gary Kaiser)

[Figure: the λ-operon switch. The operator OR (sites OR1, OR2, OR3) sits between cI and cro. cI on, cro off: lysogenic growth. cI off, cro on: lytic growth. Panels labeled lysogenic and lytic show cI, cro, and Pol II.]

Cis-regulatory module (CRM)

• “A CRM is a DNA segment, typically a few hundred base pairs in length containing multiple binding sites, that recruits several cooperating factors to a particular genomic location” – Ji and Wong (2006)

Statistical Methods

• Predict modules when the motifs are known (simpler).
– LRA, by Wasserman and Fickett (1998)

• Predict modules when the motifs also need to be discovered (more difficult).
– CisModule, by Zhou and Wong (2004)
– EMCModule, by Gupta and Liu (2005)

LRA

Cooperative motifs:

Basic idea: True regulatory regions are likely to have multiple motif sites.

[Figure: plot with y-axis “Probability for being regulatory”]

LRA

• Training data contain a subset of known regulatory and control regions.

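$$\operatorname{logit}(p) = \log\frac{p}{1-p}$$

$$\operatorname{logit}(p) = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n$$

– $x_i$: highest motif matching score within a given sequence
– $\beta_i$: regression coefficient
– $p$: probability of being a regulatory region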
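The logistic model above can be turned into a probability with a few lines of Python. This is only a minimal sketch: the coefficients and motif scores below are made-up illustrative numbers, not values from Wasserman and Fickett, and the function names are hypothetical.

```python
import math

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

def regulatory_probability(scores, beta0, betas):
    """Invert logit(p) = beta0 + sum(beta_i * x_i) to recover p."""
    eta = beta0 + sum(b * x for b, x in zip(betas, scores))
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical intercept, coefficients, and scores (illustration only).
beta0 = -4.0
betas = [0.5, 0.3, 0.2]
scores = [6.0, 4.0, 2.0]  # one highest matching score per motif

p = regulatory_probability(scores, beta0, betas)
print(f"probability of being regulatory: {p:.3f}")
```

Applying `logit` to the returned `p` gives back the linear predictor, which is a quick consistency check on the two formulas.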

Application: skeletal-muscle gene regulation

• 5 muscle-specific TFs are known:
– Mef-2, Myf, SRF, Tef, Sp-1

• 29 regulatory regions are known.

• Can we predict the regulatory regions just from sequence motif information?

Computational Procedure

• Motif matrices are identified by Gibbs sampling using sequence information from the 29 regulatory regions.

• For some TFs, motifs cannot be found by the de novo approach; literature motifs are used instead.

• Top two matching scores for each TF are included as covariates.

• Apply LRA model. Use leave-one-out cross-validation to evaluate model performance.
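The last two steps of the procedure (top two matching scores as covariates, then leave-one-out cross-validation) might be sketched as follows. Everything here is an assumption made for illustration: a single hypothetical 4-bp PWM, six toy sequences, a uniform background, and a plain gradient-ascent logistic fit. It is not the original LRA implementation.

```python
import math

def window_scores(seq, pwm, background=0.25):
    """Log-odds score of every window of len(pwm) letters in seq."""
    w = len(pwm)
    return [
        sum(math.log(pwm[j][seq[i + j]] / background) for j in range(w))
        for i in range(len(seq) - w + 1)
    ]

def top_two(seq, pwm):
    """The two highest matching scores, used as covariates for one TF."""
    return sorted(window_scores(seq, pwm), reverse=True)[:2]

def predict(beta, x):
    """Logistic prediction; eta is clamped to avoid overflow in exp."""
    eta = beta[0] + sum(b * v for b, v in zip(beta[1:], x))
    eta = max(-30.0, min(30.0, eta))
    return 1.0 / (1.0 + math.exp(-eta))

def fit_logistic(X, y, steps=2000, lr=0.05):
    """Gradient ascent on the logistic log-likelihood (no regularization)."""
    beta = [0.0] * (len(X[0]) + 1)
    for _ in range(steps):
        grad = [0.0] * len(beta)
        for xi, yi in zip(X, y):
            err = yi - predict(beta, xi)
            grad[0] += err
            for k, v in enumerate(xi):
                grad[k + 1] += err * v
        beta = [b + lr * g / len(X) for b, g in zip(beta, grad)]
    return beta

def leave_one_out(X, y):
    """Fit on all regions but one, predict the held-out region, repeat."""
    preds = []
    for i in range(len(X)):
        beta = fit_logistic(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        preds.append(predict(beta, X[i]))
    return preds

# Hypothetical PWM favouring "TATA" and toy sequences (illustration only).
def column(best):
    return {a: (0.85 if a == best else 0.05) for a in "ACGT"}

pwm = [column(c) for c in "TATA"]
regulatory = ["GCTATACG", "TATAGGCTATA", "CCTATAGC"]
control = ["GCGCGCGC", "CCGGCCGG", "GGCCAGCC"]

X = [top_two(s, pwm) for s in regulatory + control]
y = [1] * len(regulatory) + [0] * len(control)
preds = leave_one_out(X, y)
print([round(p, 2) for p in preds])
```

With several TFs, each one would contribute its own pair of covariates to every row of `X`; the cross-validation loop stays the same.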

Results

• Single motifs are highly non-specific.

• Simple multi-site analysis improves specificity at the cost of reduced sensitivity.

• Logistic regression further improves specificity at a smaller cost in sensitivity.

Limitations of LRA

• Motifs must be known in advance.

• When known regulatory sequences are few, it is difficult to identify motifs by using traditional methods.

Objective:

• Integrating motif discovery and module finding in a single statistical model.

De novo module identification

Two tasks

• Identify TF motifs.

• Identify CRMs.

Why a module approach can help motif discovery

• Because short motifs have poor specificity, a short sequence can appear enriched simply by chance.

• The probability of random matches is much smaller for motif co-occurrence.
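A rough calculation makes the co-occurrence argument concrete. The sketch below assumes a uniform background, exact word matches, independent windows, and two independent motifs; the 6-bp width and 500-bp region length are hypothetical numbers chosen only for illustration.

```python
def p_at_least_one(width, length, p_letter=0.25):
    """Approximate chance that a fixed word of `width` letters occurs at
    least once in a random sequence of `length` letters.
    Windows are treated as independent, which is an approximation."""
    q = p_letter ** width          # chance that one window matches
    windows = length - width + 1   # number of candidate start positions
    return 1.0 - (1.0 - q) ** windows

# Hypothetical setting: 6-bp motifs in a 500-bp region.
single = p_at_least_one(6, 500)
pair = single ** 2  # two different 6-mers, assumed independent

print(f"one motif by chance:  {single:.4f}")
print(f"both motifs by chance: {pair:.4f}")
```

Under these assumptions, requiring both motifs in the same region squares the chance-match probability, which is why co-occurrence is far less likely to arise at random.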

cisModule

Basic idea:
• A two-level hierarchical mixture model (HMx).
– Level 1: modules within sequences
– Level 2: motifs within modules

(Zhou and Wong 2004)


HMx Model as a Stochastic Process

• Treat the HMx model as a stochastic machine that generates sequences.
– From the first sequence position, make a series of random decisions on whether to initiate a module of length l or generate a letter from the background model.
– Inside a module, if a site for the kth motif is initiated at position n, generate w_k letters from its PWM and place them at [n, n+w_k-1]; otherwise generate a letter from the background.
– After reaching the end of the current module, decide whether to sample from the background or initiate a new module.
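The generative view of the HMx model can be simulated directly. The sketch below is a simplified reading of that process, not the CisModule code: it picks a candidate motif uniformly at each position and uses a single site probability `p_site`, whereas the published model may parameterize each motif separately. The PWMs, probabilities, and function names are all hypothetical.

```python
import random

def sample_from(dist, rng):
    """Draw one letter from a dict mapping letter to probability."""
    letters = list(dist)
    return rng.choices(letters, weights=[dist[a] for a in letters])[0]

def generate_sequence(length, module_len, pwms, background,
                      p_module, p_site, rng):
    """Generate one sequence plus the module starts and motif sites used."""
    seq, modules, sites = [], [], []
    n = 0
    while n < length:
        # Level 1: initiate a module here, or emit one background letter.
        if n + module_len <= length and rng.random() < p_module:
            modules.append(n)
            end = n + module_len
            while n < end:
                # Level 2: initiate a site for motif k, or emit background.
                k = rng.randrange(len(pwms))
                w_k = len(pwms[k])
                if n + w_k <= end and rng.random() < p_site:
                    seq.extend(sample_from(col, rng) for col in pwms[k])
                    sites.append((k, n))
                    n += w_k
                else:
                    seq.append(sample_from(background, rng))
                    n += 1
        else:
            seq.append(sample_from(background, rng))
            n += 1
    return "".join(seq), modules, sites

# Hypothetical parameters (illustration only).
rng = random.Random(0)
background = {a: 0.25 for a in "ACGT"}
pwms = [
    [{"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05}] * 4,  # motif 0, w_0 = 4
    [{"A": 0.05, "C": 0.05, "G": 0.85, "T": 0.05}] * 5,  # motif 1, w_1 = 5
]
seq, modules, sites = generate_sequence(200, 30, pwms, background,
                                        p_module=0.02, p_site=0.15, rng=rng)
print(seq)
print("module starts:", modules)
print("sites (motif, position):", sites)
```

Inference runs this machine in reverse: the sequences are observed, and the module starts, site positions, and PWMs are the unknowns.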


Model inference: Gibbs sampling

Alternate between two steps:
• Given the alignment, update the model parameters.
• Given the model parameters, update the module/motif locations.
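The two alternating steps can be illustrated with a much simpler model than HMx: one motif, exactly one site per sequence, and a uniform background. This is a toy site sampler written for illustration, not the CisModule sampler. As a further simplification, the parameter step uses a pseudocount-smoothed point estimate instead of drawing the PWM from its posterior. The sequences and names are hypothetical.

```python
import random

ALPHABET = "ACGT"

def estimate_pwm(seqs, positions, width, pseudocount=1.0):
    """Step 1: given the alignment (one site per sequence), update the PWM."""
    pwm = []
    for j in range(width):
        counts = {a: pseudocount for a in ALPHABET}
        for s, pos in zip(seqs, positions):
            counts[s[pos + j]] += 1
        total = sum(counts.values())
        pwm.append({a: c / total for a, c in counts.items()})
    return pwm

def sample_positions(seqs, pwm, rng, background=0.25):
    """Step 2: given the PWM, resample each site location with probability
    proportional to its motif-versus-background likelihood ratio."""
    width = len(pwm)
    positions = []
    for s in seqs:
        weights = []
        for i in range(len(s) - width + 1):
            ratio = 1.0
            for j in range(width):
                ratio *= pwm[j][s[i + j]] / background
            weights.append(ratio)
        positions.append(rng.choices(range(len(weights)), weights=weights)[0])
    return positions

def gibbs(seqs, width, sweeps, rng):
    """Alternate the two steps, starting from random site positions."""
    positions = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(sweeps):
        pwm = estimate_pwm(seqs, positions, width)
        positions = sample_positions(seqs, pwm, rng)
    return positions, estimate_pwm(seqs, positions, width)

# Toy sequences, each containing the word "ACGTAC" (illustration only).
seqs = ["TTGACGTACTT", "ACGTACGGGTT", "GGTTACGTACA", "CCACGTACCGG"]
positions, pwm = gibbs(seqs, width=6, sweeps=200, rng=random.Random(1))
print(positions)
```

Because the sampler is stochastic, a single run can settle on a shifted or spurious alignment, which is one reason several runs from different seeds are usually compared.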

A numerical experiment

• Merge the 29 regulatory regions with a set of sequences randomly selected from ENSEMBL promoters.

• Test the ability of cisModule to identify motifs in a “noisy” environment.

Limitations of CisModule

• The module length and the number of motifs must be provided externally.

• Convergence can be slow. Multiple runs are needed, each starting from a new seed.

• The model assumes that different motifs occur independently of one another.

EMCModule

• Gupta and Liu (2005) developed a similar approach called EMCModule.

• Main differences:
– They use the collection of literature motifs as initial “seeds” for motif discovery.
– Their method improves the convergence speed.
– Their definition of a CRM is a little different: the number of motifs is fixed within one module, but the order of and distance between different motifs can vary.

Further issues

• A comparative genomics approach can also be incorporated into module discovery (Zhou and Wong 2007).

• The modules identified by these methods can be viewed as belonging to one “type”. New methods need to be developed to discover multiple module types.

• While a module-based approach is helpful for finding cooperative motifs, it may hurt the discovery of single motifs.

(Yuh et al. 1998)

Reading List

• Wasserman and Fickett (1998)
– LRA. One of the first works on cis-regulatory modules.

• Zhou and Wong (2004)
– cisModule. A statistical method to identify cis-regulatory modules without prior knowledge of the motifs.

• Yuh et al. (1998)
– An influential biological paper on how information can be integrated from different modules to regulate gene expression.
