Genome Evolution. Amos Tanay 2012
Beyond Protein Coding Sequences
Non-coding fraction of the genome:
• E. coli: 12%
• Yeast: 27%
• Fly: 76%
• Human: 97.6%
How can the biological functions of non-coding sequence be defined?
Sequence specific transcription factors
• Sequence specific transcription factors (TFs) are a critical part of any gene activation or gene repression machinery
• TFs include a DNA binding domain that specifically recognizes “regulatory elements” in the genome
• The TF-DNA duplex is then used to target larger transcriptional structures to the genomic locus
Lactose Repressor
Sequence specificity is represented using consensus sequences or weight matrices
• The specificity of TF binding is central to understanding the regulatory relations it can form.
• We are therefore interested in defining the DNA motifs that can be recognized by each TF.
• A simple representation of the binding motif is the consensus site, usually derived by studying a set of confirmed TF targets and identifying a (partial) consensus. Degeneracy can be introduced into the consensus by using N letters (matching any nucleotide) or IUPAC characters (representing pairs of nucleotides, for example W=[A|T], S=[C|G])
• A more flexible representation uses weight matrices (PWM/PSSM):
• PWMs are frequently plotted as motif logos, in which the height of each character corresponds to its probability, scaled by the position entropy
ACGCGT  ACGCGA  ACGCAT  TCGCGA  TAGCGT

       1      2      3      4      5      6
A     60%    20%     0      0     20%    40%
C      0     80%     0    100%     0      0
G      0      0    100%     0     80%     0
T     40%     0      0      0      0     60%
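The counting behind such a matrix can be sketched in a few lines. This is a minimal illustration, assuming the example string on the slide is read as five aligned 6-mers; the names `SITES`, `build_pwm` and `information_content` are hypothetical, and the 2-bit information measure is the usual logo convention rather than something stated on the slide.

```python
import math

# Assumption: the slide's example is five aligned 6-mers.
SITES = ["ACGCGT", "ACGCGA", "ACGCAT", "TCGCGA", "TAGCGT"]
ALPHABET = "ACGT"


def build_pwm(sites):
    """Return one {nucleotide: frequency} dict per aligned position."""
    length = len(sites[0])
    pwm = []
    for i in range(length):
        column = [site[i] for site in sites]
        pwm.append({nt: column.count(nt) / len(sites) for nt in ALPHABET})
    return pwm


def information_content(column):
    """Bits at one position: 2 minus the Shannon entropy of the column."""
    entropy = -sum(p * math.log2(p) for p in column.values() if p > 0)
    return 2.0 - entropy


pwm = build_pwm(SITES)
for position, column in enumerate(pwm, start=1):
    print(position, column, round(information_content(column), 2))
```

Fully conserved positions (3 and 4 here) carry the maximal 2 bits, which is why they dominate a motif logo.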
In vitro TF binding energy is approximated by weight matrices
Yeast Leu3 data (Liu and Clarke, JMB 2002)
We can interpret weight matrices as energy functions:
$w_i[s_i] = \log(p_i[s_i])$
$E(s) = \sum_i w_i[s_i]$
This linear approximation is reasonable for most TFs.
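As an illustration of the linear energy model, the sketch below scores sequences with log-probability weights and scans a longer sequence for its best window. The matrix values are the example frequencies from the weight-matrix slide; the pseudocount `PSEUDO` and the helper names are assumptions added here so that zero frequencies do not produce log(0).

```python
import math

# Example position frequencies (six columns of the earlier example matrix).
PWM = [
    {"A": 0.6, "C": 0.0, "G": 0.0, "T": 0.4},
    {"A": 0.2, "C": 0.8, "G": 0.0, "T": 0.0},
    {"A": 0.0, "C": 0.0, "G": 1.0, "T": 0.0},
    {"A": 0.0, "C": 1.0, "G": 0.0, "T": 0.0},
    {"A": 0.2, "C": 0.0, "G": 0.8, "T": 0.0},
    {"A": 0.4, "C": 0.0, "G": 0.0, "T": 0.6},
]
PSEUDO = 0.01  # assumption: small pseudocount, not taken from the slides


def weights(pwm, pseudo=PSEUDO):
    """w_i[c] = log(p_i[c]), with pseudocount-smoothed probabilities."""
    result = []
    for column in pwm:
        total = sum(column.values()) + len(column) * pseudo
        result.append({nt: math.log((p + pseudo) / total)
                       for nt, p in column.items()})
    return result


def energy(site, w):
    """E(s) = sum_i w_i[s_i] for a site as long as the matrix."""
    return sum(w[i][nt] for i, nt in enumerate(site))


def best_site(sequence, w):
    """Return (energy, offset) of the best-scoring window in a sequence."""
    k = len(w)
    return max((energy(sequence[i:i + k], w), i)
               for i in range(len(sequence) - k + 1))


W = weights(PWM)
print(energy("ACGCGT", W), energy("TTTTTT", W))
print(best_site("GGGACGCGTGGG", W))
```

With this sign convention a higher (less negative) score means a better match to the motif.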
In-vivo TF binding affinity is approximated by weight matrices
[Figure: Ume6. Average PWM energy (5.5 to 11.5) across ChIP ranges; stronger binding goes with stronger prediction. Tanay, Genome Res 2006]
[Figure: Chromatin ImmunoPrecipitation (ChIP): cross-link and shear, then immunoprecipitation]
TF binding affinity is kinetically important, with possible functional implications
Kalir et al. Science 2001
TFs are present at only a fraction of their optimal sequence targets. Binding is regulated by co-factors, nucleosomes and histone modifications
(Heintzman et al., Nature Genetics 2007)
Genome Evolution. Amos Tanay 2012
TFs are present at only a fraction of their optimal sequence targets. Binding is regulated by co-factors, nucleosomes and histone modifications
Heinzman et al. Nature Genetics, 2007)
Genome Evolution. Amos Tanay 2012
Specific proteins identify enhancers
Here are studies of p300 binding in the developing mouse brain (Visel et al., Nature 2009)
TFBSs are clustered in promoters or in “sequence modules”
• The distribution of binding sites in the genome is non-uniform
• In small genomes, most sites are in promoters, and there is a bias toward the nucleosome-free region near the TSS
• In larger genomes (fly) we observe CRMs (cis-regulatory modules), which are frequently far from the TSS. These represent enhancers.
• A single binding site, without the context of other co-sites, is unlikely to represent a functional locus
Discriminative scores for motifs
• So far we used a generative probabilistic model to learn PWMs
• The model was designed to generate the data from parameters
• We assumed that TFBSs are distributed differently than some fixed background model
• If our background model is wrong, we will get the wrong motifs
• A different scoring approach tries to maximize the discriminative power of the motif model.
• We will not go into the details of discriminative vs. generative models here, but we shall exemplify the discriminative approach for PWMs.
[Figure panels: lousy discriminator; high specificity discriminator; high sensitivity discriminator]
Hypergeometric scores and thresholding PWMs
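$P(|A \cap B| = k) = \frac{\binom{|A|}{k}\binom{n-|A|}{|B|-k}}{\binom{n}{|B|}}$
(n: total number of sequences; A and B: the positive set and the set of sequences passing the PWM threshold)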
[Figure: number of sequences versus PWM score, with the PWM score threshold marked; labels: positive, true positive]
For a discriminative score, we need to decide on both the PWM model and the threshold.
Hypergeometric probability (the sum over j >= k is the hypergeometric p-value)
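The hypergeometric score and the threshold choice can be sketched together. This is an illustrative sketch, not the procedure used in the lecture: the function names are hypothetical, and the exhaustive scan over observed scores is one simple way to pick the threshold.

```python
from math import comb


def hg_prob(n, size_a, size_b, k):
    """P(|A and B| = k) for sets of size |A| and |B| among n sequences."""
    return comb(size_a, k) * comb(n - size_a, size_b - k) / comb(n, size_b)


def hg_pvalue(n, size_a, size_b, k):
    """Probability of an overlap of at least k (sum over j >= k)."""
    upper = min(size_a, size_b)
    return sum(hg_prob(n, size_a, size_b, j) for j in range(k, upper + 1))


def best_threshold(scores, is_positive):
    """Scan every observed PWM score as a threshold.

    Returns (p_value, threshold) for the threshold whose passing set is
    most enriched for positives.
    """
    n = len(scores)
    n_positive = sum(is_positive)
    best = None
    for threshold in sorted(set(scores)):
        passing = [pos for s, pos in zip(scores, is_positive) if s >= threshold]
        true_positive = sum(passing)
        p = hg_pvalue(n, n_positive, len(passing), true_positive)
        if best is None or p < best[0]:
            best = (p, threshold)
    return best


example_scores = [1, 2, 3, 4, 5, 6]
example_labels = [False, False, False, True, True, True]
print(best_threshold(example_scores, example_labels))
```

Because many thresholds are tried, the minimal p-value found this way is optimistic and would need a multiple-testing correction before being interpreted.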
Constructing a weight matrix from aligned TFBSs is trivial
• This is done by counting (or “voting”)
• Several databases (e.g., TRANSFAC, JASPAR) contain matrices that were constructed from sets of curated and validated binding sites
• Validated site: usually using “promoter bashing”, i.e., testing reporter constructs with and without the putative site
TRANSFAC 7.0/11.3 have 400/830 different PWMs, based on more than 11,000 papers
However, there are not really 830 distinct matrices out there; the real binding repertoire in nature is still somewhat unclear
High density arrays quantify TF binding preferences and identify binding sites in high throughput
• Using microarrays (high resolution tiling arrays) we can now map binding sites in a genome-wide fashion for any genome
• The problem is shifting from identifying binding sites to understanding their function and determining how sequences define them
Harbison et al., Nature 2004
Direct measurements of the in-vitro binding affinity of 8-mers and DNA binding domains (here just a library of homeodomains, from Berger et al. 2008)
Profiling binding affinity to the entire k-mer spectrum provides direct quantification of in-vitro affinity (Badis et al., 2009)
Heatmap of 2D hierarchical agglomerative clustering analysis of 4740 ungapped 8-mers over 104 nonredundant TFs, with both 8-mers and proteins clustered using the averaged E-score from the two different array designs.
[Figure axes: 8-mers; 104 TFs]
What kind of biological function is naturally selected?
Discrete and deterministic “binding sites” in yeast as identified by Young, Fraenkel and colleagues
In fact, binding is rarely deterministic and discrete, and simple wiring is something you should treat with extreme caution.
The Halpern-Bruno model for selection on affinity
According to Kimura’s theory, an allele with fitness s in a homogeneous population would fixate with probability:

$\frac{1 - e^{-2s}}{1 - e^{-2Ns}}$

Assuming a slow mutation rate (which allows us to assume a homogeneous population) and motifs a and b with relative fitness s, the fixation probabilities (chance of fixation given that a mutation occurred!) are:

$f_{ab} = \frac{1 - e^{-2s}}{1 - e^{-2Ns}} \approx \frac{2s}{1 - e^{-2Ns}}$
$f_{ba} = \frac{1 - e^{2s}}{1 - e^{2Ns}} \approx \frac{-2s}{1 - e^{2Ns}}$
$f_{ab} / f_{ba} = \frac{2s}{1 - e^{-2Ns}} \cdot \frac{1 - e^{2Ns}}{-2s} = e^{2Ns}$

If p represents the mutation probability and π the stationary distribution, and if we assume the process as a whole is reversible, then:

$\pi_a p_{ab} f_{ab} = \pi_b p_{ba} f_{ba} \;\Rightarrow\; \frac{f_{ab}}{f_{ba}} = \frac{\pi_b p_{ba}}{\pi_a p_{ab}} = e^{2Ns}$

$f_{ab} \propto \frac{\ln \frac{\pi_b p_{ba}}{\pi_a p_{ab}}}{1 - \frac{\pi_a p_{ab}}{\pi_b p_{ba}}}$

We work on deriving the substitution rate at each position of the binding site, given its observed stationary frequency. We are assuming that the fitness of the site is defined by multiplying the fitness values of all loci. This means fitness is generally linear in the binding energy!

$r_{ab} = c \, p_{ab} \, \frac{\ln \frac{\pi_b p_{ba}}{\pi_a p_{ab}}}{1 - \frac{\pi_a p_{ab}}{\pi_b p_{ba}}}$
(Halpern and Bruno, MBE 1998)
1,1 ssfitness
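A small numerical sketch of the two quantities on this slide may help: Kimura's fixation probability and the Halpern-Bruno substitution rate. The function names and the constant `c` defaulting to 1 are assumptions for illustration; the neutral branches are the analytic limits of the formulas (1/N for fixation, c·p_ab for the rate).

```python
import math


def fixation_probability(s, n):
    """Kimura: (1 - exp(-2s)) / (1 - exp(-2Ns)) for selection coefficient s."""
    if s == 0:
        return 1.0 / n  # neutral limit of the formula
    return (1 - math.exp(-2 * s)) / (1 - math.exp(-2 * n * s))


def hb_rate(pi_a, pi_b, p_ab, p_ba, c=1.0):
    """Halpern-Bruno rate r_ab from stationary frequencies and mutation rates.

    x = (pi_b * p_ba) / (pi_a * p_ab) plays the role of exp(2Ns).
    """
    x = (pi_b * p_ba) / (pi_a * p_ab)
    if math.isclose(x, 1.0):
        return c * p_ab  # neutral limit: substitution rate equals mutation rate
    return c * p_ab * math.log(x) / (1 - 1 / x)


# Example: b is four times more frequent than a at stationarity,
# with symmetric mutation, so a -> b is favored and b -> a is suppressed.
r_ab = hb_rate(pi_a=0.2, pi_b=0.8, p_ab=0.01, p_ba=0.01)
r_ba = hb_rate(pi_a=0.8, pi_b=0.2, p_ab=0.01, p_ba=0.01)
print(r_ab, r_ba)
print(0.2 * r_ab, 0.8 * r_ba)  # detailed balance: the two fluxes agree
```

The last line checks reversibility: the flux π_a·r_ab equals π_b·r_ba, which is the assumption the derivation started from.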
The Halpern-Bruno model for selection on affinity
Moses et al., 2003
The HB model is limited for the study of general sequences. When restricting the analysis to relatively specific sites, HB is not completely off
Testing the general binding energy – fitness correspondence
• While E(S) is approximated by a PWM, F(E) is unlikely to be linear
• Assume that the background probability of a motif a is P_0(a). In detailed balance, and assuming the fitness of a at functional sites is F(a), the stationary distribution at sites can be shown to be (up to normalization):
$Q(a) = P_0(a) \, e^{2NF(a)}$
(Mustonen and Lassig, PNAS 2005)
• If we collapse all sites with binding energy E (and hence the same F(a)=F(E(a))):
$Q(E) = P_0(E) \, e^{2NF(E)}$
• The entire genome should behave like a mixture of background sequence and functional loci (λ: fraction of functional loci):
$W(E) = (1-\lambda) P_0(E) + \lambda Q(E)$
• So we can try to recover Q(E), and therefore F(E), from the maximum likelihood parameters fitting an empirical W(E)
[Figure: expected and observed energy distribution in E. coli CRP sites (left) and background (right); the inferred F(E) is shown in orange]
[Figure: comparison of CRP energies in E. coli and S. typhimurium]
(Hwa and Gerland, 2000-)
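The relation between W(E), P_0(E), Q(E) and F(E) can be illustrated on binned energies. This is a simplified sketch under strong assumptions: the mixture weight `lam` is taken as known rather than fitted by maximum likelihood, the energy bins and fitness values are synthetic, and all function names are hypothetical.

```python
import math


def functional_distribution(p0, two_nf):
    """Q(E) proportional to P0(E) * exp(2N F(E)), normalized over the bins."""
    raw = [p * math.exp(f) for p, f in zip(p0, two_nf)]
    z = sum(raw)
    return [r / z for r in raw]


def mixture(p0, q, lam):
    """W(E) = (1 - lam) * P0(E) + lam * Q(E)."""
    return [(1 - lam) * a + lam * b for a, b in zip(p0, q)]


def infer_two_nf(w, p0, lam):
    """Recover 2N*F(E), up to an additive constant, from W(E) and P0(E)."""
    q = [(wi - (1 - lam) * pi) / lam for wi, pi in zip(w, p0)]
    return [math.log(qi / pi) for qi, pi in zip(q, p0)]


# Synthetic example: five energy bins, fitness decreasing with energy bin.
P0 = [0.1, 0.2, 0.4, 0.2, 0.1]
TRUE_2NF = [4.0, 3.0, 2.0, 1.0, 0.0]
LAM = 0.3  # assumption: known fraction of functional loci

Q = functional_distribution(P0, TRUE_2NF)
W_OBS = mixture(P0, Q, LAM)
recovered = infer_two_nf(W_OBS, P0, LAM)
print([round(x - recovered[-1], 6) for x in recovered])
```

Only differences in F(E) are identifiable this way, since the normalization of Q(E) absorbs any constant shift; with real data the subtraction can also go negative in noisy bins, which is one reason a likelihood fit is preferable.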
TFBS evolution: purifying selection and conservation
[Figure: possible fates of a substitution in a TF1 binding site. Similar function: neutral evolution. Disrupted function: low rate, purifying selection. Altered function (TF1 to TF2): low rate, purifying selection. Altered affinity: rate? selection? Example site pairs shown: CACGCGTA / CACGCGTT, CACGAGTT / CACGCGTT, CACACGTT / CACGCGTT (altered affinity)]
Binding sites conservation: heuristic motif identification
Kellis et al., 2003
Analyzing k-mer evolutionary dynamics
• Instead of trying to identify conserved motifs, try to infer the evolutionary rate of substitution between pairs of k-mers
• Start from a multiple alignment and reconstruct ancestral sequences (assuming site independence, or even max parsimony)
• Now estimate the number of substitutions between pairs of 8-mers, and compare this number to the number expected by the background model
• Do it for a lot of sequence, so that statistics on the difference between observed and expected substitutions can be derived
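The counting step in these bullets can be sketched as follows, assuming the ancestral sequence has already been reconstructed and is aligned without gaps to a descendant. The background model used here, a single per-base substitution rate spread evenly over the three alternative nucleotides, is a placeholder assumption rather than the model used in the lecture, and the function names are hypothetical.

```python
from collections import Counter


def kmer_substitutions(ancestor, descendant, k=8):
    """Count (ancestral k-mer, descendant k-mer) pairs along an ungapped alignment."""
    if len(ancestor) != len(descendant):
        raise ValueError("sequences must be aligned and of equal length")
    pairs = Counter()
    for i in range(len(ancestor) - k + 1):
        pairs[(ancestor[i:i + k], descendant[i:i + k])] += 1
    return pairs


def compare_to_background(pairs, rate):
    """Return {(a, b): (observed, expected)} for every changed pair a != b.

    Expected counts assume each base changes independently with probability
    `rate`, choosing uniformly among the three other nucleotides.
    """
    totals = Counter()
    for (a, _), count in pairs.items():
        totals[a] += count
    result = {}
    for (a, b), observed in pairs.items():
        if a == b:
            continue
        distance = sum(x != y for x, y in zip(a, b))
        expected = (totals[a] * (rate / 3) ** distance
                    * (1 - rate) ** (len(a) - distance))
        result[(a, b)] = (observed, expected)
    return result


toy_pairs = kmer_substitutions("AAAA", "AAAT", k=2)
print(toy_pairs)
print(compare_to_background(toy_pairs, rate=0.3))
```

On real data the pairs would be accumulated over many aligned regions, so that the observed and expected counts for each k-mer pair are large enough to compare statistically.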
Saccharomyces TFBS Selection Network
[Figure legend: nodes are octamers, marked by conservation (conserved @ 2SD, conserved @ 3SD, otherwise). Arcs are 1nt substitutions, marked by rate and selection (normal rate: neutral; low rate: negative selection; not enough statistics)]
Inter-island organization in the Reb1 cluster: selection hints toward multi modality of Reb1
Tanay et al., 2004
Leu3 selection network
[Figure: substitution rate (0 to 0.3) versus log delta affinity (-5 to 3). Legend: High Affinity (Kd < 60); Medium Affinity (400 > Kd > 60). Annotations: high rate substitutions; substitutions changing high affinity to high affinity motifs; substitutions changing high affinity to low affinity motifs]
A simple transcriptional code and its evolutionary implications
[Figure: groups of related hexamers assigned to TF1 through TF5, plus “all the rest”. Groups shown: AAATTT, AATTTT, AAAATT; GATGAG, GATGCG, GATGAT; CACGTG, CACTTG; ACGCGT, TCGCGT, ACGCGT; TGACTG, TGAGTG, TGACTT]
The Halpern-Bruno model for selection on affinity
The basic notion here is of the relations between sequence, binding and function/fitness
Sequence --E(S)--> Binding energy --F(E)--> Function
We argued that E(S) can be approximated by a PWM
F(E) is a completely different story, for example:
• Is there any function at all to low affinity binding sites?
• Is there a difference between very high affinity and plain strong binding sites?
• Are all appearances of the site subject to the same fitness landscape?
More tests for possible conservation of low binding energy sites
[Figure: ΔE between S. cerevisiae and S. mikatae for high affinity and low affinity sites, compared with a simulation (neutral, context aware); KS statistics, axis ticks 0 to 1 and 0 to 0.5]
More tests for possible conservation of low binding energy sites
Tanay, GR 2006
[Figure: conservation score versus binding energy percentile (0 to 100) for Reb1, Ume6, Cbf1, Gcn4 and Mbp1. Panel groups: binding site conservation; conservation of total energy]
Evolutionary dynamics of transcription factor binding (mammals)
Schmidt et al., Science 2010
Shared binding loci: 4%
Evolutionary dynamics of CTCF binding (mammals)
Schmidt et al., Cell 2012
Shared binding loci: 24%