Casey M. Bergman
Faculty of Life Sciences, University of Manchester
Inferring functional constraints on Drosophila noncoding DNA from patterns of sequence evolution.
Outline of Talk
• Noncoding DNA, cis-regulatory annotation and Drosophila as a system
• Conserved noncoding sequences are selectively constrained.
• Spatial constraints on noncoding sequences
Higher organisms have a higher proportion of noncoding DNA
Bacteria   15%
Yeast      30%
Worm       70%
Fly        75%
The function of most noncoding DNA is unknown & unannotated
Bioinformatic & functional analysis of noncoding DNA ⇒
Genome organization
Transcriptional regulation
[Figure: gene map of a Drosophila genomic region. Legend: = Exon. Labeled genes: Mef2, CG15863, CG12130, CG1418, CG12133, Adam, CG12134, eve, TER94, Pka-R2, CG12128. Other labeled features: enhancers, transposable elements (BS, 1360), (A)n.]
Goal: comprehensive functional annotation of noncoding sequences in Drosophila
Why is annotation of cis-regulatory sequences important?
• Better understand development
• Better understand mechanisms of transcription
• Provide material for forward genetics
• Provide material for evolutionary biology
• Generate data for systems biology
Why Drosophila as a model system?
~120 Mb of euchromatin
~15,000 genes
75% noncoding
Compact, deletion bias
“Pseudogenes” decay rapidly by deletion in Drosophila
Petrov and Hartl (1998) Mol. Biol. Evol. 15:293-302
Genes with complex expression have longer intergenic regions in compact genomes
Nelson, Hersh & Carroll (2004) Genome Biology 5:R25
Longer introns & intergenic regions have slower rates of sequence evolution in Drosophila
Halligan & Keightley (2006) Genome Research 16:875-884
A wealth of comparative genomic data exists for the genus Drosophila
http://species.flybase.net
http://rana.lbl.gov/drosophila
image from Pavel Tomancak (MPI-Dresden)
Thousands of candidate expression patterns: BDGP embryonic in situ database
http://www.fruitfly.org/cgi-bin/ex/insitu.pl
Systematic annotation of cis-regulatory data in Drosophila: FlyReg & REDfly databases
Bergman et al. (2005) Bioinformatics 21:1747-1749
Gallo et al. (2006) Bioinformatics 22:381-383
[Figure: genome browser view of chr2R:5,485,000-5,500,000. Tracks: FlyBase Protein-Coding Genes (CG12134, eve, TER94); FlyReg: Drosophila DNase I Footprint Database (footprints labeled kni, hb, Kr, bcd, gt, ttk, prd, eve, pan, Med, tin, zfh1, and Unspecified); Regulatory elements from ORegAnno (OREG0005967 through OREG0005985).]
ORegAnno: Open Regulatory Annotation
Montgomery et al. (2006) Bioinformatics 22:637-640
cis-regulatory annotation & systems biology
[Figure: network diagram with nodes labeled shn, Abd-A, fkh, ko, Dll, dpp, mus209, tsh, bcd, salm, Antp, dl, Ubx, zen, kni, ftz, eve, hb, tll, Kr, Trl, grh, cad, h, en, gt, ttk.]
Ashburner & Bergman (2005) Genome Research 15:1661-1667
Outline of Talk
• Noncoding DNA, cis-regulatory annotation and Drosophila as a system
• Conserved noncoding sequences are selectively constrained.
• Spatial constraints on noncoding sequences
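Pattern of noncoding sequence evolution in Drosophila: the eve stripe 2 enhancer
[Figure: comparison of the eve stripe 2 enhancer in mel against sim, yak, ere, tak, ana and pse, with conserved blocks and spacers marked; scale bar: 500 bp.]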
Are conserved blocks functionally constrained or simply mutational cold spots?
Bergman & Kreitman (2001) Genome Research 11:1335-1345
Clark (2001) Genome Research 11:1319-1320
median: 19 bp
Using population genetics to test the mutational cold-spot hypothesis
If blocks are functionally constrained we predict the following:
1. Excess of rare derived mutations in blocks relative to spacers
(Non-parametric test - blocks vs. spacers, frequency spectrum)
2. Excess of polymorphic mutations in blocks relative to fixed differences
(“MK” test - blocks vs. spacers, polymorphism & divergence)
[Figure: schematic frequency spectra for blocks and spacers. x-axis: Derived Allele Frequency, bins 0-0.1 through 0.9-1.0; y-axis: Fraction of SNPs.]
Conserved blocks in humans are not mutational cold-spots
Drake et al. (2005) Nat. Genet. 38:223-7
[Figure: block vs. spacer comparison in three human populations: Yoruba (African), American (European), Beijing+Tokyo (Asian).]
Using population genetics to test the mutational cold-spot hypothesis
[Figure: schematic of polymorphism (π) and divergence (div.) in blocks and spacers, illustrating prediction 2 (“MK” test - blocks vs. spacers, polymorphism & divergence).]
Harvesting data from GenBank using PDA: a pipeline to study polymorphism
Casillas & Barbadilla (2004) Nucl. Acids Res. 32:W166-W169
[Figure: flowchart of the PDA pipeline, steps numbered 1a to 9.
Input: sequences from GenBank corresponding to the Drosophila genus; minimum of 2 sequences per category; MSA parameters.
Steps shown: get sequences & annotations (gene, CDS, exon, intron, 5' UTR, 3' UTR, promoter); group by species & gene (sequences organized in categories); Muscle (alignments with scores); alignment validation (alignment quality values, low quality sequences excluded); sequence subgroups and alignment subgroups; read gene annotations; extract gene regions (sequences, positions and orientations); Diversity Analysis Module (polymorphism, syn & non-syn polymorphisms, linkage disequilibrium, codon bias).
Storage: MySQL database.
Output: web-based output, alignments (Jalview), diversity analysis.
Legend: seq. manipulations, external programs, output.]
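The grouping step of the pipeline ("Group by species & gene", "Minimum of 2 sequences per category") can be sketched as follows. This is a minimal illustration and not PDA's actual code: the record fields, the accession strings and the function name `group_sequences` are all hypothetical.

```python
from collections import defaultdict

def group_sequences(records, min_per_group=2):
    """Group sequence records by (species, gene, category).

    Groups with fewer than `min_per_group` sequences are dropped,
    since polymorphism cannot be estimated from a single sequence.
    """
    groups = defaultdict(list)
    for rec in records:
        key = (rec["species"], rec["gene"], rec["category"])
        groups[key].append(rec["accession"])
    return {key: accs for key, accs in groups.items() if len(accs) >= min_per_group}

# Hypothetical records; the accessions are placeholders, not real GenBank IDs.
records = [
    {"accession": "acc1", "species": "D. melanogaster", "gene": "eve", "category": "intron"},
    {"accession": "acc2", "species": "D. melanogaster", "gene": "eve", "category": "intron"},
    {"accession": "acc3", "species": "D. simulans", "gene": "eve", "category": "intron"},
]
usable = group_sequences(records)
```

Here only the D. melanogaster group survives, because the D. simulans group contains a single sequence.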
Highly conserved noncoding sequences - the UCSC PhastCons track
[Figure: UCSC Genome Browser view of chr2R:5,485,000-5,500,000. Tracks: FlyBase Protein-Coding Genes (CG12134, eve, TER94); 12 Flies, Mosquito, Honeybee, Beetle Multiz Alignments & phastCons Scores (Conservation; d_simulans, d_sechellia, d_yakuba, d_erecta, d_ananassae, d_pseudoobscura, d_persimilis, d_willistoni, d_virilis, d_mojavensis, d_grimshawi); PhastCons Conserved Elements (12 Flies, Mosquito, Honeybee, Beetle), with element scores from lod=10 to lod=1301.]
The data: alignments of ~12 D. melanogaster alleles with reference sequence and 1 D. simulans allele
High frequency derived spacer allele
Low frequency derived block alleles
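How a single alignment column might be classified from such data can be sketched as below. This is an illustrative assumption about the procedure, not the authors' code: the D. simulans allele is treated as ancestral, and the function name `classify_site` and the example columns are hypothetical.

```python
from collections import Counter

def classify_site(mel_alleles, sim_allele):
    """Classify one aligned column and return (category, derived allele frequency).

    mel_alleles: bases observed in the D. melanogaster sample at this site.
    sim_allele:  base in the single D. simulans sequence, assumed ancestral.
    """
    counts = Counter(mel_alleles)
    n = len(mel_alleles)
    if len(counts) == 1:
        (allele,) = counts
        if allele == sim_allele:
            return ("invariant", None)
        return ("fixed_difference", None)
    if len(counts) == 2 and sim_allele in counts:
        # The allele that differs from the outgroup is taken as derived.
        derived = next(a for a in counts if a != sim_allele)
        return ("polymorphism", counts[derived] / n)
    # More than two alleles, or the outgroup matches neither: cannot polarize.
    return ("unpolarized", None)

# Hypothetical columns with 12 D. melanogaster alleles each.
rare_derived = classify_site("AAAAAAAAAAAG", "A")
fixed = classify_site("TTTTTTTTTTTT", "C")
```

A site with one derived copy out of 12 has a derived allele frequency of 1/12, the lowest class in the frequency spectrum.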
Summary of the polymorphism data sets

              Glinka (2003) + Ometto (2005)   Glinka (2003) + Ometto (2005)   Orengo (2004)
              African                         European                        European
Intronic      167                             173                             28
Intergenic    90                              93                              80
Total loci    257                             266                             108
# Alleles     11.7                            11.8                            12.7
bp block      30,683                          33,292                          28,721
bp spacer     79,317                          87,379                          47,590
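Single nucleotide polymorphisms & fixed differences are reduced in conserved blocks
66% reduction in polymorphism
77% reduction in divergence
[Figure: bar chart of the observed number of polymorphisms and fixed differences (divergence) in blocks vs. spacers; y-axis: observed number, 0 to 5,000. Bar values: polymorphism 3334 (spacer), 437 (block); divergence 4854 (spacer), 374 (block).]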
Casillas, Barbadilla & Bergman (2007) Mol. Biol. Evol. 24:2222-2234
Conserved blocks have an excess of rare derived point mutations
[Figure: histogram of derived allele frequency for block and spacer SNPs. x-axis: derived allele frequency (DAF), bins 0.1 to 1.0; y-axis: frequency, 0 to 0.65.]
KS test: p < 6×10^-11
Casillas, Barbadilla & Bergman (2007) Mol. Biol. Evol. 24:2222-2234
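The statistic behind a two-sample Kolmogorov-Smirnov comparison of frequency spectra can be sketched as follows. The derived allele frequencies below are invented for illustration and are not the data behind the slide; the function name `ks_statistic` is hypothetical, and no p-value is computed because allele frequencies are discrete and heavily tied, which makes the standard KS p-value only approximate.

```python
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    largest_gap = 0.0
    for value in sorted(set(a) | set(b)):
        cdf_a = bisect_right(a, value) / len(a)  # fraction of a <= value
        cdf_b = bisect_right(b, value) / len(b)  # fraction of b <= value
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap

# Hypothetical derived allele frequencies (sample size 12), skewed toward
# rare alleles in blocks.
block_daf = [1 / 12, 1 / 12, 1 / 12, 2 / 12, 6 / 12]
spacer_daf = [1 / 12, 3 / 12, 6 / 12, 9 / 12, 11 / 12]
d_stat = ks_statistic(block_daf, spacer_daf)
```

A large statistic means the block spectrum accumulates its mass at lower frequencies than the spacer spectrum, which is the pattern prediction 1 describes.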
Excess of polymorphism in conserved blocks relative to fixed differences between species
[Figure: bar chart of the polymorphism : divergence ratio in blocks vs. spacers; y-axis 0 to 1.5.]
χ2 test: p < 5×10^-13

        Block   Spacer
Poly.   437     3334
Div.    374     4854
Casillas, Barbadilla & Bergman (2007) Mol. Biol. Evol. 24:2222-2234
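The 2x2 comparison above can be checked with a short calculation. The sketch below uses an uncorrected Pearson chi-square with 1 degree of freedom; the slide does not say which variant of the test was used, so that choice is an assumption, and the function name `chi_square_2x2` is hypothetical.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Uncorrected Pearson chi-square for the 2x2 table [[a, b], [c, d]].

    Returns (chi2, p). With 1 degree of freedom the upper tail is
    P(X > chi2) = erfc(sqrt(chi2 / 2)).
    """
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p

# Counts from the slide: rows are polymorphism and divergence,
# columns are block and spacer.
block_poly, spacer_poly = 437, 3334
block_div, spacer_div = 374, 4854

chi2, p = chi_square_2x2(block_poly, spacer_poly, block_div, spacer_div)
block_ratio = block_poly / block_div      # polymorphism : divergence in blocks
spacer_ratio = spacer_poly / spacer_div   # polymorphism : divergence in spacers
print(f"chi2 = {chi2:.2f}, p = {p:.2e}")
print(f"block ratio = {block_ratio:.2f}, spacer ratio = {spacer_ratio:.2f}")
```

Under these assumptions the statistic is about 52.5 and the p-value falls below 5×10^-13, in line with the slide. The ratio exceeds 1 in blocks and is below 1 in spacers.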
Excess polymorphism is observed in both intergenic and intronic conserved blocks
[Figure: bar charts of the polymorphism : divergence ratio in blocks vs. spacers; panels: Intergenic, Intronic; y-axis 0 to 1.5.]
χ2 tests: p < 5×10^-5; p < 3×10^-9
Casillas, Barbadilla & Bergman (2007) Mol. Biol. Evol. 24:2222-2234
Blocks and spacers are more constrained than 4-fold degenerate “silent” sites in genes
[Figure: histogram of derived allele frequency for block, spacer and 4-fold sites. x-axis: derived allele frequency (DAF), bins 0.1 to 1.0; y-axis: frequency, 0 to 0.65.]
Block vs 4-fold: P = 2.48e−12
Spacer vs 4-fold: P = 0.00471
Casillas, Barbadilla & Bergman (2007) Mol. Biol. Evol. 24:2222-2234
Conserved noncoding sequences in Drosophila are selectively constrained for point mutations
• Reduction in polymorphism and divergence in blocks
• Excess of rare alleles in blocks
• Excess of polymorphism relative to divergence in blocks
• Not due to use of spacers as inappropriate control sequences, differences in GC content, or alignment error
• Both intergenic and intronic blocks are constrained
Outline of Talk
• Noncoding DNA, cis-regulatory annotation and Drosophila as a system
• Conserved noncoding sequences are selectively constrained.
• Spatial constraints on noncoding sequences
Pattern of noncoding sequence evolution in Drosophila: the eve stripe 2 enhancer
[Figure: comparison of the eve stripe 2 enhancer in mel against sim, yak, ere, tak, ana and pse; scale bar: 500 bp; annotation: |slope| ~ 1.]
Spacing between conserved noncoding sequences is maintained in divergent Drosophila species
[Figure: scatter plot of spacer interval length (log[bp]) in D. melanogaster (x-axis) vs. D. pseudoobscura (y-axis); both axes 0.0 to 4.0.]
r = 0.85, p < 10^-6
Bergman et al. (2002) Genome Biology 3:0086.
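The correlation reported for this comparison is computed on log-transformed spacer lengths. A minimal sketch of that calculation follows; the paired lengths are invented for illustration and do not come from the study, and the function names are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def log_length_correlation(lengths_a, lengths_b):
    """Correlate log10 spacer lengths (in bp) between two species."""
    log_a = [math.log10(v) for v in lengths_a]
    log_b = [math.log10(v) for v in lengths_b]
    return pearson_r(log_a, log_b)

# Hypothetical orthologous spacer lengths in bp, one pair per spacer interval.
mel_lengths = [12, 45, 150, 800, 2300]
pse_lengths = [15, 38, 210, 650, 3100]
r = log_length_correlation(mel_lengths, pse_lengths)
```

Working on the log scale keeps a few very long spacers from dominating the correlation.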
Indels are under constraint in mammalian noncoding DNA
Lunter et al. (2005) PLoS Comp. Biol 2:e5
[Figure: Log10(Frequency) plotted against distance between indels, for unique noncoding sequence and ancestral repeats.]
Similar constraints on insertions and deletions in blocks and spacers
[Figure: frequency distribution plot; x-axis bins 0.1 to 1.0; y-axis: frequency, 0 to 0.55.]
Casillas, Barbadilla & Bergman (2007) Mol. Biol. Evol. 24:2222-2234
[Figure: bar charts of the polymorphism : divergence ratio in blocks vs. spacers; y-axis 0 to 1.5.]
χ2 tests: p = 0.029; p = 0.568

Different selective constraint on indels not due to low power or small sample size

Indel Observed
        Block   Spacer
Poly.   66      380
Div.    107     901
χ2 test: p = 0.029

Point Mutation Observed
        Block   Spacer
Poly.   437     3334
Div.    374     4854
χ2 test: p < 5×10^-13

Point Mutation Rescaled to Indel Observed
        Block   Spacer
Poly.   51      394
Div.    72      935
χ2 test: p = 0.007
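The slide does not say how the point mutation counts were rescaled. One procedure that reproduces the rescaled table is to shrink each row (polymorphism, divergence) of the point mutation table to the corresponding indel row total and truncate to whole counts. The sketch below shows that calculation; it is a guess at the procedure and the function names are hypothetical. The test is an uncorrected Pearson chi-square, which is also an assumption.

```python
import math

# Counts from the slide, as (block, spacer) per row.
point = {"poly": (437, 3334), "div": (374, 4854)}
indel = {"poly": (66, 380), "div": (107, 901)}

def rescale_rows(counts, targets):
    """Scale each row of `counts` to the row total of `targets`, truncating."""
    rescaled = {}
    for row, (block, spacer) in counts.items():
        target_total = sum(targets[row])
        total = block + spacer
        rescaled[row] = (block * target_total // total,
                         spacer * target_total // total)
    return rescaled

def chi_square_2x2(a, b, c, d):
    """Uncorrected Pearson chi-square and 1-df p-value for [[a, b], [c, d]]."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return chi2, math.erfc(math.sqrt(chi2 / 2))

rescaled = rescale_rows(point, indel)
chi2, p = chi_square_2x2(*rescaled["poly"], *rescaled["div"])
print(rescaled, f"chi2 = {chi2:.2f}, p = {p:.4f}")
```

With the sample size cut to the indel level, the point mutation contrast between blocks and spacers still gives a p-value that rounds to the slide's 0.007, which is the basis for the claim that the weaker indel signal is not just a power issue.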
A molecular interpretation of conservation in Drosophila noncoding regions
= Conserved noncoding sequence
= Spacer intervals
= Transcription factors
A hierarchical model of spatial constraints on cis-regulatory regions
[Figure: nested schematic of a cis-regulatory region. Levels: binding site, module, enhancer, cis-regulatory region. Scale labels: ~ 5 bp, ~ 100 bp, ~ kbp.]
after Ondek et al. (1988) Nature 333:40-45
Acknowledgements
Marty Kreitman
Michael Ashburner
Sue Celniker, Gerry Rubin, Eddy Rubin
Sonia Casillas, Antonio Barbadilla
Stephen Montgomery, Obi Griffiths
Marc Halfon, Steve Gallo