Statistical Analysis for Word counting
in Drosophila Core Promoters
Yogita Mantri
April 27 2005
Bioinformatics Capstone presentation
Introduction & Motivation
Dataset used
Part I – Unbiased word counting
Part II – TCAGT-centric word counting
Conclusions and Future work
Introduction
Regulatory elements are short DNA sequences that control gene expression.
They are often found around the Transcription Start Site (TSS), sometimes further upstream.
Identification of promoters and regulatory elements is a major challenge in bioinformatics:
  Regulatory elements are not well conserved
  Computational discovery of the TSS is not straightforward
  Promoter sequences do not have distinguishable statistical properties
  Transcription is a highly cooperative process, including competitive or cooperative binding, which is not completely determined from the rest of the genome’s DNA sequence
“Computational analysis of core promoters in the Drosophila Genome”, Ohler, Rubin et al., Genome Biology 2002, 3(12):research0087.1–0087.12
Above image edited from: http://163.238.8.180/~davis/Bio_327/lectures/Transcription/TranscriptionOver.html
Drosophila Core Promoters
Motivation for project
A database of core promoters with experimentally determined TSSs is a major advantage over approaches that use only gene upstream regions.
Word Counting method to determine significant patterns, inspired by Dr. Peter Cherbas’ earlier work.
“The arthropod initiator: the capsite consensus plays an important role in transcription”, Cherbas L, Cherbas P., Insect Biochem Mol Biol. 1993 Jan;23(1):81-90
The Database of Drosophila Core Promoters
Compiled by Sumit Middha. It consists of Drosophila core promoters from three experimental sources.
Ohler, Rubin et al.: 1941 promoters
  Stringent criteria for identifying TSSs, requiring 5’ ends of multiple cDNAs to lie in close proximity.
Kadonaga et al.: 205 promoters
  Changed TSS to coincide with A of Inr consensus TCAGT even if experimental results reported TSS in the vicinity.
  The discrepancy was fixed by taking the experimentally reported TSS.
Eukaryotic Promoter Database: 1926 promoters
  Assigned TSS based on experimental data with a precision of +/- 5 bp or better.
3458 sequences after removing redundant entries in the dataset.
Word Analysis – Part I
Unbiased search
Used statistical measures such as the Z-score on all possible n-mers, both in the entire dataset and in specific windows.
The goal was to see whether known patterns of interest were significantly more enriched in promoter sequences than other patterns.
Basic Statistics of the dataset
3458 promoter sequences in the database.
First step was a word-frequency analysis (pentamers used for initial analysis).
Performed analysis on the following sets:
  Entire dataset (DS-1)
  Subset of above dataset, with only the -20 to +20 region (DS-2)
2 types of analyses, differing in the “Random” sequences used:
  1st-order Markov chains based on base and transition probabilities of the respective dataset
  “Non-coding” regions
Random set
Generated 100 sets of 1st-order Markov chains.
Each set contained the same number of sequences as the original dataset (3458), each of the same length (350).
Computed occurrence of each pentamer in actual and random sequences.
For random sequences, calculated average and S.D. over all sets.
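The random-set construction described above can be sketched as follows. This is a minimal illustration, not the code used in the project: the function names are hypothetical, sequences are assumed to be lowercase strings over a, c, g, t, and the first base of each random sequence is assumed to be drawn from the overall base frequencies.

```python
import random
from collections import Counter

ALPHABET = "acgt"

def train_first_order(seqs):
    """Estimate base counts and first-order transition counts.

    Assumes every sequence contains only the letters in ALPHABET.
    """
    base = Counter()
    trans = {a: Counter() for a in ALPHABET}
    for s in seqs:
        base.update(s)
        for prev, cur in zip(s, s[1:]):
            trans[prev][cur] += 1
    return base, trans

def sample_sequence(base, trans, length, rng):
    """Draw one sequence of the given length from the Markov chain."""
    def draw(counter):
        weights = [counter[b] for b in ALPHABET]
        if not any(weights):
            # State never observed in training: fall back to uniform.
            weights = [1] * len(ALPHABET)
        return rng.choices(ALPHABET, weights=weights)[0]

    seq = [draw(base)]
    while len(seq) < length:
        seq.append(draw(trans[seq[-1]]))
    return "".join(seq)

def random_sets(seqs, n_sets=100, seed=0):
    """Generate n_sets random sets, each matching seqs in count and lengths."""
    rng = random.Random(seed)
    base, trans = train_first_order(seqs)
    return [[sample_sequence(base, trans, len(s), rng) for s in seqs]
            for _ in range(n_sets)]

def count_kmers(seqs, k=5):
    """Count overlapping occurrences of every k-mer over all sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts
```

Calling `count_kmers` on the real sequences and on each of the random sets gives the per-pentamer counts from which the average and S.D. are taken.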
Z-score
A test of significance
Mean and S.D. calculated over 100 sets
Calculated Z-scores for all pentamers
Looking for pentamers with very high or very low Z-scores
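A minimal sketch of the Z-score step, assuming the usual definition Z = (observed count - mean of random counts) / S.D. of random counts. The slides do not say whether the sample or population S.D. was used; the sample S.D. is assumed here, and the function names are hypothetical.

```python
from statistics import mean, stdev

def z_scores(observed, random_counts):
    """Z-score of each word's observed count against the random sets.

    observed: dict mapping word -> count in the real sequences.
    random_counts: list of dicts (one per random set) mapping word -> count.
    Words whose random counts have zero spread are skipped.
    """
    words = set(observed)
    for rc in random_counts:
        words.update(rc)
    scores = {}
    for w in words:
        sample = [rc.get(w, 0) for rc in random_counts]
        sd = stdev(sample)  # sample S.D. (an assumption)
        if sd > 0:
            scores[w] = (observed.get(w, 0) - mean(sample)) / sd
    return scores

def ranked(scores):
    """Words sorted from highest to lowest Z-score."""
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Ranking by Z-score in this way puts the most over-represented words first and the most under-represented last.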
Rank of TCAGT and variants in entire dataset
Rank  Pattern  Z-score
1     aaaaa    113.037
2     ttttt    111.647
3     ttttg    88.1
4     gaaaa    83.156
5     aaaac    82.69
6     atttt    82.152
7     gtttt    82.067
8     ttttc    79.485
9     aaaat    78.348
10    gcagc    77.091
...
101   gcagt    29.269
115   tcagt    27.156
307   acagt    10.286
485   tcatt    1.375
965   tataa    -25.213
Summary of known pentamers in different windows

-20 to +20 region
Pattern  Z-score    Rank
tcagt    58.929     2
tcatt    3.6        418
gcagt    25.545     34
acagt    12.923     179
tataa    -25        1022

Sliding windows
Pattern  Z-score    Rank
tcagt    7.559871   254
tcatt    -1.402484  576
gcagt    9.0644839  200
acagt    2.7177419  409
tataa    -8.962065  880

Non-overlapping windows
Pattern  Z-score    Rank
tcagt    4.277429   356
tcatt    -2.00671   590
gcagt    7.714143   246
acagt    2.080429   435
tataa    -9.064     898
Z-score plots of tcagt and variants using sliding windows of 10 bp
Figure: Z-score (y-axis, -40 to 100) per window (x-axis: 1-50, 20-70, 40-90, 60-110, 80-130, 100-150, 120-170, 140-190, 160-210, 180-230, 200-250, 220-270, 240-290, 260-310, 280-330, 300-350). Series plotted: tcagt, acagt, gcagt, tcatt, tataa, cgtcg, aaaaa, atttt, cagcg, atatc, tagta.
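Window-based counting of this kind can be sketched as below. The helper names are hypothetical, and the width and step used in the usage lines (50 and 20 over 350 bp) are assumptions read off the window labels of the plot, not values stated in the slides. Windows are half-open (start, end) index pairs.

```python
from collections import Counter

def windows(seq_len, width, step):
    """Half-open (start, end) windows over a sequence of length seq_len."""
    return [(s, s + width) for s in range(0, seq_len - width + 1, step)]

def count_word_by_window(seqs, word, wins):
    """Count occurrences of word lying entirely inside each window."""
    counts = Counter()
    for seq in seqs:
        for start, end in wins:
            region = seq[start:end]
            # Overlapping occurrences are counted separately.
            hits = sum(1 for i in range(len(region) - len(word) + 1)
                       if region.startswith(word, i))
            counts[(start, end)] += hits
    return counts

# Sliding windows overlap (step < width); non-overlapping ones do not.
sliding = windows(350, 50, 20)
non_overlapping = windows(350, 50, 50)
```

Running the same counts on the random sets, window by window, gives a Z-score per window rather than one per dataset.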
Lesson: Cannot ignore position preference of regulatory motifs!
Word Analysis – Part II
Guided search, starting with known INR element TCAGT
  Identification of INR-enriched regions
  Identification of synonyms
  Correlation analysis of INR synonyms
  Guided search
TCAGT-centric word analysis
Figure: Z-score of TCAGT vs. position with respect to the TSS (x-axis: position, -15 to 15; y-axis: Z-score, -20 to 160).
Window    Z-score
(-3,3)    130.58
(-4,2)    116.27
(-2,4)    105.67
(-5,1)    98.96
(-6,-1)   95.71
(-7,-2)   85.83
(-1,5)    59.23
(1,6)     47.68
(2,7)     43.30
(3,8)     28.79
INR Synonyms
Group 1
  CTCAG---
  ATCAG---
  TTCAG---
  GTCAG---
  -TCAGT--
  ---AGTTG
  ---AGTCG
  --CAGTT-
  --CAGTC-
Group 2
  TTAGT
Group 3
  ACACT---
  -CACTCTG
Group 4
  -TCACA-
  GTCAC--
  --CACAC
Group 5
  TCACTCT
Group 6
  -CATTC
  TCATT-
“Computational analysis of core promoters in the Drosophila Genome”, Ohler, Rubin et al., Genome Biology 2002, 3(12):research0087.1–0087.12
Binary Tree Representation of Dataset
TOTAL: 3412
INR+ (1611)
  TATA+ (410)
    DPE+ 79
    DPE- 331
  TATA- (1201)
    DPE+ 369
    DPE- 832
INR- (1801)
  TATA+ (397)
    DPE+ 76
    DPE- 321
  TATA- (1404)
    DPE+ 232
    DPE- 1172
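The eight leaves of the tree carry enough information to recover every pairwise count. The sketch below stores the leaves keyed by (INR, TATA, DPE) presence and sums them for any combination of motif states. The assignment of leaf counts to DPE+ and DPE- is a reconstruction (it is the only assignment consistent with the row and column totals of the contingency matrices in this deck), and the `marginal` helper is hypothetical.

```python
# Leaf counts keyed by (INR present, TATA present, DPE present).
# The DPE+/DPE- assignment is reconstructed, not read directly off the slide.
LEAVES = {
    (True, True, True): 79,
    (True, True, False): 331,
    (True, False, True): 369,
    (True, False, False): 832,
    (False, True, True): 76,
    (False, True, False): 321,
    (False, False, True): 232,
    (False, False, False): 1172,
}
NAMES = ("INR", "TATA", "DPE")

def marginal(leaves, **fixed):
    """Sum the leaves that match the given motif states.

    Example: marginal(LEAVES, INR=True, DPE=True) counts INR+ DPE+ promoters.
    """
    total = 0
    for key, n in leaves.items():
        state = dict(zip(NAMES, key))
        if all(state[name] == value for name, value in fixed.items()):
            total += n
    return total
```

For instance, `marginal(LEAVES, INR=True, TATA=True)` sums the two INR+ TATA+ leaves, and `marginal(LEAVES)` with no arguments returns the dataset total.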
3 Clusters in INR-positive set
Figure: Z-score (y-axis, 0 to 250) vs. position (x-axis, -40 to +40). Labeled words: ggtcacact, ggtcacac, cggtcacac, ttcagtcg, cggacgtg, tataaaag, tcagt. Annotated regions: TATA (-40, -35), INR (-10, +2), DPE (+20, +30).
Contingency Matrices for INR, TATA, DPE

         TATA+   TATA-   Total
INR+     410     1201    1611
INR-     397     1404    1801
Total    807     2605    3412
INR+, TATA+ Log Likelihood: 0.073

         DPE+    DPE-    Total
INR+     448     1163    1611
INR-     308     1493    1801
Total    756     2656    3412
INR+, DPE+ Log Likelihood: 0.227

         DPE+    DPE-    Total
TATA+    155     652     807
TATA-    601     2004    2605
Total    756     2656    3412
TATA+, DPE+ Log Likelihood: -0.143
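The slides do not define "Log Likelihood". One reading that reproduces all three reported values is the natural log of the observed co-occurrence count over the count expected if the two motifs were independent. The sketch below works under that assumption; the function name is hypothetical.

```python
import math

def log_likelihood(both, row_total, col_total, grand_total):
    """ln(observed / expected) for one cell of a 2x2 contingency table.

    expected = row_total * col_total / grand_total, i.e. the count
    predicted if the two motifs occurred independently.
    """
    expected = row_total * col_total / grand_total
    return math.log(both / expected)

inr_tata = log_likelihood(410, 1611, 807, 3412)
inr_dpe = log_likelihood(448, 1611, 756, 3412)
tata_dpe = log_likelihood(155, 807, 756, 3412)
```

Under this reading a positive value means the pair co-occurs more often than independence predicts (INR with DPE most strongly), and a negative value means it co-occurs less often (TATA with DPE).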
Possible Alternative TATA and INR Synonyms ??
Figure: Z-score vs. position in the INR-negative set (x-axis: position, -40 to +40; y-axis: Z-score, 0 to 90). Labeled words: tctttctttggtcacac, ctcgaggg, ctatcgat, cggtcacac, ttctttccg, gtcacact. Annotated candidates: TATA-2 ?, INR-2 ?
Enrichment further upstream – New Binding Sites?
Figure: Z-score vs. position in the -100 to -40 region (x-axis: position, -100 to -40; y-axis: Z-score, 0 to 80). Labeled words: actatcgat, ctatcgat, tatcgataaactatcgat.
Next Level of Binary Tree analysis
TOTAL: 3412
INR+ (1611)
  TATA+ (410): DPE+ / DPE-
  TATA- (1201): DPE+ / DPE-
INR- (1801), split 397 / 1404
  Further splits on INR_2+ / INR_2- and TATA_2+ / TATA_2-, then DPE+ / DPE-: counts open (?)
Conclusions & Future steps
The main goal of this project was to identify significant words based only on statistical over-representation.
The first part of the analysis, using an unbiased search, was successful only in a very narrow range of positions around the TSS.
However, the biased search starting with the Inr consensus revealed the 3 known regulatory elements in that region.
An analysis of the Inr-negative set showed over-representation of patterns at the positions where the Inr, TATA and DPE would be expected; these could be possible synonyms.
Thus the word-counting strategy has the potential to reveal:
  Regulatory motifs and interrelationships that other motif discovery programs cannot
  Synonyms for regulatory motifs
  Dependencies among regulatory motifs
Acknowledgements
Dr. Haixu Tang
Dr. Sun Kim
Dr. Peter Cherbas
Sumit Middha
Bioinformatics Research Group