
Statistical Analysis for Word Counting in Drosophila Core Promoters

Yogita Mantri, April 27, 2005

Bioinformatics Capstone presentation

Introduction & Motivation

Dataset used

Part I – Unbiased word counting

Part II – TCAGT-centric word counting

Conclusions and Future work

Introduction

Regulatory elements are short DNA sequences that control gene expression.

They are often found around the Transcription Start Site (TSS), sometimes further upstream.

Identification of promoters and regulatory elements is a major challenge in bioinformatics:

Regulatory elements are not well conserved.

Computational discovery of the TSS is not straightforward.

Promoter sequences do not have distinguishable statistical properties.

Transcription is a highly cooperative process, including competitive or cooperative binding, which is not completely determined from the rest of the genome’s DNA sequence.

“Computational analysis of core promoters in the Drosophila Genome”, Ohler, Rubin et al., Genome Biology 2002, 3(12):research0087.1–0087.12

Above image edited from: http://163.238.8.180/~davis/Bio_327/lectures/Transcription/TranscriptionOver.html

Drosophila Core Promoters

Motivation for project

A database of core promoters with experimentally determined TSSs is a major advantage over approaches that use only gene upstream regions.

Word Counting method to determine significant patterns, inspired by Dr. Peter Cherbas’ earlier work.

“The arthropod initiator: the capsite consensus plays an important role in transcription”, Cherbas L, Cherbas P., Insect Biochem Mol Biol. 1993 Jan;23(1):81-90


Dataset used




The Database of Drosophila Core Promoters

Compiled by Sumit Middha. It consists of Drosophila core promoters from three experimental sources.

Ohler, Rubin et al.: 1941 promoters. Stringent criteria for identifying TSSs, requiring 5’ ends of multiple cDNAs to lie in close proximity.

Kadonaga et al.: 205 promoters. Changed TSS to coincide with A of Inr consensus TCAGT even if experimental results reported TSS in the vicinity. The discrepancy was fixed by taking the experimentally reported TSS.

Eukaryotic Promoter Database: 1926 promoters. Assigned TSS based on experimental data with a precision of +/- 5 bp or better.

3458 sequences after removing redundant entries in the dataset.



Word Analysis – Part I: Unbiased search

Used various statistical measures like Z-score on all possible n-mers in the entire dataset and in specific windows.

The goal was to see whether known patterns of interest were significantly more enriched in promoter sequences than other patterns.

Basic Statistics of the dataset

3458 promoter sequences in the database.

First step was a word-frequency analysis (pentamers used for initial analysis).

Performed analysis on the following sets:

Entire dataset (DS-1)

Subset of above dataset, with only the -20 to +20 region (DS-2)

2 types of analyses, differing in the “random” sequences used:

1st-order Markov chains based on base and transition probabilities of the respective dataset

“Non-coding” regions

Random set

Generated 100 sets of 1st-order Markov chains.

Each set contained the same number of sequences as the original dataset (3458), each having the same length (350).

Computed occurrence of each pentamer in actual and random sequences.

For random sequences, calculated average and S.D. over all sets.
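The random-set procedure described above can be sketched in Python as follows. This is only an illustration, not the code used in the project: the function names are hypothetical, sequences are assumed to be lowercase strings over a, c, g, t, and pentamers are assumed to be counted with overlaps, which the slides do not specify.

```python
import random
from collections import Counter

BASES = "acgt"

def fit_first_order(seqs):
    """Estimate base frequencies and base-to-base transition probabilities."""
    base = Counter()
    trans = {a: Counter() for a in BASES}
    for s in seqs:
        base.update(ch for ch in s if ch in trans)
        for a, b in zip(s, s[1:]):
            if a in trans and b in trans:  # skip ambiguous symbols such as 'n'
                trans[a][b] += 1
    total = sum(base[a] for a in BASES)
    p0 = {a: base[a] / total for a in BASES}
    p1 = {}
    for a in BASES:
        n = sum(trans[a].values())
        # Fall back to base frequencies if a base has no observed successor.
        p1[a] = {b: trans[a][b] / n for b in BASES} if n else dict(p0)
    return p0, p1

def sample_sequence(p0, p1, length, rng):
    """Draw one sequence from the fitted first-order Markov chain."""
    if length <= 0:
        return ""
    seq = rng.choices(BASES, weights=[p0[b] for b in BASES])
    while len(seq) < length:
        prev = seq[-1]
        seq += rng.choices(BASES, weights=[p1[prev][b] for b in BASES])
    return "".join(seq)

def count_words(seqs, k=5):
    """Count overlapping occurrences of every k-mer (assumption: overlaps count)."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts

def random_set_counts(seqs, n_sets=100, k=5, seed=0):
    """k-mer counts for n_sets random sets, each matching the real set in size and lengths."""
    rng = random.Random(seed)
    p0, p1 = fit_first_order(seqs)
    return [
        count_words([sample_sequence(p0, p1, len(s), rng) for s in seqs], k)
        for _ in range(n_sets)
    ]
```

With the real data, `seqs` would hold the 3458 promoter sequences of length 350, and `random_set_counts(seqs)` would give the 100 per-set pentamer counts from which the average and S.D. are taken.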

Z-score

A test of significance

Z = (observed count - mean) / S.D., with mean and S.D. calculated over the 100 random sets

Calculated Z-scores for all pentamers

Looking for pentamers with very high or very low Z-scores
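As a minimal sketch of the Z-score step, the snippet below scores each word against its counts in the random sets and ranks the words. The function names are hypothetical, and the use of the sample standard deviation is an assumption; the slides do not say which S.D. estimator was used.

```python
from statistics import mean, stdev

def z_score(observed, random_counts):
    """Z-score of an observed count against counts from the random sets."""
    mu = mean(random_counts)
    sd = stdev(random_counts)  # assumption: sample S.D.; pstdev is the alternative
    if sd == 0:
        return None  # undefined when the random counts do not vary
    return (observed - mu) / sd

def rank_by_z(observed, random_sets):
    """Return (word, Z-score) pairs sorted from highest to lowest Z-score."""
    scores = {}
    for word, obs in observed.items():
        z = z_score(obs, [c.get(word, 0) for c in random_sets])
        if z is not None:
            scores[word] = z
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Words at the top of the ranking are over-represented relative to the random model, and words at the bottom are under-represented.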

Rank  Pattern  Z-Score
1     aaaaa    113.037
2     ttttt    111.647
3     ttttg    88.1
4     gaaaa    83.156
5     aaaac    82.69
6     atttt    82.152
7     gtttt    82.067
8     ttttc    79.485
9     aaaat    78.348
10    gcagc    77.091
101   gcagt    29.269
115   tcagt    27.156
307   acagt    10.286
485   tcatt    1.375
965   tataa    -25.213

Rank of TCAGT and variants in entire dataset

-20 to +20

Pattern  Z-Score  Rank
tcagt    58.929   2
tcatt    3.6      418
gcagt    25.545   34
acagt    12.923   179
tataa    -25      1022

Summary of known pentamers in different windows

Pattern  Z-score    Rank
tcagt    7.559871   254
tcatt    -1.402484  576
gcagt    9.0644839  200
acagt    2.7177419  409
tataa    -8.962065  880

Sliding Windows

Pattern  Z-score   Rank
tcagt    4.277429  356
tcatt    -2.00671  590
gcagt    7.714143  246
acagt    2.080429  435
tataa    -9.064    898

Non-overlapping windows

Z-score Plots of tcagt and variants using sliding windows of 10 bp

[Figure: Z-score (y-axis, -40 to 100) versus sliding window (x-axis, windows labeled 1-50, 20-70, 40-90 and so on up to 300-350) for the words tcagt, acagt, gcagt, tcatt, tataa, cgtcg, aaaaa, atttt, cagcg, atatc and tagta.]
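The windowed counts behind such a plot could be produced along the following lines. This is a sketch under stated assumptions: the function names are hypothetical, and the window width and step are left as parameters because the plot title mentions 10 bp while the axis labels suggest 50 bp windows advancing by 20 bp. Sliding windows use a step smaller than the width; non-overlapping windows use a step equal to the width.

```python
def window_starts(seq_len, width, step):
    """0-based start offsets of all windows that fit fully inside the sequence."""
    return list(range(0, seq_len - width + 1, step))

def count_in_window(seqs, word, start, width):
    """Count occurrences of `word` lying fully inside [start, start + width)."""
    k = len(word)
    total = 0
    for s in seqs:
        region = s[start:start + width]
        total += sum(
            1 for i in range(len(region) - k + 1) if region[i:i + k] == word
        )
    return total

def window_profile(seqs, word, seq_len, width, step):
    """(start, end, count) for every window, ready to be turned into Z-scores."""
    return [
        (st, st + width, count_in_window(seqs, word, st, width))
        for st in window_starts(seq_len, width, step)
    ]
```

Running the same profile on each random set gives a per-window mean and S.D., so a Z-score can be computed separately for every window rather than once for the whole sequence.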

Lesson: Cannot ignore position preference of regulatory motifs!



Word Analysis – Part II: Guided search, starting with known INR element TCAGT

Identification of INR enriched regions

Identification of synonyms

Correlation analysis of INR synonyms

Guided search

TCAGT-centric word analysis

[Figure: Z-score of TCAGT (y-axis, -20 to 160) versus position with respect to the TSS (x-axis, -15 to 15).]

Window    Zscore
(-3,3)    130.58
(-4,2)    116.27
(-2,4)    105.67
(-5,1)    98.96
(-6,-1)   95.71
(-7,-2)   85.83
(-1,5)    59.23
(1,6)     47.68
(2,7)     43.30
(3,8)     28.79

INR Synonyms

Group 1
CTCAG---
ATCAG---
TTCAG---
GTCAG---
-TCAGT--
---AGTTG
---AGTCG
--CAGTT-
--CAGTC-

Group 2
TTAGT

Group 3
ACACT---
-CACTCTG

Group 4
-TCACA-
GTCAC--
--CACAC

Group 5
TCACTCT

Group 6
-CATTC
TCATT-

“Computational analysis of core promoters in the Drosophila Genome”, Ohler, Rubin et al., Genome Biology 2002, 3(12):research0087.1–0087.12

Binary Tree Representation of Dataset

TOTAL: 3412
  INR+: 1611
    TATA+: 410
      DPE+: 79
      DPE-: 331
    TATA-: 1201
      DPE+: 369
      DPE-: 832
  INR-: 1801
    TATA+: 397
      DPE+: 76
      DPE-: 321
    TATA-: 1404
      DPE+: 232
      DPE-: 1172
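The leaf counts of the tree can be collapsed into pairwise counts, which is a quick way to check them against the contingency matrices. The sketch below assumes the fused leaf counts on the slide read as 79, 331, 369, 832, 76, 321, 232 and 1172 in the order shown in the dictionary; this reading is an interpretation, chosen because it reproduces the totals reported elsewhere in the deck. The names `LEAVES` and `pair_counts` are hypothetical.

```python
from collections import Counter

# Assumed reading of the eight leaves: (INR, TATA, DPE) -> number of promoters.
LEAVES = {
    ("INR+", "TATA+", "DPE+"): 79,
    ("INR+", "TATA+", "DPE-"): 331,
    ("INR+", "TATA-", "DPE+"): 369,
    ("INR+", "TATA-", "DPE-"): 832,
    ("INR-", "TATA+", "DPE+"): 76,
    ("INR-", "TATA+", "DPE-"): 321,
    ("INR-", "TATA-", "DPE+"): 232,
    ("INR-", "TATA-", "DPE-"): 1172,
}

def pair_counts(leaves, i, j):
    """Sum leaf counts over the remaining motif to get a 2x2 table for motifs i and j."""
    table = Counter()
    for labels, n in leaves.items():
        table[(labels[i], labels[j])] += n
    return table

inr_tata = pair_counts(LEAVES, 0, 1)  # INR vs TATA
inr_dpe = pair_counts(LEAVES, 0, 2)   # INR vs DPE
tata_dpe = pair_counts(LEAVES, 1, 2)  # TATA vs DPE
```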

3 Clusters in INR-positive set

[Figure: Z-score (y-axis, 0 to 250) versus position (x-axis, -40 to +40) in the INR-positive set. Three clusters are marked: TATA (-40, -35), INR (-10, +2) and DPE (+20, +30). Labeled words: ggtcacact, ggtcacac, cggtcacac, ttcagtcg, cggacgtg, tataaaag, tcagt.]

        TATA+   TATA-   Total
INR+    410     1201    1611
INR-    397     1404    1801
Total   807     2605    3412

INR+, TATA+ Log Likelihood: 0.073

        DPE+    DPE-    Total
INR+    448     1163    1611
INR-    308     1493    1801
Total   756     2656    3412

INR+, DPE+ Log Likelihood: 0.227

        DPE+    DPE-    Total
TATA+   155     652     807
TATA-   601     2004    2605
Total   756     2656    3412

TATA+, DPE+ Log Likelihood: -0.143

Contingency Matrices for INR, TATA, DPE
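The slides do not define the "Log Likelihood" figure. The reported values are consistent with the natural log of the observed joint count over the count expected if the two motifs occurred independently, so the sketch below uses that reading as an assumption. The function name `log_ratio` is hypothetical.

```python
import math

def log_ratio(both, row_total, col_total, grand_total):
    """ln(observed / expected), with expected computed under independence."""
    expected = row_total * col_total / grand_total
    return math.log(both / expected)

# Counts taken from the three contingency matrices (grand total 3412).
inr_tata = log_ratio(410, 1611, 807, 3412)  # INR+ and TATA+
inr_dpe = log_ratio(448, 1611, 756, 3412)   # INR+ and DPE+
tata_dpe = log_ratio(155, 807, 756, 3412)   # TATA+ and DPE+

print(round(inr_tata, 3), round(inr_dpe, 3), round(tata_dpe, 3))
```

Under this reading the three pairs come out at about 0.073, 0.227 and -0.143. A positive value means the two motifs co-occur more often than independence would predict, and a negative value means they co-occur less often.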

Z-score vs Position in INR-negative set

[Figure: Z-score (y-axis, 0 to 90) versus position (x-axis, -40 to +40) in the INR-negative set. Labeled words: tctttctttggtcacac, ctcgaggg, ctatcgat, cggtcacac, ttctttccg, gtcacact. Two enriched regions are marked “TATA – 2 ?” and “INR – 2 ?”.]

Possible Alternative TATA and INR Synonyms?

-100 to -40 region

[Figure: Z-score (y-axis, 0 to 80) versus position (x-axis, -100 to -40). Labeled words: actatcgat, ctatcgat, tatcgataaactatcgat.]

Enrichment further upstream – New Binding Sites?

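Next Level of Binary Tree analysis

[Figure: the binary tree extended by one level. TOTAL: 3412, split into INR+ (1611) and INR- (1801), with the counts 410, 1201, 397 and 1404 at the next level. Alongside the TATA+/TATA- and DPE+/DPE- branches, new branches INR_2+/INR_2- and TATA_2+/TATA_2- are shown, each followed by DPE+/DPE- leaves; the counts for the new branches are marked “?”.]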

Conclusions & Future steps

The main goal of this project was to identify significant words based only on statistical over-representation.

The first part of the analysis using an unbiased searching method was successful only in a very narrow range of positions around the TSS.

However, the biased search starting with the Inr consensus revealed the 3 known regulatory elements in that region.

An analysis of the Inr-negative set showed over-representation of patterns at the positions where the Inr, TATA and DPE would be expected; these patterns could be synonyms.

Thus the word-counting strategy has the potential to reveal:

Regulatory motifs and interrelationships that other motif discovery programs cannot

Synonyms for regulatory motifs

Dependencies among regulatory motifs

Acknowledgements

Dr. Haixu Tang

Dr. Sun Kim

Dr. Peter Cherbas

Sumit Middha

Bioinformatics Research Group