
Proceedings of the South Dakota Academy of Science, Vol. 86 (2007) 177

A STUDY OF PROMOTER ELEMENTS OF HOMO SAPIENS

Yunkai Liu, Sujuan Ye and Asai Asaithambi
Department of Computer Science
University of South Dakota
Vermillion, SD 57069

ABSTRACT

Promoters are essential functional units of gene transcription. The discovery of combinatorial patterns of promoter elements is a challenging problem in computational biology. In this paper, we studied promoter elements (PEs) in extended core promoter regions of Homo sapiens using a binary matrix representation. Our study revealed many new and interesting patterns of occurrence or non-occurrence of these PEs. Sp1 and Inr are the most commonly occurring PEs. We observed that two different PEs do not occur simultaneously, and that the same PE does not occur in different windows simultaneously. The set of PE-Window combinations, which we call features, lends itself to a partitioning into two subsets such that the presence of any feature in one set implies the absence of any feature in the other. Communities characterized by high occurrences of a select set of features exhibit some significant properties with respect to the presence or absence of other features.

Key words

Core promoters, promoter elements, features, binary representation, communities

INTRODUCTION

Promoters are essential functional units of gene transcription that integrate all signals influencing gene transcription at the molecular level. They are generally found around the transcription start site (TSS) within the gene. It is known that a narrow region around the TSS has distinct nucleotide compositional properties (Majewski and Ott [1]). This region, known as the core promoter region, is important in understanding the functionalities of promoters (Butler and Kadonaga [2]). The detection of the promoter region and the determination of the location of the TSS are difficult problems in computational biology. This is mainly because the definition of the core promoter region used by Butler and Kadonaga [2] and Smale and Kadonaga [6], as the minimal continuous segment of DNA sufficient for accurate initiation and direction of transcription, is somewhat incomplete in the sense that it does not specify the length of the region. For example, the promoter elements TATA box and Initiator always function together to activate genetic transcription. However, they have different density distributions in the promoter area. Thus, the study of combinatorial patterns of promoter elements is an important and challenging problem in both biology and computational science.

In this paper we are interested not only in the binding sites that have been identified biologically as those which are initiated by specific transcription factors such as the RNA Polymerase II, but also in those short DNA subsequences that have high frequencies in the promoter area even though their biological significance has not yet been established. We refer to both kinds of subsequences as promoter elements (PEs) in this paper.

Regularities in promoter regions are usually studied in terms of the occurrence patterns of PEs in them. It is important to note that studies of regularities in promoter regions of Homo sapiens reported in the literature thus far, for example those of Suzuki et al. [8] and Bajic et al. [10], seem to concentrate on the occurrence patterns or over-representation of individual PEs in the promoter regions or subregions studied. The intention of our study was to investigate pairs of PEs in specific locations, relating the occurrence or non-occurrence pattern of each member of the pair to the corresponding pattern of the other member. Our study revealed some interesting new relationships that could form the basis for further in-depth analyses of regularities and assist in the accurate determination of the promoter region and the TSS location.

We studied 20 different promoter elements, some already identified and some newly found, in three windows in the promoter area of Homo sapiens. To identify new PEs, we started with an exhaustive search for arbitrary subsequences of length 6 to 8 nt in the promoter sequences studied and recorded their frequencies of occurrence. Then, we used a combination of several tools for PE finding and PE selection in each of the three windows. This helped us narrow the study to a set of 20 different PEs, each of which may occur in one or more of the three windows.

By comparing pairs of columns of the binary matrix representing the occurrence patterns of the 20 PEs, we were able to identify several interesting regularities that have not been reported thus far. First, we discovered that when we consider each PE-Window combination as a unique feature, two different features rarely occurred together, and the presence of a select set of features implies the absence of a select set of other features. Second, the binary matrix representation also facilitated further analysis of “communities” characterized by richness of a select group of features, called base features, taken one at a time, which revealed more regularities. We defined and calculated a community representation index for each of the other features and discovered that some features are overrepresented while others are underrepresented in these communities.


MATERIALS AND METHODS

As promoter data, we selected all human promoters that belong to the non-redundant groups defined in the Eukaryotic Promoter Database (EPD) and the Database of Transcriptional Start Sites (DBTSS). Our data sets included 1,532 genes from EPD and 9,137 genes from DBTSS. We used the region [-100, +50] relative to the annotated TSS location given in the two databases. We further subdivided this region into three overlapping windows, W1: [-100, -50], W2: [-60, +10] and W3: [+1, +50]. The overlaps take into account the fact that the TSS location is accurate only to within ±10 nt, so that any imprecision in the annotated TSS can be accommodated. As non-promoter sequences, we used the coding sequences corresponding to the genes in our data sets [10]. The upstream sequences in the region [-500, +100] were also used to check positional bias [8]. All data were downloaded from the European Molecular Biology Laboratory (EMBL) and the National Center for Biotechnology Information (NCBI).

Table 1. In this paper, we focused on 20 different promoter elements (PEs). The list of PEs studied includes those binding sites that have been identified to initiate gene expression and those subsequences that have high density and positional bias in the promoter area. In the rest of the paper, we refer to each PE-window pair as a feature characterizing the promoter region. A total of 50 features were studied in our data sets.

| PE name (biologically identified) | PE sequence | Where | Newly found PE sequence | Where |
|---|---|---|---|---|
| TATA | TATAAA | W2 | SATTGGY | W1, W2 |
| BRE | SSRCGCC | W1, W2, W3 | SATTGGY | W1, W2 |
| Inr | YYANWYY | W1, W2, W3 | GGGSSSGGGC | W1, W2 |
| Med-1 | GCTCCS or SGGAGC | W1, W2, W3 | SCGGAAG | W1, W2, W3 |
| CAAT | CCAAT or ATTGG | W1, W2, W3 | TTCCGG | W1, W2, W3 |
| E | CANNTG | W1, W2, W3 | GRGSMGGRRG | W1, W2, W3 |
| Sp1 | CCGCCC or CMGTTK | W1, W2, W3 | GCGSMKGSG | W1, W2, W3 |
| c-myb | YAACKG or CMGTTK | W1, W2, W3 | RYGGCGGC | W2, W3 |
| GATA | WGATMR or YKATCW | W1, W2, W3 | AAGATG | W3 |
| PU.1 | DVVGGAAVY | W1, W2, W3 | SATGGCG | W3 |
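The PE sequences in Table 1 use degenerate IUPAC nucleotide codes, so testing whether a PE occurs in a window amounts to pattern matching. The following is a minimal sketch of one way to do this, assuming each window is available as a plain DNA string; the function names are illustrative and are not taken from the paper, and overlapping hits are counted separately, which is also an assumption.

```python
import re

# Standard IUPAC nucleotide codes mapped to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def motif_to_regex(motif):
    """Compile an IUPAC motif; the lookahead allows overlapping hits to be found."""
    return re.compile("(?=" + "".join(IUPAC[c] for c in motif.upper()) + ")")

def count_occurrences(motif, window_seq):
    """Number of start positions in window_seq at which the motif matches."""
    return sum(1 for _ in motif_to_regex(motif).finditer(window_seq.upper()))

def feature_present(alternatives, window_seq):
    """A PE written as 'X or Y' in Table 1 is present if any alternative matches."""
    return any(count_occurrences(m, window_seq) > 0 for m in alternatives)
```

For example, `feature_present(["CCAAT", "ATTGG"], w1_seq)` would test the CAAT element in a hypothetical W1 string `w1_seq`.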

Among the many promoter elements that have been biologically identified as binding to the RNA Polymerase II and initiating gene expression in Homo sapiens, we chose to study 10 well-known “biologically significant” PEs in each window [7] [8] [9], whose densities in our data sets range from 6% to 30%. However, some special PEs, such as DPE, were not considered in this study because, given their high density of occurrence, we felt it was not reasonable to study them along with other rarely occurring or studied PEs (see Table 1).

As mentioned earlier, we obtained new PEs by directly enumerating all possible subsequences of length 6 to 8 nt in the promoter sequences studied and recording their frequencies of occurrence in each window. Simultaneously, we used Gibbs sampling (AlignACE) for the initial PE finding [12]. To ensure that these subsequences have positional bias in the promoter area, we tabulated the frequencies of occurrence of each subsequence in twelve disjoint segments of the extended region [-500, +100], and selected those subsequences with a cut-off variance of 7 [8]. Furthermore, we narrowed the selection of new PEs by random bootstrap sampling in each window [11] and by comparing the frequencies of occurrence in promoter sequences with the corresponding frequencies in the non-promoter sequences ([10]; see Table 2 for more details). Finally, we selected 10 new PEs, and we decided not to study a PE in a specific window if its frequency of occurrence in that window was less than 5% (see Table 1).
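The enumeration and positional-bias steps described above can be sketched as follows. This is an illustration under stated assumptions rather than the procedure actually used: frequency is taken to mean the number of sequences containing a subsequence, each hit is assigned to the segment that contains its start position, and the bias score is the population variance of the twelve per-segment counts. How the variance behind the cut-off of 7 was computed or normalized is not stated in the text, so the score below should not be compared against that threshold directly.

```python
from collections import Counter
from statistics import pvariance

def kmer_sequence_counts(sequences, k_min=6, k_max=8):
    """For each subsequence of length k_min..k_max, count the sequences containing it."""
    counts = Counter()
    for seq in sequences:
        seen = set()
        for k in range(k_min, k_max + 1):
            for start in range(len(seq) - k + 1):
                seen.add(seq[start:start + k])
        counts.update(seen)  # each sequence contributes at most 1 per subsequence
    return counts

def segment_counts(kmer, sequences, n_segments=12):
    """Occurrences of kmer per equal-length segment, keyed by the hit's start position."""
    totals = [0] * n_segments
    for seq in sequences:
        seg_len = max(1, len(seq) // n_segments)
        start = seq.find(kmer)
        while start != -1:
            totals[min(start // seg_len, n_segments - 1)] += 1
            start = seq.find(kmer, start + 1)
    return totals

def positional_bias(kmer, sequences, n_segments=12):
    """Population variance of the per-segment counts (an assumed bias score)."""
    return pvariance(segment_counts(kmer, sequences, n_segments))
```

For sequences covering [-500, +100], twelve segments correspond to 50 nt each.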

Table 2. We narrowed the selection of newly found PEs by random bootstrap sampling in each window and by comparing the frequencies of their occurrence in promoter sequences with the corresponding frequencies in the non-promoter sequences. In this table, for each window, f denotes the number of promoter sequences in which a PE occurs (referred to as its frequency), f1 the frequency of the PE in the randomly sampled sequences, and f2 the frequency of the PE in the coding sequences. The ratios R1 = f / f1 and R2 = f / f2 were useful in the final selection of the PEs for our study. The PE sequences are represented using the four bases A, T, G, C and eleven degenerate IUPAC nucleotide codes [13].

| Newly found PE | Where | Database | f | f1 | R1 | f2 | R2 |
|---|---|---|---|---|---|---|---|
| SATTGGY | W1 | EPD | 89 | 17 | 5.24 | 10 | 8.90 |
| | | DBTSS | 470 | 98 | 4.80 | 72 | 6.53 |
| GGGSSSGGGC | W1 | EPD | 59 | 1 | 59.00 | 29 | 2.03 |
| | | DBTSS | 411 | 58 | 7.09 | 153 | 2.69 |
| GGGAGGRR | W1 | EPD | 46 | 9 | 5.11 | 31 | 1.48 |
| | | DBTSS | 515 | 65 | 7.92 | 218 | 2.36 |
| SCGGAAG | W2 | EPD | 126 | 12 | 10.50 | 12 | 10.50 |
| | | DBTSS | 675 | 103 | 6.55 | 124 | 5.44 |
| TTCCGG | W2 | EPD | 121 | 11 | 11.00 | 17 | 7.12 |
| | | DBTSS | 623 | 101 | 6.17 | 176 | 3.54 |
| GRGSMGGRRG | W2 | EPD | 54 | 13 | 4.15 | 36 | 1.50 |
| | | DBTSS | 529 | 116 | 4.56 | 308 | 1.72 |
| GGGSGCGG | W2 | EPD | 70 | 2 | 35.00 | 51 | 1.37 |
| | | DBTSS | 547 | 28 | 19.54 | 437 | 1.25 |
| AAGATG | W3 | EPD | 80 | 51 | 1.57 | 8 | 10.00 |
| | | DBTSS | 420 | 323 | 1.30 | 60 | 7.00 |
| SATGGCG | W3 | EPD | 108 | 7 | 15.43 | 11 | 9.82 |
| | | DBTSS | 570 | 45 | 12.67 | 78 | 7.31 |
| RYGGCGGC | W3 | EPD | 105 | 10 | 10.50 | 24 | 4.38 |
| | | DBTSS | 704 | 68 | 10.35 | 141 | 4.99 |
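As a quick check of how the two selection ratios in Table 2 are formed, the short sketch below recomputes R1 and R2 for SATTGGY in W1 from the tabulated f, f1 and f2. The function name is illustrative only.

```python
def selection_ratios(f, f1, f2):
    """Return (R1, R2) = (f / f1, f / f2), rounded to two decimals as in Table 2."""
    return round(f / f1, 2), round(f / f2, 2)

# SATTGGY in W1, using the f, f1 and f2 values listed in Table 2.
epd = selection_ratios(89, 17, 10)     # Table 2 lists R1 = 5.24, R2 = 8.90
dbtss = selection_ratios(470, 98, 72)  # Table 2 lists R1 = 4.80, R2 = 6.53
```

A ratio well above 1 means the PE is found in more promoter sequences than in the comparison set, either the randomly sampled sequences (R1) or the coding sequences (R2).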


ANALYSIS OF FEATURES

Binary Representation of Features

After selecting the PEs we wanted to investigate, we represented their occurrence patterns in the windows as a binary matrix, in which 1 (respectively 0) denotes the occurrence (respectively non-occurrence) of a PE in a specific window, that is, of a feature. Our choice of a binary matrix representation was driven by our observation that, for each feature, the sequences in which it occurs more than once make up only a small fraction (up to 15%) of the sequences in which it occurs at all. For instance, the TATA box in window 1 appears at least once in 6 of the 1532 sequences in EPD and 22 of the 9137 sequences in DBTSS, and it never occurs more than once in either database in any of the three windows studied. This behavior was present in the occurrence patterns of all 50 features we studied (see Table 3 for more details). Therefore we concluded that it suffices to consider the presence or absence of a feature, as opposed to how many times a PE occurred in each window. The data compiled in Table 3 also helped us conclude that it is not necessary to study each PE in all three windows. Thus the occurrence pattern of these features was represented by $M_{\mathrm{EPD}}$, a 1532 × 50 binary matrix for the EPD data, and by $M_{\mathrm{DBTSS}}$, a 9137 × 50 binary matrix for the DBTSS data. The rows of these matrices correspond to the promoter sequences. Each column represents a feature, namely, a unique PE-window combination.
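The step from occurrence counts to the binary representation is simple enough to state in code. The sketch below assumes a hypothetical table of counts with one row per promoter sequence and one column per feature; it also computes the density of 1s, which is the quantity behind the sparsity figures quoted later in the paper.

```python
def binary_matrix(count_rows):
    """Map per-sequence occurrence counts to 0/1 presence flags, one row per sequence."""
    return [[1 if c > 0 else 0 for c in row] for row in count_rows]

def density(matrix):
    """Fraction of matrix entries equal to 1."""
    total = sum(len(row) for row in matrix)
    return sum(map(sum, matrix)) / total
```

Applied to the real data, `binary_matrix` would yield a 1532 × 50 matrix for EPD and a 9137 × 50 matrix for DBTSS.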

Definitions of Relationships between Features

We compared the occurrence patterns of features by comparing pairs of columns of the binary matrix representing the data. In-depth analysis was performed based on (i) a similarity score, and (ii) the density of occurrence of 1s. Let M be an n × m matrix, where n (rows) is the number of promoter sequences and m (columns) the number of features.

Definition 1. Let i and j denote any two different columns of M. Then we define the similarity index $s_{ij}$ as the proportion of promoter sequences in which column i and column j have identical entries, that is, both are 0 or both are 1. Equivalently,

$$ s_{ij} = \frac{1}{n} \sum_{k \,:\, M_{ki} = M_{kj}} 1 . $$

In other words, $s_{ij}$ denotes the proportion of the promoter sequences in which column i and column j are identical. We interpreted a very high similarity index as an indication that it is sufficient to study the PE-window combination corresponding to either column i or column j, and not both. We noted that comparing any two columns i and j in the matrices $M_{\mathrm{EPD}}$ and $M_{\mathrm{DBTSS}}$ corresponds to studying the frequencies of the ordered pairs (0, 0), (0, 1), (1, 0), and (1, 1) in the submatrices induced by these two columns. Since (0, 0) corresponds to the non-occurrence of both features, we decided to use more meaningful measures of similarity, as follows.
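A small sketch of the similarity index of Definition 1 may help show why it is not very informative on sparse data. The matrix below is a made-up toy example, not data from the study: the two features never co-occur, yet the index is high because most rows are (0, 0).

```python
def similarity_index(M, i, j):
    """Proportion of rows in which columns i and j hold the same value."""
    return sum(1 for row in M if row[i] == row[j]) / len(M)

# Toy matrix: 10 sequences, 2 features that never occur together.
toy = [[1, 0], [0, 1]] + [[0, 0]] * 8
s01 = similarity_index(toy, 0, 1)  # 8 of 10 rows agree, all of them (0, 0)
```

This is the motivation for the measures that exclude the pair (0, 0).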


Table 3. The features (PE-window combinations) we studied, with the number of promoter sequences in which each feature occurred at least once (f) and the number of sequences in which it occurred more than once (fm). Each cell lists the EPD value followed by the DBTSS value. We observe that fm is much smaller than f (up to 15%) for each feature.

| PE name | W1 f | W1 fm | W2 f | W2 fm | W3 f | W3 fm |
|---|---|---|---|---|---|---|
| TATA | 6 / 22 | 0 / 0 | 118 / 320 | 0 / 2 | 1 / 27 | 0 / 0 |
| Inr | 208 / 1514 | 18 / 175 | 354 / 2409 | 56 / 376 | 163 / 1105 | 8 / 98 |
| BRE | 145 / 1100 | 4 / 71 | 210 / 1414 | 12 / 160 | 91 / 761 | 5 / 40 |
| Med-1 | 184 / 1420 | 21 / 126 | 274 / 1914 | 31 / 242 | 234 / 1733 | 21 / 211 |
| CAAT | 255 / 1382 | 29 / 141 | 137 / 8347 | 9 / 74 | 22 / 228 | 0 / 2 |
| E | 161 / 1255 | 5 / 96 | 256 / 1878 | 31 / 229 | 193 / 1310 | 13 / 104 |
| GATA | 81 / 422 | 10 / 44 | 74 / 585 | 9 / 60 | 58 / 399 | 9 / 31 |
| Sp1 | 381 / 2660 | 85 / 647 | 404 / 3067 | 85 / 753 | 126 / 941 | 5 / 48 |
| PU.1 | 104 / 563 | 2 / 19 | 179 / 986 | 19 / 99 | 70 / 423 | 2 / 5 |
| c-myb | 64 / 391 | 3 / 11 | 111 / 728 | 7 / 29 | 63 / 422 | 1 / 15 |
| SATTGGY | 89 / 470 | 3 / 14 | 49 / 265 | 0 / 7 | 2 / 47 | 0 / 0 |
| GGGSSSGGGC | 59 / 411 | 3 / 47 | 58 / 502 | 5 / 39 | 5 / 50 | 0 / 0 |
| SCGGAAG | 60 / 285 | 2 / 12 | 126 / 675 | 14 / 68 | 24 / 138 | 1 / 1 |
| TTCCGG | 47 / 293 | 0 / 9 | 121 / 623 | 16 / 71 | 40 / 274 | 0 / 4 |
| GRGSMGGRRG | 46 / 442 | 1 / 38 | 54 / 529 | 7 / 41 | 19 / 223 | 0 / 9 |
| GCGSMKGSG | 86 / 537 | 1 / 46 | 123 / 864 | 9 / 82 | 68 / 511 | 4 / 39 |
| GGGAGGRR | 46 / 515 | 3 / 49 | 63 / 560 | 3 / 47 | 16 / 179 | 0 / 3 |
| RYGGCGGC | 33 / 241 | 1 / 19 | 58 / 419 | 3 / 62 | 105 / 704 | 7 / 85 |
| AAGATG | 11 / 56 | 0 / 1 | 20 / 111 | 0 / 0 | 80 / 420 | 0 / 5 |
| SATGGCG | 11 / 66 | 0 / 1 | 12 / 77 | 0 / 0 | 108 / 570 | 1 / 2 |


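Definition 2. Let i and j denote any two different columns of M. Then we define the maximal non-zero comparison index $MNZC_{ij}$ as the maximum of the frequencies of the ordered pairs (0, 1), (1, 0), and (1, 1) in the submatrix induced by columns i and j, expressed as a proportion of the total number of ordered pairs n. Equivalently, if we let

$$ p^{0,1}_{ij} = \frac{1}{n}\,[\text{frequency of } (0,1) \text{ in the submatrix induced by columns } i \text{ and } j], $$

$$ p^{1,0}_{ij} = \frac{1}{n}\,[\text{frequency of } (1,0) \text{ in the submatrix induced by columns } i \text{ and } j], $$

$$ p^{1,1}_{ij} = \frac{1}{n}\,[\text{frequency of } (1,1) \text{ in the submatrix induced by columns } i \text{ and } j], $$

then

$$ MNZC_{ij} = \max\left(p^{0,1}_{ij},\; p^{1,0}_{ij},\; p^{1,1}_{ij}\right). $$

We also define the direction of the maximal non-zero content ($DMNC_{ij}$) induced by columns i and j as

$$ DMNC_{ij} = \begin{cases} 1 & \text{if } p^{0,1}_{ij} = MNZC_{ij}, \\ 2 & \text{if } p^{1,0}_{ij} = MNZC_{ij}, \\ 3 & \text{if } p^{1,1}_{ij} = MNZC_{ij}. \end{cases} $$

Note that the direction information from the above definition is significant in that it provides insight into the relationships between pairs of features. In particular, $DMNC_{ij} = 1$ corresponds to the absence of feature i accompanied by the simultaneous presence of feature j, $DMNC_{ij} = 2$ corresponds to the presence of feature i accompanied by the simultaneous absence of feature j, and $DMNC_{ij} = 3$ corresponds to the presence of both features.

Definition 3. Let i and j denote any two different columns of M. Then the dominant non-zero comparison index $DNZC_{ij}$ is defined as the maximum of the frequencies of the ordered pairs (0, 1), (1, 0), and (1, 1) in the submatrix induced by columns i and j, expressed as a proportion of the total number of nonzero ordered pairs $p^{0,1}_{ij} + p^{1,0}_{ij} + p^{1,1}_{ij}$. Equivalently,

$$ DNZC_{ij} = \frac{\max\left(p^{0,1}_{ij},\; p^{1,0}_{ij},\; p^{1,1}_{ij}\right)}{p^{0,1}_{ij} + p^{1,0}_{ij} + p^{1,1}_{ij}}. $$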
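The three indices of Definitions 2 and 3 can be computed together from the counts of the nonzero ordered pairs in two columns. The sketch below is one possible reading of the definitions; how ties between equal maxima are resolved is not stated in the text, so the choice of the smallest direction code here is an assumption, as is returning 0 for DNZC when no nonzero pair exists.

```python
def comparison_indices(M, i, j):
    """Return (MNZC, DMNC, DNZC) for columns i and j of the binary matrix M."""
    n = len(M)
    # Counts of the ordered pairs (0, 1), (1, 0) and (1, 1), in that order.
    counts = [0, 0, 0]
    for row in M:
        pair = (row[i], row[j])
        if pair == (0, 1):
            counts[0] += 1
        elif pair == (1, 0):
            counts[1] += 1
        elif pair == (1, 1):
            counts[2] += 1
    top = max(counts)
    mnzc = top / n
    dmnc = counts.index(top) + 1  # 1: (0, 1), 2: (1, 0), 3: (1, 1); ties go to the lowest code
    nonzero = sum(counts)
    dnzc = top / nonzero if nonzero else 0.0
    return mnzc, dmnc, dnzc

# Toy matrix: 3 rows (1, 0), 1 row (0, 1), 1 row (1, 1), 5 rows (0, 0).
toy = [[1, 0]] * 3 + [[0, 1]] + [[1, 1]] + [[0, 0]] * 5
mnzc, dmnc, dnzc = comparison_indices(toy, 0, 1)
```

In the toy example the pair (1, 0) dominates, so the direction code is 2, and three of the five nonzero pairs give a DNZC of 0.6.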

Analysis of the Relationships among PE-Window Combinations

Note that the column indexes i and j used in the above definitions range from 1 to 50, since there are 50 features in all. Therefore, a consequence of these definitions is that the study of occurrence patterns in the two databases reduces to analyzing the structure of the three 50 × 50 matrices MNZC, DMNC, and DNZC for each of the two databases, EPD and DBTSS, from which we obtained our promoter sequences. Interestingly, the structure of each of the matrices for EPD and the structure of the corresponding matrix for DBTSS were identical. We made the following identical observations for both sets of matrices.

First, the entries in MNZC ranged from 0.1 to 0.3. This indicates that there is a very high percentage of the ordered pair (0, 0) in all pairs of columns compared. This observation is consistent with the fact that both $M_{\mathrm{EPD}}$ and $M_{\mathrm{DBTSS}}$, from which MNZC was derived, are highly sparse, consisting of only about 9% 1s.

Table 4. When the cut-off thresholds $MNZC_{ij} > 0.2$ and $DNZC_{ij} > 0.7$ are used to define the existence of a relationship between any two features i and j, the structure of the resulting matrix reveals a partitioning of a subset of the feature set. The occurrence of each feature in the left column of this table implies the non-occurrence of each feature in the right column. Recall that W1 = [-100, -50], W2 = [-60, +10], and W3 = [+1, +50].

| PE-Window Set 1 | PE-Window Set 2 |
|---|---|
| Sp1 in W1, W2 | c-myb in W1, W2, W3 |
| Inr in W2 | GATA in W1, W2, W3 |
| | PU.1 in W1, W3 |
| | SATTGGY in W1, W2 |
| | GGGSSSGGGC in W1, W2 |
| | GGGAGGRR in W1, W2 |
| | SCGGAAG in W1, W2, W3 |
| | TTCCGG in W1, W2, W3 |
| | GRGSMGGRRG in W1, W2, W3 |
| | GCGSMKGSG in W1, W2, W3 |
| | TATA in W2 |
| | CAAT in W2 |
| | RYGGCGGC in W2, W3 |
| | BRE in W3 |
| | CAAT in W3 |
| | Sp1 in W3 |
| | AAGATG in W3 |
| | SATGGCG in W3 |

Second, the DMNC matrices contained only 1s and 2s, but no 3s. This indicates that the ordered pairs (0, 1) and (1, 0) were more significant than the pair (1, 1) based on our criteria. In other words, strong similarities between any two features that may be signaled by the occurrence of both features are insignificant; on the other hand, the presence of one feature accompanied by the simultaneous absence of another feature seems to be more significant. While the diverse distribution of features may explain the difficulty associated with studying PEs or PE-based features in the promoter area, this observation leads us to believe that the non-occurrences of features may also be able to describe the environment, biological or computational, around the TSS or the core promoter area.

Third, in order to get more information on the occurrence patterns, for each database we constructed a 50 × 50 matrix A defined by

$$ A_{ij} = \begin{cases} 1, & \text{if } MNZC_{ij} > 0.2 \text{ and } DNZC_{ij} > 0.7, \\ 0, & \text{otherwise.} \end{cases} $$

Interestingly, we found that A had the same structure for both databases. In both cases, A had the structure of the adjacency matrix of a complete bipartite graph induced by a subset of the feature set. In addition, it was even more interesting to see that $DMNC_{ij} = 2$ (corresponding to (1, 0) being the dominant relationship) whenever $A_{ij} \neq 0$. This means that if we let F denote the subset of the features (PE-Window combinations) defining the bipartite graph represented in A, then F may be partitioned into two subsets $F_1$ and $F_2$ such that the occurrence of a feature in $F_1$ implies the non-occurrence of any feature in $F_2$. Also, no two features that both belong to $F_1$, and no two features that both belong to $F_2$, bear such a relationship with respect to their occurrences. Table 4 shows the partition of the set of features.

It is also important to note that extending our study from two columns using nonzero ordered pairs to three columns using nonzero triples (or, more generally, to k columns using the corresponding nonzero k-tuples) did not result in any further partitioning of the feature set. The partition shown in Table 4 is also consistent with the observation that Sp1 in W1 and W2, and Inr in W2, have the highest occurrence densities in our data sets.
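The thresholding step and the bipartite structure described above can be checked mechanically. The following sketch assumes the MNZC and DNZC values are available as square lists of lists, treats the relationship as undirected, and uses a standard two-colouring to recover the two sides; the function names and the 3-feature toy input are illustrative and do not come from the study.

```python
def threshold_matrix(mnzc, dnzc, t_mnzc=0.2, t_dnzc=0.7):
    """A[i][j] = 1 when both cut-offs are exceeded for two different features."""
    m = len(mnzc)
    return [[1 if i != j and mnzc[i][j] > t_mnzc and dnzc[i][j] > t_dnzc else 0
             for j in range(m)] for i in range(m)]

def bipartition(A):
    """Two-colour the features that have at least one edge; None if not bipartite."""
    m = len(A)
    adj = {v: [u for u in range(m) if A[v][u] or A[u][v]] for v in range(m)}
    colour = {}
    for s in range(m):
        if s in colour or not adj[s]:
            continue
        colour[s] = 0
        stack = [s]
        while stack:
            v = stack.pop()
            for u in adj[v]:
                if u not in colour:
                    colour[u] = 1 - colour[v]
                    stack.append(u)
                elif colour[u] == colour[v]:
                    return None
    side1 = sorted(v for v, c in colour.items() if c == 0)
    side2 = sorted(v for v, c in colour.items() if c == 1)
    return side1, side2

def is_complete_bipartite(A, side1, side2):
    """True if every feature on one side is related to every feature on the other."""
    return all(A[u][v] or A[v][u] for u in side1 for v in side2)

# Toy input: feature 0 is related to features 1 and 2, which are unrelated to each other.
mnzc = [[0, 0.25, 0.3], [0.25, 0, 0.1], [0.3, 0.1, 0]]
dnzc = [[0, 0.8, 0.9], [0.8, 0, 0.5], [0.9, 0.5, 0]]
A = threshold_matrix(mnzc, dnzc)
parts = bipartition(A)
```

Features with no edge at all are left out of both sides, which mirrors the statement that only a subset F of the feature set takes part in the partition.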

Communities Based on Feature Partitioning

Studies on the promoter region thus far have been based on specific PEs such as the GC-content and the TATA box. The GC-content of a sequence, determined as the ratio of the number of G and C nucleotides to the total number of nucleotides in the sequence ([14] [15]), has been useful in demonstrating that other PEs have strong preferences for areas with different GC-contents. This is natural, since the promoter elements themselves have different percentages of G and C. Therefore, it is not surprising to find that more GC-rich PEs are present in areas of high GC-content (60% or more) [10]. Next, although the TATA box is one of the most popular PEs that have been identified to bind with the RNA Polymerase II [6], it is not directly useful for describing or predicting the TSS or the core promoter region because of its low density of occurrence (less than 8%) in the Homo sapiens data sets that we studied.


Table 5. The features with high and low CRIs in the community based on Sp1 in W1.

| Feature | EPD $f_i^c$ | EPD $f_i^e$ | EPD CRI | DBTSS $f_i^c$ | DBTSS $f_i^e$ | DBTSS CRI |
|---|---|---|---|---|---|---|
| GGGSSSGGGC in W1 | 58 | 59 | 3.301973531 | 370 | 399 | 2.906636339 |
| GRGSMGGRRG in W1 | 28 | 45 | 2.089984779 | 229 | 431 | 1.665405838 |
| Med-1 in W1 | 78 | 182 | 1.439530333 | 511 | 1304 | 1.228302057 |
| GATA in W2 | 6 | 60 | 0.335890411 | 80 | 435 | 0.57645127 |
| c-myb in W1 | 10 | 64 | 0.524828767 | 70 | 372 | 0.589816572 |
| Inr in W1 | 33 | 197 | 0.562659064 | 258 | 1250 | 0.646951261 |

Table 6. The features with high and low CRIs in the community based on Inr in W2.

| Feature | EPD $f_i^c$ | EPD $f_i^e$ | EPD CRI | DBTSS $f_i^c$ | DBTSS $f_i^e$ | DBTSS CRI |
|---|---|---|---|---|---|---|
| Inr in W3 | 51 | 151 | 1.331445242 | 317 | 937 | 1.401344089 |
| c-myb in W2 | 23 | 89 | 1.417392247 | 197 | 596 | 1.369131243 |
| c-myb in W3 | 23 | 63 | 1.439187465 | 109 | 362 | 1.417463471 |
| GGGSSSGGGC in W2 | 7 | 57 | 0.484120269 | 63 | 492 | 0.530396392 |
| SCGGAAG in W2 | 14 | 114 | 0.484120269 | 126 | 630 | 0.828428651 |
| SCGGAAG in W3 | 2 | 24 | 0.328510182 | 27 | 138 | 0.810419332 |

Table 7. The features with high and low CRIs in the community based on Sp1 in W2.

| Feature | EPD $f_i^c$ | EPD $f_i^e$ | EPD CRI | DBTSS $f_i^c$ | DBTSS $f_i^e$ | DBTSS CRI |
|---|---|---|---|---|---|---|
| GGGSSSGGGC in W2 | 52 | 57 | 3.29928065 | 436 | 492 | 2.65071945 |
| GRGSMGGRRG in W3 | 31 | 54 | 2.076149896 | 275 | 523 | 1.572799576 |
| BRE in W3 | 31 | 94 | 1.351226944 | 320 | 720 | 1.329412813 |
| Inr in W1 | 28 | 197 | 0.51402303 | 284 | 1250 | 0.672417001 |
| GATA in W1 | 11 | 74 | 0.537590688 | 67 | 382 | 0.524630842 |
| GATA in W2 | 4 | 60 | 0.241101278 | 67 | 435 | 0.460710302 |

Therefore, we propose a new way to partition the data sets into different communities, each of which is based on the existence of a distinct feature (PE-Window combination). We have chosen the features shown in the left column of Table 4. We will call these features (Sp1 in W1, Inr in W2, and Sp1 in W2) the base features in the rest of this section. Each community consists of those genes in which one of the base features occurs. The communities corresponding to pairs of base features may overlap, but we do not consider the overlap significant because the percentage of overlap does not exceed 10%.

We evaluate the preference properties of other features (PE-Window combinations) in these communities using a community representation index (CRI) for each feature. This index is defined as follows:


Definition 4. The community representation index (CRI) of the i-th feature is defined as

\[
\mathrm{CRI}(i) = \frac{f_i^{c}}{f_i^{e}} \div \frac{f_{base}^{e}}{N},
\]

where $f_i^{c}$ is the frequency of the i-th feature in the specific community; $f_i^{e}$ is the frequency of the i-th feature in the entire data set; $f_{base}^{e}$ is the frequency of the base feature in the entire data set; and $N$ is the size of the entire data set.

Note that the community representation index (CRI) shows whether a feature has a high or low density in a specific community compared to its density in the entire set. If CRI(i) = 1, the density of the feature in the community is the same as its density in the entire set. Thus, CRI(i) > 1 corresponds to the i-th feature being overrepresented in the community, while CRI(i) < 1 corresponds to underrepresentation. As can be seen in Tables 5, 6 and 7, GGGSSSGGGC in W1 has a high CRI in the community based on Sp1 in W1, and GGGSSSGGGC in W2 has a high CRI in the community based on Sp1 in W2, but GGGSSSGGGC in W2 has a low CRI in the community based on Inr in W2. Also, Inr in W1 has a low CRI in the community based on Sp1 in W1 as well as in the community based on Sp1 in W2, but Inr in W3 has a high CRI in the community based on Inr in W2. Furthermore, GATA in W1 and GATA in W2 have low CRIs in the community based on Sp1 in W2, SCGGAAG in W2 and SCGGAAG in W3 have low CRIs in the community based on Inr in W2, and c-myb in W2 and c-myb in W3 have high CRIs in the community based on Inr in W2.
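As an illustration of Definition 4, the following Python sketch computes the CRI from raw counts and from a binary gene-by-feature matrix of the kind used in this paper. The function names, the toy matrix, and the column indices are hypothetical and are not taken from the EPD or DBTSS data. The sketch only restates the definition, using the fact that the size of a community equals the frequency of its base feature in the entire data set.

```python
from typing import Sequence


def cri(f_ic: int, f_ie: int, f_base: int, n: int) -> float:
    """CRI(i) = (f_i^c / f_i^e) / (f_base^e / N), as in Definition 4."""
    if f_ie <= 0 or f_base <= 0 or n <= 0:
        raise ValueError("f_ie, f_base and n must be positive")
    return (f_ic / f_ie) / (f_base / n)


def cri_from_matrix(matrix: Sequence[Sequence[int]], base: int, feature: int) -> float:
    """Compute the CRI of one feature in the community of a base feature.

    matrix: rows are genes, columns are features, entries are 0 or 1.
    base, feature: column indices (hypothetical layout for this sketch).
    """
    n = len(matrix)
    # The community is the set of genes in which the base feature occurs.
    community = [row for row in matrix if row[base] == 1]
    f_base = len(community)
    f_ie = sum(row[feature] for row in matrix)      # frequency in the entire set
    f_ic = sum(row[feature] for row in community)   # frequency in the community
    return cri(f_ic, f_ie, f_base, n)


# Toy data: 8 genes, 3 features (made-up values for illustration only).
toy = [
    [1, 1, 0],
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [0, 0, 1],
    [0, 0, 1],
    [0, 1, 1],
    [0, 0, 1],
]

over = cri_from_matrix(toy, base=0, feature=1)   # feature 1 is overrepresented
under = cri_from_matrix(toy, base=0, feature=2)  # feature 2 is underrepresented
print(over, under)
```

In the toy matrix, the community of feature 0 contains half of the genes but three of the four occurrences of feature 1, so its CRI exceeds 1, while feature 2 occurs mostly outside the community and its CRI falls below 1.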

DISCUSSION

In this paper, we studied the occurrence patterns of features (combinations of promoter elements and their locations) using different definitions of possible relationships among them. Binary matrix representations for the occurrence patterns as well as for comparisons of features facilitated our analyses. We have defined and used nonzero and dominant comparison indexes to draw conclusions on the occurrence and non-occurrence relationships of pairs of features. Using these measures, we were able to identify a partitioning of a subset of the feature set that helped us conclude that the presence of Sp1 in [-100, -50], Inr in [-60, +10], and Sp1 in [-60, +10] implied the absence of several of the promoter elements we studied in three windows. Also, using these three features as base features, we were able to draw conclusions regarding the extent to which other features are represented in the core promoter region, using a measure which we called the community representation index for each feature. Based on our study and analysis, we strongly believe that some biological or chemical explanations for the patterns we have discovered will emerge. We also believe that the study of several communities coupled with reasonable features has great potential for the discovery of more interesting relationships such as those we found.



REFERENCES

Majewski, J. and Ott, J. (2002). Distribution and characterization of regulatory elements in the human genome. Genome Res. 12, 1827-1836.

Butler, J. E. and Kadonaga, J. T. (2002). The RNA polymerase II core promoter: a key component in the regulation of gene expression. Genes Dev. 16, 2583-2592.

Werner, T. (1999). Models for prediction and recognition of eukaryotic promoters. Mamm. Genome 10, 168-175.

Chen, B. S. and Hampsey, M. (2002). Transcription activation: unveiling the essential nature of TFIID. Curr. Biol. 12, 620-622.

Kadonaga, J. T. (2002). The DPE, a core promoter element for transcription by RNA polymerase II. Exp. Mol. Med. 34, 259-264.

Smale, S. T. and Kadonaga, J. T. (2003). The RNA polymerase II core promoter. Annu. Rev. Biochem. 72, 449-479.

Zhang, M. Q. (1998). Identification of Human Gene Core Promoters in Silico. Genome Res. 8, 319-326.

Suzuki, Y., Tsunoda, T., Sese, J., Taira, H., Mizushima-Sugano, J., Hata, H., Ota, T., Isogai, T., Tanaka, T., Nakamura, Y., Suyama, A., Salaki, Y., Morishita, S., Okubo, K. and Sugano, S. (2001). Identification and characterization of the potential promoter regions of 1031 kinds of human genes. Genome Res. 11, 677-684.

Victor X. J., Gregory AC. S., Francisco J. A., Sandya L. and Ramana V. D. (2006). Genome-wide analysis of core promoter elements from conserved human and mouse orthologous pairs. BMC Bioinformatics 7, 114-127.

Bajic, V. B., Choudhary, V. and Hock, C. H. (2004). Content analysis of the core promoter region of human genes. In Silico Biol. 4, 109-125.

Clifford E. L. (1999). Data Analysis by Resampling: Concepts and Applications. Duxbury Press.

Hughes, J. D., Estep, P. W., Tavazoie, S. and Church, G. M. (2000). Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. Journal of Molecular Biology 296, 1205-1214.

Stothard P. (2000). The Sequence Manipulation Suite: JavaScript programs for analyzing and formatting protein and DNA sequences. Biotechniques 28, 1102-1104.

Vasavada R. C, Wysolmerski J. J., Broadus A. E. and Philbrick W. M. (1993). Identification and characterization of a GC-rich promoter of the human parathyroid hormone-related peptide gene. Molecular Endocrinology 7, 273-282.

Zheng C. and James L. M. (2003). Core Promoter Elements and TAFs Contribute to the Diversity of Transcriptional Activation in Vertebrates. Molecular and Cellular Biology 23, 7350-7362.


ACKNOWLEDGMENTS

This research was made possible by NIH Grant Number 2P20 RR016479 from the INBRE Program of the National Center for Research Resources. Its contents are solely the responsibility of the authors and do not necessarily represent the official views of NIH.