Statistical Techniques for Examining Gene Regulation A thesis presented by Shane Tyler Jensen to The Department of Statistics in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the subject of Statistics Harvard University Cambridge, Massachusetts May 2004


Statistical Techniques for Examining Gene Regulation

A thesis presented

by

Shane Tyler Jensen

to

The Department of Statistics

in partial fulfillment of the requirements
for the degree of

Doctor of Philosophy
in the subject of

Statistics

Harvard University
Cambridge, Massachusetts

May 2004


©2004 - Shane T. Jensen
All rights reserved.


Thesis Advisor: Professor Jun S. Liu Shane Tyler Jensen

Statistical Techniques for Examining Gene Regulation

Abstract

Genes are often regulated in living cells by proteins called transcription factors

(TFs) that bind directly to short segments of DNA in close proximity to certain tar-

get genes. These short segments have a conserved appearance, which is called a

motif. The experimental determination of TF binding sites is expensive and time-

consuming. Many motif-finding programs have been developed but no program

is clearly superior in all situations, making it difficult to judge which of the motifs

predicted by these algorithms is biologically relevant.

This thesis provides a review of previous approaches to the problem of motif dis-

covery. We derive a comprehensive scoring function based on a full Bayesian

model, which can handle unknown site abundance, unknown motif width, and

two-block motifs with variable-length gaps. In addition, this scoring function for-

mulation enables us to objectively compare different predicted motifs and select

the optimal ones, effectively combining the strengths of existing programs.

An algorithm, BioOptimizer, is proposed to optimize a scoring function, thereby

reducing noise in the motif signal found by any motif-finding program. The accu-

racy of BioOptimizer, when used in conjunction with several existing programs,

is shown to be superior to any of these motif-finding programs alone when eval-

uated by simulation studies and real-data applications in bacteria.

We then propose a Bayesian hierarchical clustering model for the common struc-

ture between a set of discovered motifs. This clustering model is implemented,


using a Gibbs sampling strategy, on a dataset of 116 TF motifs and several ap-

proaches to analyzing the clustering results are discussed. A Uniform clustering

prior is also considered and is compared to the Dirichlet process prior. Our clus-

tering strategy is general enough to be appropriate and useful in a variety of other

statistical settings.

Finally, our techniques for motif discovery and motif clustering are used in combination to predict co-regulated genes in the bacterium Bacillus subtilis. Sequences

from several closely related species are used to discover motifs conserved by evo-

lution, and these conserved motifs are then used to cluster genes together into

putative co-regulated groups. This clustering is validated and examined in detail

using several external measures of cell regulation.


Contents

Title page
Abstract
Table of Contents
List of Figures
List of Tables
Acknowledgments

1 Introduction and Previous Work
  1.1 The Biology of Transcription Regulation
  1.2 Consensus Sequence Formulation
  1.3 Position-Specific Weight Matrix Formulation
  1.4 Motif Discovery for the PSWM Formulation
  1.5 Problems with Existing Motif Discovery Methods
  1.6 Modeling Motif Similarity by Clustering
  1.7 Combining Motif Discovery and Clustering

2 Bayesian Motif Discovery Models
  2.1 A Full Bayesian Model
  2.2 Markov Chain Monte Carlo Implementation
  2.3 Fixed Number of Sites in A
  2.4 Unrestricted Model for A
  2.5 Dealing with Multiple Motif Types
  2.6 Extensions of the Bayesian Motif Model
    2.6.1 Variable motif abundance p0
    2.6.2 Variable motif width w
    2.6.3 Two-Block Motifs

3 Scoring Function Optimization
  3.1 Bayesian scoring functions
  3.2 Non-Bayesian scoring functions
  3.3 Optimizing a scoring function
  3.4 Using Scoring Functions to Extend the Model
    3.4.1 Overlapping Motif Sites
    3.4.2 Unknown Motif Site Abundance
    3.4.3 Unknown Motif Width
    3.4.4 Two-Block Motifs
  3.5 Detecting Poor Motifs with the Null Score

4 Motif Discovery Results
  4.1 Simulation Comparison of Scoring Functions
  4.2 Real Data Comparison of Scoring Functions
  4.3 Simulation Comparison of Motif-Finding Programs
  4.4 Real Data BioOptimizer Evaluation: One-Block
  4.5 Real Data BioOptimizer Evaluation: Two-Block
  4.6 Using Different Motif Width Prior Distributions
  4.7 Special Restrictions on A in Real Data

5 Bayesian Motif Clustering Model
  5.1 Hierarchical Framework
  5.2 Clustering of Observations
  5.3 Gibbs Sampling Implementation
  5.4 Motif Alignment
  5.5 Clustering of Two-Block Motifs
  5.6 Advantages of our Clustering Model
  5.7 Comparison with Other Clustering Priors

6 Analyzing Motif Clustering Results
  6.1 Clustering Trees
  6.2 Best Clustering Partition
  6.3 Strength of Clusters
  6.4 Examining Particular Clusters in Detail
  6.5 Effect of Prior Specification on Clustering Results
  6.6 Effect of w on Clustering Results

7 Prediction of Co-Regulated Genes
  7.1 Collection of Orthologous Gene Sets
  7.2 Motif Discovery
  7.3 Clustering Genes Based on Discovered Motifs
    7.3.1 Validation of Gene Clusters
  7.4 Studyset Clustering Results
  7.5 Detailed Examination of Studyset Clusters
  7.6 Whole Genome Clustering Results
  7.7 Detailed Examination of Whole Genome Clusters

8 Discussion and Future Work

List of Figures

1.1 Sequence logo of a motif
1.2 Four different discovered motifs
2.1 Graphical representation of the motif discovery parameters
2.2 Graphical representation of a two-block motif
4.1 Sequence logo of known CRP sites
4.2 Comparison of different prior width penalty terms
5.1 Comparison of clustering statistics between DP and Uniform priors
6.1 Clustering tree for dataset based on a motif width of 8 bps
6.2 Sequence logos for clusters 1 and 2, with families
6.3 Clustering statistics between Uniform and DP models
6.4 Comparison of clustering trees between Uniform and DP models
6.5 Distribution of motif widths in dataset
6.6 Comparison of clustering trees using different motif widths
7.1 Microarray and sequence-based gene clustering procedures
7.2 Phylogenetic tree of seven related bacterial species
7.3 Flowchart for motif discovery procedure
7.4 Clustering tree for studyset joint-block motifs
7.5 Flowchart for studyset motif clustering procedure
7.6 Distribution of cluster sizes for studyset best partitions
7.7 Graph of connected studyset clusters
7.8 Flowchart for genome motif clustering procedure
7.9 Distribution of cluster sizes for whole genome partition
7.10 Graph of connected and significant whole genome clusters, part 1
7.11 Graph of connected and significant whole genome clusters, part 2

List of Tables

1.1 IUPAC nomenclature for consensus sequences
1.2 Matrix representations of a motif
4.1 Simulation comparison of scoring function optimizations
4.2 Comparison of scoring function optimizations on the CRP dataset
4.3 Simulation comparison of motif-finding programs
4.4 Comparison of motif predictions for one-block datasets
4.5 Comparison of motif predictions for two-block datasets
4.6 Performance of different motif width priors
6.1 Protein Families in Dataset
6.2 Best partition of clusters for dataset
6.3 Top five clusters for all three motif widths
7.1 Bacterial species included in the study
7.2 Orthologous gene pairs with B. subtilis
7.3 Sequence distributions for each dataset
7.4 Significant studyset predicted clusters
7.5 Genome clusters significant on multiple measures

Acknowledgments

This dissertation would not have been possible without the guidance and insight

of my advisor, Professor Jun S. Liu. Although I was able to match his enthusiasm, I was totally incapable of keeping up with all the ideas and possible directions that he would share with me. The result is that I still have a “TO-DO”

list that rivals my thesis in terms of length. I could not have asked for a more

supportive and generous advisor.

Many thanks go to Professor Donald Rubin for his advice and insight as well as

ample amounts of his excellent scotch. Also, thanks to Don, I will never again

attempt to present my implementation method before presenting my model. I

am also grateful to the rest of the statistics faculty for their teaching and helpful

discussions throughout my time here at Harvard.

Also essential to my thesis were the biological applications provided by Profes-

sor Richard Losick and his molecular biology group, especially Patrick Eichen-

berger. Many of the novel statistical techniques I present in this thesis evolved

from the interesting scientific questions posed by Rich and Patrick. I am thankful

for my other collaborators as well, with special thanks to Cristian Castillo-Davis

for helpful discussions and the use of his program GeneMerge, Lei Shen for his

computational assistance, and Xiaole Liu for use of her program BioProspector.

The single person who has borne the brunt of my stress and anxiety over the

last five years is Ms. Aline Normoyle. My work would not have been possi-

ble without her love, support and sense of humour. I also thank my family for

encouraging me and helping to keep my life in perspective.


I was very lucky to arrive at Harvard at the same time as the statistics ladies,

Liz Stuart and Sam Cook. I cannot even speculate where I would be without

their friendship and support. Hosung Kang, Gopi Goswami, and Byron Ellis

arrived the year after we did. All my friends have been patient with me through

five years of paranoid diatribes and the experiences we shared together were

seriously good times. I am also grateful to Jim Greiner, Mayetri Gupta, Nondas

Sourlas, Claudia Pedroza, and the rest of the students in the Harvard statistics

department for their friendship and statistics help.

I am very appreciative of the non-Harvard perspective of my friends Tal Nawy

and Azadeh Akhavan who have been incredibly supportive of me for as long as

I can remember. I also thank my Masters advisor, Professor George Styan, for

encouraging me to continue my graduate education.

Finally, I would like to thank the Boston Red Sox and New England Patriots for

providing me with endless distraction and, in the case of the Pats, inspiration

to succeed. The Red Sox, on the other hand, taught me that you don’t have to

succeed to still be entertaining, and I am, if nothing else, very entertaining.


Chapter 1

Introduction and Previous Work

1.1 The Biology of Transcription Regulation

The complete information that defines the characteristics of living cells within an

organism is encoded in the form of a moderately simple molecule, deoxyribonu-

cleic acid, or DNA. The building blocks of DNA are four nucleotides, abbrevi-

ated by their attached organic bases as A, C, G, and T. A-T and C-G are com-

plementary bases between which hydrogen bonds can form. A DNA molecule

consists of two long chains of nucleotides that are complementary to each other, joined by hydrogen bonds, and twisted into a double helix. This structure gives

rise to the term “base pair” when describing a DNA sequence. The specific order-

ing of these nucleotides, the “genetic code”, is the means by which information is

stored that completely defines all functions within a cell. With the recent development of high-throughput sequencing technology, the National Institutes of Health genetic sequence database, GenBank, has sustained an exponential growth rate since 1982. At the time of writing, GenBank contains the complete genomic sequences of over

1000 organisms (Benson et al., 2002) with approximately 22 billion DNA bases.

The central dogma of molecular biology dictates that certain segments of the


DNA (i.e., genes) are transcribed into another molecule, RNA, which serves as

a transient template to make the basic building blocks of cellular life, proteins.

Although all the cells in the same organism possess exactly the same DNA se-

quences (i.e., genetic information), they display different physiological character-

istics in different tissues, developmental stages, and environmental conditions.

This “differentiation” is caused by the differences among the collections of pro-

teins that are synthesized in different cells or at different cell states. If a protein

is being synthesized at a certain state, its coding DNA (called a gene) is termed “active” or “expressed”. Thus, a cell in a particular physiological state can be

roughly viewed as a mechanical system where each different gene is switched

either on (active) or off (inactive).

In many organisms, the DNA that codes for proteins (genes) is only a small por-

tion of the total genomic DNA. For example, genes make up only about 1.5% of

the human genome (IUPAC, 1986). The non-coding components of DNA, which

were initially considered as “junk” sequences, actually contain the control mech-

anisms for activating and deactivating the genes, and thus the synthesis and non-

synthesis of proteins. Most of the control sequences for a gene lie in the upstream

regulatory region, which is the region of a few hundred or thousand base pairs

directly before the gene. Transcribing or activating a gene requires not only the

DNA sequence in the upstream region, but also many proteins called transcription factors (TFs). When these TFs are present, they bind to specific DNA patterns

in the upstream sequence of genes, and either induce or repress the transcription

of these genes by recruiting other necessary proteins (Lodish et al., 1995).

One transcription factor can bind to many different upstream regions, thus regu-

lating the transcription of many genes. The binding sites of the same transcrip-


tion factor show a significant sequence conservation, which is often summarized

as a short (5-20 bases long) common pattern called a transcription factor binding

motif (TFBM) or binding consensus, although some variability is tolerated. It is

the main focus of this thesis to discover the locations of these motif sites and to

model the common patterns shared by different motifs.

In prokaryotes (lower organisms without nuclei), there are fewer TFs, their motifs

tend to be relatively long, and the strength of regulation for a particular gene

often depends on how closely a particular site matches the consensus for the motif.

The more mismatches to the consensus in a binding site, the less often the TF will

bind and therefore the less control it will exert on the target gene. The variability

between sites is sometimes crucial to the regulatory process, since TF binding

sites that are perfect matches to the optimal pattern would bind the TF too tightly,

preventing the subsequent steps of transcription (Pfahl, 1981).

In eukaryotes (higher organisms with nuclei), many more transcription factors

are involved in the regulation of a gene, and their binding motifs tend to be

shorter. Eukaryotic upstream regions usually contain regulatory modules, a col-

lection of adjacent binding sites (sometimes multiple binding sites) of several

transcription factors. Transcription regulation not only relies on the combina-

tion of the TFs involved, but also on the number of site copies in the upstream

regions (Werner, 1999).

Characterizing the motifs of TFs and locating TF binding sites are crucial tasks

for understanding how the cell regulates its genes in response to developmental

and environmental changes. However, the gold standard experimental proce-

dures to determine binding sites are inefficient, sometimes impractical, and can

only discover one transcription factor binding site at a time. With the availabil-


ity of complete genome sequences, biologists are using techniques such as DNA

microarray (Schena et al., 1995) or serial analysis of gene expression (Velculescu

et al., 1995) to measure the expression level of every gene in an organism in vari-

ous conditions.

A genome can be divided into gene clusters according to similarities in their gene

expression (Eisen et al., 1998). Genes in the same expression cluster respond simi-

larly to environmental and developmental changes and thus may be co-regulated

by the same TF or the same group of TFs. Therefore, our computational analysis can be focused on the search for TF binding sites in the upstream regions of genes

contained in a particular cluster.

Another experimental procedure called Chromatin Immuno-Precipitation follow-

ed by microarray (ChIP-array or ChIP-on-chip) can measure where a particular

TF binds to DNA in the whole genome, although at a coarse resolution of 1-2

kbps. Again, computational analysis is required to pinpoint the short binding

sites of a transcription factor from all the long TF binding targets.

A focused version of these experiments involves a comparison between normal

(“wild-type”) organisms in a particular species and mutant organisms that have had a specific regulatory protein “knocked out” of their genome. These mutant organisms cannot produce this particular regulatory protein of interest, and so

whichever genes are normally regulated by this protein will not be regulated in

these mutant organisms. Thus, any genes that show large differences in gene

expression (as measured by DNA microarrays) are considered as possible targets

of this regulatory protein.

However, in practice it is often difficult to measure differential expression from

microarray data (Tseng et al., 2001), and often arbitrary expression thresholds are


used to classify genes as either differentially regulated or not. As a consequence,

the set of genes that is used in order to search for TF binding sites can contain

several “false-positive” sequences corresponding to genes that were judged to be

under the control of a protein of interest, but in reality are not. Thus, the dis-

covery of TF binding sites can serve as an important validation technique when

attempting to elucidate the set of genes controlled by a particular protein.

With the ever expanding number of whole genomes sequenced and high through-

put gene expression and protein-DNA binding data, motif finding and transcrip-

tion regulatory network elucidation have become major research topics in com-

putational biology.

There are two ways of discovering novel binding sites of a TF: scanning meth-

ods and de novo methods. In a scanning method, one uses a motif representation

resulting from experimentally determined binding sites to scan the genome se-

quence to find more matches. In de novo methods, one attempts to find novel

motifs that are “enriched” in a set of upstream sequences. This thesis focuses on

the latter class of methods. The de novo methods can also be divided into two

classes, according roughly to two general data formulations for representing a

motif: the consensus sequence or a position-specific weight matrix (PSWM).

1.2 Consensus Sequence Formulation

The consensus sequence shows the motif as a string of IUPAC (1986) characters as

shown in Table 1.1. For example, the Mse motif consensus CRCAAAW suggests

that the Mse protein binds to sites starting with a C, followed by A or G, followed

by CAAA, and followed by A or T. In this section, we use word and segment inter-


changeably to mean a short DNA sequence being tested by our motif model as a

potential binding site. When scanning a set of sequences against a consensus, all

words matching the consensus are considered putative binding sites. This some-

times results in many false positive sites, and it may miss some true sites with

variability that is not represented by the consensus sequence.

Table 1.1: IUPAC nomenclature for consensus sequences

A  Adenine                     C  Cytosine
G  Guanine                     T  Thymine
R  Purines (A,G)               Y  Pyrimidines (C,T)
W  Weak hydrogen bond (A,T)    S  Strong hydrogen bond (C,G)
M  Amino Group (A,C)           K  Keto Group (G,T)
B  not A (C,G,T)               D  not C (A,G,T)
H  not G (A,C,T)               V  not T (A,C,G)
N  any (A,C,G,T)

Early research on discovering motifs was usually simplified to finding a sequence

pattern enriched or over-represented in the sequence dataset compared to the

genome background. Therefore, many computational algorithms for finding mo-

tif consensus sequences adopted a “pattern-driven” or “word enumeration” ap-

proach by enumerating predefined consensus patterns to see which is signifi-

cantly enriched in the sequence dataset.

The first consensus sequence enumeration method was developed by Galas et al. (1985) to search for a TATA-box motif that appears once in each upstream region.

They first align all the upstream sequences at the transcription start site. Then for

every aligned position, they search in the 9-base windows centered at that position of all the sequences. In this window, every possible pattern $b_i$ of width 6 is scored according to

$$S(b_i) = \tfrac{6}{6}\, q_{i6} + \tfrac{5}{6}\, q_{i5} + \tfrac{4}{6}\, q_{i4},$$

where $q_{ik}$ is the number of sequences whose best matching 6-mer (subsequence of length 6) to $b_i$ in the 9-base window has $k$ matched positions. The highest scoring pattern is considered as a potential motif and the positions corresponding to this are considered potential binding locations.
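The scoring rule above can be sketched in a few lines of Python. This is only an illustration of the formula as stated here, not the original implementation of Galas et al.; the function names and the toy windows in the usage note are hypothetical.

```python
from itertools import product


def best_match(pattern: str, window: str) -> int:
    """Largest number of matching positions between the pattern and any
    substring of the window that has the same length as the pattern."""
    w = len(pattern)
    return max(
        sum(p == c for p, c in zip(pattern, window[i:i + w]))
        for i in range(len(window) - w + 1)
    )


def galas_score(pattern: str, windows: list[str]) -> float:
    """S(b) = (6/6) q_6 + (5/6) q_5 + (4/6) q_4 for a width-6 pattern,
    where q_k counts the windows whose best match has k matched positions."""
    w = len(pattern)
    q = {k: 0 for k in range(w + 1)}
    for window in windows:
        q[best_match(pattern, window)] += 1
    return sum((k / w) * q[k] for k in (w, w - 1, w - 2))


def best_pattern(windows: list[str], width: int = 6) -> str:
    """Enumerate all 4**width patterns and return the highest-scoring one."""
    patterns = ("".join(p) for p in product("ACGT", repeat=width))
    return max(patterns, key=lambda b: galas_score(b, windows))
```

For instance, with the three hypothetical 9-base windows `["GGTATAAAG", "CCTATAATC", "ATATAAACC"]`, the pattern `TATAAA` matches two windows exactly and one window in five positions, so its score is 2 + 5/6.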

In most motif finding problems, the binding site locations are unknown and

their distances from the transcription start site vary extensively. Therefore, oligo-

analysis (van Helden et al., 1998) was developed to find sequence patterns en-

riched in the whole upstream region. This method enumerates every possible

pattern $b_i$ of a certain width to determine whether it occurs in the dataset more often than expected. Sinha and Tompa (2000) later extended this method to allow for one-base mismatches and to use the IUPAC alphabet to find motifs with more flexible base substitutions. To speed up computation, Sinha and Tompa calculated the mean and variance of the number of occurrences of $b_i$ and determined its significance by a Z-test. Their calculations were based on a third-order Markov

model for non-coding sequences in the genome. As shown in Liu et al. (2001),

the Markov model discriminates against meaningless patterns such as AAAA or

ATAT that are frequently found in the non-coding sequences and therefore in-

creases the specificity of the discovered motifs.
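The idea of testing a word count against its expectation can be illustrated with a deliberately simplified sketch. It assumes an independent (zeroth-order) background and a binomial variance that ignores overlapping occurrences, so it is not the third-order Markov calculation of Sinha and Tompa; the function name and arguments are hypothetical.

```python
from collections import Counter
from math import prod, sqrt


def kmer_zscores(sequences, k, background):
    """Z-score of each observed k-mer count against an i.i.d. background.

    background maps each base to its assumed background frequency.
    """
    counts = Counter(
        seq[i:i + k] for seq in sequences for i in range(len(seq) - k + 1)
    )
    # Number of positions at which a k-mer can start.
    n_positions = sum(max(len(seq) - k + 1, 0) for seq in sequences)
    z = {}
    for word, observed in counts.items():
        p = prod(background[base] for base in word)
        mean = n_positions * p
        # Binomial approximation: treats start positions as independent.
        var = n_positions * p * (1 - p)
        z[word] = (observed - mean) / sqrt(var)
    return z
```

Replacing the product of base frequencies with a Markov-chain word probability would bring the sketch closer to the approach described above, at the cost of a more involved variance calculation.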

The time to enumerate all possible consensus patterns increases exponentially as

the pattern width increases, so finding longer motif patterns is a challenge. Since

many long motifs are more conserved near the two ends, van Helden et al. (2000)

proposed to detect long motifs as spaced dyad patterns such as $w_1 \cdot n_s \cdot w_2$, where $w_1$ and $w_2$ are the dyad motif words with short enough widths, and $n_s$ is the $s$-base spacer of unspecified sequence. The expected occurrences of a spaced dyad can either be calculated from the joint distribution of $w_1$ and $w_2$ assuming that $w_1$ and $w_2$ are conditionally independent, or by counting $w_1 \cdot n_s \cdot w_2$ occurrences in the whole genome non-coding sequences.

Another method encodes nucleotides using a 2-bit binary number instead of an 8-bit character, and converts the sequence into a much shorter array for quick access (Hampson et al., 2000). A third method uses a suffix tree to represent all patterns of all widths that exist in the whole genome non-coding regions (Brazma et al., 1998). Each node contains a sequence pattern that reflects the path from the root to the node, and stores information of the count and location of all the sequences matching that pattern. In addition, each node can branch into A, C, G, and T to form patterns one base longer. Although building the full tree is extremely time and memory intensive, one can trim many “rare” nodes to speed up tree-building. Keich and Pevzner (2002) introduce models for more refined consensus pattern searching, which are useful in the case of very subtle motifs.
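The 2-bit encoding mentioned above can be illustrated as follows. This is a minimal sketch of the general idea (two bits per base, with the packed integer used directly as an array index), not the implementation of Hampson et al.; the base-to-code assignment and function names are assumptions.

```python
# Assumed assignment of 2-bit codes to bases.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode(word: str) -> int:
    """Pack a DNA word into an integer using 2 bits per base."""
    value = 0
    for base in word:
        value = (value << 2) | CODE[base]
    return value


def count_words(sequence: str, k: int) -> list[int]:
    """Count every k-mer in one pass; the rolling 2-bit code of the
    current k-mer is the index into a table of 4**k counters."""
    counts = [0] * (4 ** k)
    mask = (1 << (2 * k)) - 1  # keeps only the last k bases
    value = 0
    for i, base in enumerate(sequence):
        value = ((value << 2) | CODE[base]) & mask
        if i >= k - 1:  # a full k-mer has been read
            counts[value] += 1
    return counts
```

Looking up the count of a word is then a single array access, `count_words(sequence, k)[encode(word)]`, with no string comparison.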

A recent method called MobyDick builds longer motifs by concatenating shorter

ones (Bussemaker et al., 2000). MobyDick models the sequence dataset as being

generated by concatenations of words drawn independently from a dictionary

with their respective “usage” frequencies. The initial motif dictionary contains

individual bases A, C, G and T, with their frequencies estimated from genome

non-coding sequences. Longer patterns are formed by adding into the dictionary

those concatenated word pairs that have occurred more than expected (e.g., “CG”

would be treated as a new word if its occurrence is significantly more than what

is expected from the independent pairing). The frequencies are re-estimated for

all the words in the new dictionary to maximize the likelihood of generating the

sequence dataset. The process is repeated until no new words can be added. This

method has recently been generalized to a stochastic dictionary model (Gupta

and Liu, 2003).


1.3 Position-Specific Weight Matrix Formulation

An alternative motif formulation is a position-specific weight matrix (PSWM) or

simply motif matrix, which measures the desirability of each base at each position

of the motif. The simplest matrix is an alignment matrix $N_{jk}$, which records the number of occurrences of base $k$ at position $j$ across all the aligned sites for this motif (Table 1.2). Also shown in Table 1.2 are the corresponding frequency matrix ($f_{jk} = N_{jk}/N$), where $N$ is the number of motif sites, and the weight matrix $\log[f_{jk}/\theta_{0k}]$ (Hertz and Stormo, 1999), where $\theta_{0k}$ is the proportion of base $k$ in the non-motif (background) positions.

Table 1.2: Matrix representations of a motif

       Alignment matrix       Frequency matrix            Weight matrix
Pos     A   C   G   T        A     C     G     T        A     C     G     T
 1      0   4   7   1      0.00  0.33  0.58  0.08     -2.6   0.3   0.8  -1.0
 2      2   1   8   1      0.17  0.08  0.67  0.08     -0.4  -1.0   0.9  -1.0
 3      0   0  12   0      0.00  0.00  1.00  0.00     -2.6  -2.6   1.3  -2.6
 4     12   0   0   0      1.00  0.00  0.00  0.00      1.3  -2.6  -2.6  -2.6
 5      0   0   0  12      0.00  0.00  0.00  1.00     -2.6  -2.6  -2.6   1.3
 6      0   0   0  12      0.00  0.00  0.00  1.00     -2.6  -2.6  -2.6   1.3
 7     12   0   0   0      1.00  0.00  0.00  0.00      1.3  -2.6  -2.6  -2.6
 8      6   1   2   3      0.50  0.08  0.17  0.25      0.7  -1.0  -0.4   0.0
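The three matrices in Table 1.2 can be reproduced from the aligned counts. The sketch below assumes natural logarithms, a uniform background of 0.25 per base, and a pseudocount of 0.25 added to each count before taking the logarithm. These settings are inferred from the tabulated numbers (they reproduce the entries to one decimal place) rather than stated in the text, and the function names are illustrative.

```python
from math import log

BASES = "ACGT"

# Alignment (count) matrix from Table 1.2: one row per motif position,
# columns in the order A, C, G, T.
COUNTS = [
    [0, 4, 7, 1],
    [2, 1, 8, 1],
    [0, 0, 12, 0],
    [12, 0, 0, 0],
    [0, 0, 0, 12],
    [0, 0, 0, 12],
    [12, 0, 0, 0],
    [6, 1, 2, 3],
]


def frequency_matrix(counts):
    """f_jk = N_jk / N, where N is the number of aligned sites."""
    return [[n / sum(row) for n in row] for row in counts]


def weight_matrix(counts, background=(0.25, 0.25, 0.25, 0.25), pseudo=0.25):
    """log[f_jk / theta_0k], with a pseudocount (an assumption) so that
    bases never observed at a position still get a finite weight."""
    weights = []
    for row in counts:
        total = sum(row) + pseudo * len(row)
        weights.append(
            [log((n + pseudo) / total / b) for n, b in zip(row, background)]
        )
    return weights
```

Without the pseudocount, any base with a zero count would have weight minus infinity, which is why the table shows a finite value such as -2.6 in those cells.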

Schneider and Stephens (1990) used the position-specific weight matrix to con-

struct a Sequence Logo as a means by which to visualize the appearance of the mo-

tif. Figure 1.1 gives the sequence logo corresponding to the matrix in Table 1.2.

The height of each position is equal to its information content ($\sum_k f_{jk} \log[f_{jk}/\theta_{0k}]$) and the size of each letter is proportional to the letter’s relative frequency.

[Figure 1.1: Sequence logo of a motif. Vertical axis: 0 to 2; horizontal axis: motif positions 1 to 8.]

A formal statistical model for the position-specific weight matrices was described in Lawrence and Reilly (1990) and a complete Bayesian method was given in Liu (1994) and Liu et al. (1995). In this model, the sequence data is represented

as an array S, where Sij is the base in position j of sequence i. Each base can

take on K = 4 different values corresponding to the nucleotides A, C, G, and

T. To reflect the fact that the motif sites within S are substrings of length w that

are conserved relative to each other, we model them as independent realizations

from a common Motif model. That is,

(r1, . . . , rw) ∼ ProductMultinomial(Θ = (θ1, θ2, . . . , θw))

if (r1, . . . , rw) is an observed motif site in S, where θj = (θjA, θjC, θjG, θjT) is a

probability vector for the preference of the four nucleotide types in position j.

This model means that, for example, the motif site “TTACTAA” is generated with

probability θ1T θ2T θ3Aθ4Cθ5T θ6Aθ7A.
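
As an illustration of the product-multinomial calculation, the sketch below multiplies one entry of Θ per motif position. It is only a minimal example: the helper `consensus_theta` and the value 0.7 for the preferred base are hypothetical and are not estimates from any data in this thesis.

```python
def site_probability(site, theta):
    """Probability of `site` under a product-multinomial motif model.

    `theta` is a list of per-position dictionaries mapping base -> probability.
    """
    if len(site) != len(theta):
        raise ValueError("site length must equal the motif width")
    prob = 1.0
    for base, column in zip(site, theta):
        prob *= column[base]
    return prob


def consensus_theta(consensus, match=0.7):
    """Hypothetical motif matrix: `match` on the consensus base, the rest shared equally."""
    other = (1.0 - match) / 3.0
    return [{b: (match if b == c else other) for b in "ACGT"} for c in consensus]


theta = consensus_theta("TTACTAA")
p_match = site_probability("TTACTAA", theta)    # theta_1T * theta_2T * ... * theta_7A
p_one_off = site_probability("TTACTAC", theta)  # last position deviates from the consensus
print(p_match, p_one_off)
```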

The remainder of the sequences are classified as nonsites or background, for

which the simplest model is the multinomial distribution with the “null” fre-

quency θ0 = (θ0A, . . . , θ0T ). Since the motif sites are only a tiny fraction of the

whole sequence data, we can estimate θ0 first (e.g., direct counting of the 4 nu-

cleotide types) and subsequently treat it as known. It has been shown recently

that using a Markov chain to model the nonsite positions can improve the motif

specificity (Liu et al., 2001).

From the alignment of a set of binding sites, we can easily derive a frequency

matrix fjk, which is the MLE of θjk, and the weight matrix given in Table 1.2.

These matrices can be used to scan the whole genome sequence, by computing

for each segment the likelihood of that segment being generated from the motif

model, to discover novel realizations of the binding motif. This strategy tends to

be more accurate in capturing the correct sites than using the matching criterion

based upon the consensus sequence formulation.
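
The scanning strategy can be sketched as follows. This is a hedged illustration rather than any published scanner: each segment is scored by its log-likelihood under the motif model minus its log-likelihood under a background model, which is equivalent to summing weight matrix entries. The motif matrix, the uniform background, and the toy sequence are all made up for the example.

```python
import math


def log_odds_scan(sequence, theta, background):
    """Score every w-mer: log P(segment | motif) - log P(segment | background)."""
    w = len(theta)
    hits = []
    for start in range(len(sequence) - w + 1):
        segment = sequence[start:start + w]
        score = sum(
            math.log(theta[j][base] / background[base])
            for j, base in enumerate(segment)
        )
        hits.append((start, segment, score))
    return hits


def smoothed_theta(consensus, match=0.85):
    """Hypothetical motif matrix with no zero entries, so every log is finite."""
    other = (1.0 - match) / 3.0
    return [{b: (match if b == c else other) for b in "ACGT"} for c in consensus]


theta = smoothed_theta("GATTA")
background = {b: 0.25 for b in "ACGT"}
hits = log_odds_scan("ACGTCGGATTACCGT", theta, background)
best = max(hits, key=lambda h: h[2])
print(best[0], best[1])  # start index (0-based) and segment with the highest score
```

In practice a score threshold, rather than the single best hit, would decide which segments are reported as putative sites.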

1.4 Motif Discovery for the PSWM Formulation

In a majority of gene regulation analysis problems, we know neither the locations

of the motif sites nor the motif pattern (i.e., Θ or an estimate of it). Thus, we need

to simultaneously estimate the motif matrix and locate the possible motif sites in

the sequence data. A particularly successful class of computational algorithms

for this problem adopts a “data-driven” or “matrix update” approach based ei-

ther on the EM algorithm (Dempster et al., 1977) or Gibbs sampling (Geman and

Geman, 1984). These methods typically initiate a motif matrix randomly and use

the sequence dataset to gradually refine the motif. It is the focus of Chapters 2-3

to give an overview and extension of this class of algorithms, providing for them

a rigorous Bayesian foundation, and to discuss possible improvements.

The first algorithm for discovering novel motifs was Consensus (Stormo and

Hartzell, 1989). Assuming that each sequence contains one motif site, the algo-

rithm starts by examining all possible locations of the motif sites in the first two

sequences (a total of (n1 − w + 1)(n2 − w + 1) comparisons), and chooses the top

X pairs of motif sites according to the relative entropy scores of their corresponding

motif matrix, where the score is defined as

$$\psi_{ENT} = \sum_{j=1}^{w} \sum_{k=A}^{T} f_{jk} \log \frac{f_{jk}}{\theta_{0k}},$$

where fjk is the frequency matrix and log(fjk/θ0k) is the weight matrix entry given in

Table 1.2.
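
The relative entropy score is simple to compute from a count matrix. The sketch below is an illustration only, not the Consensus implementation; it assumes the usual convention that a term with fjk = 0 contributes zero, and it uses natural logarithms.

```python
import math


def relative_entropy_score(counts, background):
    """Sum over positions j and bases k of f_jk * log(f_jk / theta_0k).

    `counts` is a w x 4 alignment matrix; terms with f_jk = 0 are taken as 0.
    """
    total = 0.0
    for row in counts:
        n_sites = sum(row)
        for n_k, theta0 in zip(row, background):
            if n_k > 0:
                f = n_k / n_sites
                total += f * math.log(f / theta0)
    return total


uniform = [0.25] * 4
conserved_column = [[12, 0, 0, 0]]   # every site has the same base
flat_column = [[3, 3, 3, 3]]         # indistinguishable from a uniform background
print(relative_entropy_score(conserved_column, uniform))
print(relative_entropy_score(flat_column, uniform))
```

A perfectly conserved column scores log 4 against a uniform background, and a column that matches the background scores zero, which is why the score rewards conserved alignments.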

Later, another scoring function was deduced to estimate the p-value of each mo-

tif, which is the probability of observing a motif from random alignment of the

same size that scores equally or higher (Hertz and Stormo, 1999). Only motifs

with high information content or low p-value are retained, and each is aligned

with every possible w-mer (subsequence of length w) in the third sequence to

form a set of new matrices and the top K matrices are retained. The algorithm

cycles through all the sequences in the same fashion and the best-scoring motifs

are reported at the end as potential TF binding motifs. When there are more motif

sites in the first few sequences in the dataset, especially the first two sequences,

Consensus is effective. Otherwise, a number of runs using different sequence

orders are needed.

Lawrence and Reilly (1990) developed another matrix motif discovery algorithm

based on a missing data formulation, which will be detailed in the next chapter,

and the EM algorithm (Dempster et al., 1977). The original algorithm restricts

each sequence to contain one TF site. A later method called Meme (Bailey and

Elkan (1994); Grundy et al. (1996)) overcomes this limitation by introducing a

prior probability for every position to be the start of a motif site. The algorithm

also uses every existing w-mer in the sequence data set to initialize the EM it-

eration, thus improving the convergence properties of the original method of

Lawrence and Reilly (1990).

About the same time, a Bayesian method and several related Gibbs sampling al-

gorithms for motif discovery were also developed (Lawrence et al. (1993); Liu

(1994); Liu et al. (1995)). These Bayesian approaches, together with powerful

Markov chain Monte Carlo tools, offer greater modeling and computational

flexibility. For example, many new methods have been explored to extend the

functionality of Gibbs sampling. Gibbs Motif Sampler incorporates a prior probability

of motif occurrence in the sampling, thus allowing a variable number of motif sites in

each input sequence (Liu et al., 1995). By only considering the k positions out of

w in the motif with the richest information content, it allows the motif to contain

small gaps.

AlignAce continues to improve the Gibbs Motif Sampler by iteratively mask-

ing out aligned sites in order to find multiple different motifs (Roth et al., 1998).

BioProspector is a Gibbs sampler that uses a Markov model estimated from the

whole genome non-coding sequences to represent the non-motif background in

order to improve the motif specificity (Liu et al., 2001). It can also find motifs that

have two conserved blocks separated by a non-conserved gap of variable length.

All of these procedures are more or less statistically formulated, in contrast to the

word enumeration methods of van Helden et al. (1998), van Helden et al. (2000),

Sinha and Tompa (2000), Hampson et al. (2000), and Brazma et al. (1998).

Algorithms based on word matches are usually exhaustive in finding motifs, but

are limited by the maximum width of the motif that can be enumerated. Pro-

grams based on matrix update algorithms can find motifs of any specified width,

but none can guarantee convergence or a globally optimal motif. To strike a bal-

ance of the two, a recent algorithm MDscan (Liu et al., 2002) first uses a word

enumeration method to search motifs from the top L sequences that biologists

are most confident contain the motif. Using every existing w-mer in these se-

quences as a seed, MDscan finds all w-mers in the L sequences that are similar to

the seed and constructs from them a motif matrix. All the motif matrices are eval-

uated by a semi-Bayesian scoring function and the best ones are further refined

using all the sequences in the data set. When the motif is weak and the data are

noisy, searching for motifs first from sequences with high signal to background

ratio increases the chance of success.

An extensive presentation of Bayesian motif discovery models as well as possible

model extensions is given in Chapter 2.

1.5 Problems with Existing Motif Discovery Methods

These algorithms are all fairly fast, easy to use, and reasonably accurate, although

their relative performances do vary depending on the real-data situation since

each is implementing a different model. In addition, the results from stochastic

motif-finding algorithms can vary between independent runs. When these al-

gorithms do give different motif predictions, practitioners have a difficult time

deciding which results are “best” for a real data situation.

In addition, each model has certain limitations, for example, the need to input a

site abundance parameter, restrictions on the number of sites per sequence, or a

fixed motif width.

In Chapter 3, we present a scoring function optimization approach that provides

a principled means by which to compare and improve motif site predictions from

previous motif-finding algorithms. The scoring function approach has the advan-

tage of being simple to understand as well as easy to implement and extend to

eliminate several tenuous assumptions. Our procedure, implemented in the program

BioOptimizer, can be used in conjunction with any motif-finding program

currently available and can be used to compare different prediction results.

In Chapter 4, we demonstrate improved motif-finding accuracy for BioOptimizer

over other motif-finding programs in both simulation and real-data studies. We

also show that BioOptimizer can provide extra flexibility compared to other motif-

finding programs, e.g., inferring the motif site abundance parameter and the mo-

tif width, and allowing for motifs consisting of two conserved blocks separated

by a variable-length gap of non-conserved nucleotides.

1.6 Modeling Motif Similarity by Clustering

Although the discovery and characterization of a single motif is often the goal of

a particular biological investigation, it is common for biologists and statisticians

to be interested in examining the similarities and differences between an entire

collection of discovered TF motifs. Figure 1.2 shows the sequence logos for four

motifs that resulted from four separate motif discovery procedures.

[Sequence logos for four motifs, each plotted in bits (0 to 2) against motif position: Tal1beta-E47S, AGL3, ARNT, and MEF2]

Figure 1.2: Four different discovered motifs

It is clear that there exists some similarity or common structure between some of

these discovered motifs, and one could argue for grouping Tal1beta-E47S and

ARNT together (based on the common CA and TG positions) along with a possible

grouping for AGL3 and MEF2 (based on the final four ATAG positions).

However, this grouping strategy is based on ad-hoc personal judgement.

The statistical problem of interest here is to model the common structure between

these different motifs and find a principled means by which to group, or cluster

motifs together based upon their similarity.

This general idea of grouping similar motifs has been applied by Gordon et al.

(2004), where structural and biochemical database information is used to group

motifs in order to further improve motif discovery. However, this method is

labor-intensive and requires substantial additional information beyond just se-

quence information. We desire a statistical approach that utilizes only the se-

quence data available in our usual motif discovery setting.

There are several traditional statistical techniques for clustering observations to-

gether which are reviewed in Hartigan (1975). Hierarchical Tree Clustering joins ob-

servations together into successively larger clusters based upon some sort of sim-

ilarity measure. K-means Clustering groups observations into a pre-determined

number of clusters by minimizing a within-cluster distance measure.

Each of these techniques has elements that are not ideally suited for our desired

goal of motif clustering. Hierarchical tree clustering requires the user to specify

a distance metric between the observations (in this case, motif matrices), and it

is not clear for comparing motifs what type of simple distance metric should be

used. In addition, the result of this algorithm is a tree that joins all observations

together, and it is not clear where the tree should be “cut” in order to produce

a set of clusters. K-means clustering is useful in situations where the number of

clusters is known a priori, but this is also not the case here, since we have very

little idea how many motifs might cluster together in a particular collection of

motifs.

In addition, these techniques consider the given observations as fixed and known,

which is not the case for our applications, where each motif in a collection is only

an estimate generated by whatever motif discovery procedure was used.

Recognizing that our discovered motifs themselves are estimated parameters, we

need to model both within-motif and between-motif variability. In Chapter 5, we

outline a Bayesian hierarchical clustering model that encompasses both levels of

variability and does not require prior knowledge of the number of clusters.

We present an implementation of this model based upon Gibbs sampling. In

addition to eliminating the clustering problems mentioned above, our stochastic

implementation strategy allows us to examine not only optimal clustering results,

but also the variability in those clustering results.

In Chapter 6, we present various techniques for summarizing and understand-

ing the results from our clustering procedure, with an application to a dataset

containing 116 TF motifs.

1.7 Combining Motif Discovery and Clustering

Biologists are interested in inferring the regulation network of a living cell by

deducing the sets of genes that are co-regulated by specific transcription factors.

As mentioned in Section 1.1 above, microarray data is often collected in order

to deduce co-regulated clusters of genes based on similarity of gene expression

patterns. Often, gene-knockout experiments are used to detect genes regulated

by a specific known transcription factor, as in Eichenberger et al. (2003) and Molle

et al. (2003). In other studies, such as Eisen et al. (1998), genes are grouped into

co-regulated clusters based on similarity of gene expression patterns over a time-

series of experiments.

In either case, microarray data collection is expensive and time-consuming. As

well, gene expression studies are restricted to a limited number of species for

which microarray chips have been designed. For these reasons, it would be

very beneficial to biologists to develop computational techniques for inferring

sets of co-regulated genes that avoid the need for gene expression information

and instead utilize more widely available data sources, namely genomic DNA

sequence information.

When the sequence information from several closely-related species is available,

an alternative motif discovery strategy is to look for binding sites that are con-

served between sets of orthologous genes across different species, rather than

across different genes within the same species. Genes which are orthologous to

each other produce proteins with the same function in their respective species

and therefore are probably regulated in a similar fashion.

This strategy, often referred to as “phylogenetic footprinting”, is based upon the

idea that subsets of genomic DNA that are biologically important are more likely

to be conserved by evolution. Genes are examples of DNA sequences which are

usually conserved by evolution, since the proteins which are produced by these

genes can be drastically altered by changes in the gene sequence. Similarly, we

expect that transcription factor binding sites will be conserved by evolution, since

any changes to the binding site sequence that alter the ability of the TF to bind

could have a dramatic effect on gene regulation.

Phylogenetic footprinting has the advantage that clusters of co-regulated genes

do not have to be inferred beforehand (e.g., by microarray data), since we are

looking for motifs that are conserved across species instead of across genes within

a single species. The disadvantage of this method is that the complete sequence

information from several related species must be known and orthologous genes

within these species must be identified.

Fortunately, the genomes of many related bacterial species have been completely

sequenced and are available publicly from the National Center for Biotechnology

Information (www.ncbi.nlm.nih.gov). McCue et al. (2001) used the sequence

information from 9 bacterial species to identify TF-binding sites in Escherichia coli.

Our organism of interest is the bacterium Bacillus subtilis, which also has several

related species for which complete genome information is available.

Building on top of the concept of phylogenetic footprinting is the idea that many

of the motifs discovered within each of these orthologous gene sets will be

similar enough in appearance that we will be able to group them into clusters. If the

motifs found upstream of several Bacillus subtilis genes are similar enough to be

clustered together, then it is possible that the same TF (recognizing that common

motif) is targeting each of the genes in that cluster.

Thus, by combining statistical techniques for both motif discovery and motif clus-

tering, we can infer potentially co-regulated gene clusters solely on the basis of

the sequence information from several closely-related species.

Qin et al. (2003) used a similar framework that combines motif discovery with

clustering, together with the motifs discovered by McCue et al. (2001), to cluster Escherichia

coli genes into potentially co-regulated clusters. Their motif-discovery procedure

was restricted to single-block motifs with fixed width even though many bacterial

transcription factor binding motifs consist of two blocks with a variable-length

gap. As well, very little external information was used to validate the gene clus-

ters which were inferred by their procedure. Wang and Stormo (2003) introduce

an algorithm, Phylocon, which combines sequence information between related

species with sequence information between co-regulated genes within a single

species to improve motif discovery. Although it was not their intended goal,

their general framework (comparing motifs between genes that were discovered

by cross-species sequence comparison) is similar to our strategy for inferring co-

regulated genes.

As a final application of our improved methods for motif discovery (Chapters 2-4)

and motif clustering (Chapters 5-6), we will use a combined procedure to predict

co-regulated genes in the bacterium Bacillus subtilis. Our model extensions for

motif discovery permit us to consider not just single-block motifs

but also two-block motifs with a variable-length gap. As well, in

addition to an optimized motif signal (Chapter 3), we allow for variable motif

width and unknown motif abundance. Our clustering model is designed to allow

for both one-block and two-block motifs, which will also enable us to study the

interaction between clusters that are formed based upon either one-block or two-

block discovered motifs.

In addition, using B. subtilis as our target organism means that we can utilize

external information about gene regulation in B. subtilis to validate our inferred

gene clusters. We will use four different validation measures, based upon gene

expression data, functional classifications, and known TF interactions, to test our

gene clusters for biological relevance.

Chapter 2

Bayesian Motif Discovery Models

2.1 A Full Bayesian Model

As in Chapter 1, we let S denote the set of sequences under investigation, where

each Sij takes value in an alphabet of size K (K=4 for DNA sequences). Within

S, we postulate that there are substrings (r1, . . . , rw) of length w that are sites of

an unknown motif model.

The locations of these sites are unknown, so we introduce a missing array of

indicators A, where Aij is either one or zero indicating whether or not position j

in sequence i is the starting point of a motif site.

The composition of the motif is represented by the frequency matrix Θ, where

θjk is the frequency of nucleotide k in column j of the motif. The nucleotide

composition of the background (portions of the sequences that are not motif sites)

is represented by the vector θ0 where θ0k is the frequency of nucleotide k in the

background. A graphical representation of these quantities is given in Figure 2.1.

Sequence Data S                Site Indicators A              Motif Θ
aaaacatcgatacctacttttggtcgt    000000000001000000000000000    θ1a θ2a · · · θwa
aacctacgtctagcatcgaaatcgacg    010000000000000000000000000    θ1c θ2c · · · θwc
aattatgctacgtacgcggtcgtacgt    000000000000000000010000000    θ1g θ2g · · · θwg
                                                              θ1t θ2t · · · θwt

Figure 2.1: Graphical representation of the motif discovery parameters

A particular realization of A (i.e., a particular set of motif sites) allows us to break

our sequence data S into two parts: one which consists only of the bases in the motif

sites, and the complementary subset which consists of the remaining background bases.

We let N be the matrix of the counts of the different nucleotides in all of the motif

sites, i.e., Njk is the number of sites with nucleotide k (k = 1, . . . , 4) in position j of

the motif. For now, we assume that the motif width w is known so that N and Θ

have fixed dimensions of w× 4. We will discuss generalizations to variable motif

width later in Chapter 3.

As mentioned in Chapter 1, we assume that each motif site is an independent

realization from a Product-Multinomial distribution parameterized by Θ, which

means that each vector of column counts Nj = (Nj1, . . . , Nj4) independently follows

a multinomial distribution parameterized by θj = (θj1, . . . , θj4):

(Nj1, . . . , Nj4) ∼ Multinomial((θj1, . . . , θj4)).

The corresponding vector of background nucleotide counts is denoted by N0

where N0k is the count of nucleotide k in the background portion of the sequence

dataset. The simplest model for the background counts is that every background

nucleotide is an independent realization from a Multinomial distribution parameterized

by θ0:

(N01, . . . , N04) ∼ Multinomial((θ01, . . . , θ04)).
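
To make the bookkeeping concrete, the sketch below splits a set of sequences into the motif count matrix N and the background count vector N0 for a given indicator array A. It is a minimal illustration with hypothetical function and variable names. The sequences and site start positions are read off Figure 2.1 (0-based), while the motif width w = 5 is an assumption, since the figure does not state it.

```python
BASES = "acgt"
INDEX = {base: k for k, base in enumerate(BASES)}


def split_counts(sequences, indicators, w):
    """Return (N, N0): w x 4 motif counts and length-4 background counts."""
    motif_counts = [[0] * 4 for _ in range(w)]
    background_counts = [0] * 4
    for seq, row in zip(sequences, indicators):
        in_site = [False] * len(seq)
        for start, flag in enumerate(row):
            if flag == 1:  # A_ij = 1: a motif site starts at position j
                for j in range(w):
                    motif_counts[j][INDEX[seq[start + j]]] += 1
                    in_site[start + j] = True
        for pos, base in enumerate(seq):
            if not in_site[pos]:
                background_counts[INDEX[base]] += 1
    return motif_counts, background_counts


sequences = [
    "aaaacatcgatacctacttttggtcgt",
    "aacctacgtctagcatcgaaatcgacg",
    "aattatgctacgtacgcggtcgtacgt",
]
site_starts = [11, 1, 19]
indicators = [
    [1 if j == a else 0 for j in range(len(s))]
    for s, a in zip(sequences, site_starts)
]
motif_counts, background_counts = split_counts(sequences, indicators, w=5)
print(motif_counts)
print(background_counts)
```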

Viewing A as missing data, we can write down the likelihood of S as

$$p(S \mid \Theta, \theta_0, A) \propto \theta_0^{N_0} \times \prod_{j=1}^{w} \theta_j^{N_j}$$

with the notation $\theta_j^{N_j} = \prod_{k=1}^{4} \theta_{jk}^{N_{jk}}$. To enable a Bayesian analysis, we employ the

following conjugate prior distributions for Θ and θ0,

Θ ∼ ProductDirichlet(B = (β1, . . . ,βw)) and θ0 ∼ Dirichlet(β0)

where βj = (βj1, . . . , βj4). For a brief review of multinomial models with Dirichlet

prior distributions, refer to Gelman et al. (1995).

Our Dirichlet prior parameters B = (βjk) and β0 = (β0k) can be interpreted as

a matrix of pseudo-counts which are being added to our motif count matrix N

and background count vector N0. This can be seen in the conditional posterior

distribution

$$p(\Theta, \theta_0 \mid S, A, B, \beta_0) \propto \theta_0^{N_0+\beta_0-1} \times \prod_{j=1}^{w} \theta_j^{N_j+\beta_j-1}.$$
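
The pseudo-count interpretation can be checked numerically: for a single motif column the conditional posterior is a Dirichlet distribution with parameters Njk + βjk, so its mean is the smoothed frequency (Njk + βjk)/(|Nj| + |βj|). The sketch below is illustrative only; the function names are hypothetical, the counts are the first row of Table 1.2, and the pseudo-counts of 0.25 are an assumed choice.

```python
import random


def posterior_parameters(counts, pseudo):
    """Dirichlet posterior parameters for one column: counts plus pseudo-counts."""
    return [n + b for n, b in zip(counts, pseudo)]


def posterior_mean(counts, pseudo):
    """Mean of the Dirichlet posterior, i.e. the pseudo-count-smoothed frequency."""
    alpha = posterior_parameters(counts, pseudo)
    total = sum(alpha)
    return [a / total for a in alpha]


def sample_dirichlet(alpha, rng):
    """Draw from Dirichlet(alpha) by normalizing independent Gamma(alpha_k, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


counts = [0, 4, 7, 1]   # first motif position of Table 1.2 (A, C, G, T)
pseudo = [0.25] * 4     # assumed pseudo-counts
mean = posterior_mean(counts, pseudo)
draw = sample_dirichlet(posterior_parameters(counts, pseudo), random.Random(0))
print(mean)
print(draw)
```

Drawing every column of Θ this way is what a sampler needs when Θ is updated explicitly given the current site locations.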

We can consider more general models for our background counts N0, such as

modelling each background nucleotide as a realization from an l-th order Markov

chain (empirically, l = 3 works best). In this more general situation, we can

write the above model as

$$p(\Theta, \theta_0 \mid S, A, B) \propto p(N_0 \mid \theta_0) \times p(\theta_0) \times \prod_{j=1}^{w} \theta_j^{N_j+\beta_j-1},$$

where θ0 now denotes the parameters in the background Markov model, and

p(θ0) is some prior distribution for these parameters.

In general, it is relatively easy to estimate the background parameters θ0 since the

vast majority of the sequence dataset is background sequence. For this reason, we

will assume that our background parameters θ0 are fixed and known a priori. For

simplicity of exposition, we will assume the simple Multinomial model for N0 as

a realization of θ0, though the models that follow are easily generalized to more

complicated background models.

The model has thus far been described as conditional on a particular set of known

motif sites A, but in reality the matrix of site locations A is also unknown and

should also be considered as a set of random variables. In the Bayesian frame-

work, we prescribe a particular prior distribution for our unknown A, which we

will assume is a priori independent of our other set of unknown parameters, the

motif frequency matrix Θ.

In following sections, we will describe several specific prior distributions for A,

but generally, we now have the following joint posterior distribution of our un-

known motif site locations and motif frequency matrix:

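$$\begin{aligned}
p(\Theta, A \mid S, \theta_0, B) &\propto p(S \mid \Theta, A, \theta_0) \times p(\Theta \mid B) \times p(A) \\
&\propto \theta_0^{N_0} \times \prod_{j=1}^{w} \theta_j^{N_j+\beta_j-1} \times p(A).
\end{aligned}$$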

Our goal for motif discovery is inference based upon this joint posterior distribu-

tion. For those more comfortable with the likelihood framework, this posterior

inference is equivalent to maximum likelihood inference under vague prior infor-

mation. There are advantages to using the Bayesian framework, however, since

it allows for the easy incorporation of prior information and for removal of nui-

sance parameters.

2.2 Markov Chain Monte Carlo Implementation

In a typical data augmentation-based Gibbs sampling algorithm (Tanner and

Wong, 1987), the desired posterior distribution p(Θ, A | S, θ0, B) can be simulated

by starting with arbitrary initial values $\Theta^{(0)}$ of the unknown parameters, and

then for t = 0, 1, . . ., iteratively sampling from the two conditional distributions:

1. $p(A^{(t)} \mid \Theta^{(t)}, \theta_0, S, B)$;

2. $p(\Theta^{(t+1)} \mid A^{(t)}, \theta_0, S, B)$.

Given enough time steps, the draws simulated in this fashion will converge to

draws from the desired posterior distribution.

Typically, we are most interested in the draws from p(A | S, θ0,B) which would

indicate the most likely positions of the unknown conserved sites. For this reason,

and since Θ is a high-dimensional (w × 4) matrix, drawing the Θ parameters at every

iteration can be both time-consuming and inefficient.

As demonstrated by Liu (1994), the algorithm can be improved by integrating

over Θ so that we can simulate draws via Gibbs sampling from the posterior

distribution p(A | S, θ0,B) directly, where

$$p(A \mid S, \theta_0, B) = \int p(\Theta, A \mid S, \theta_0, B)\, d\Theta.$$
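
This integral is available in closed form because the Dirichlet prior is conjugate to the multinomial likelihood. As a worked step for a single motif column j, integrating over the probability simplex and using the standard Dirichlet normalizing constant gives the identity below; the shorthand |Nj| for the column total is introduced here only for this illustration.

```latex
\int \prod_{k=1}^{4} \theta_{jk}^{N_{jk}+\beta_{jk}-1} \, d\theta_j
  = \frac{\prod_{k=1}^{4} \Gamma(N_{jk}+\beta_{jk})}{\Gamma\!\left(|N_j|+|\beta_j|\right)},
\qquad |N_j| = \sum_{k=1}^{4} N_{jk}, \quad |\beta_j| = \sum_{k=1}^{4} \beta_{jk}.
```

Since the w columns of Θ are independent a priori and enter the likelihood separately, the integral over Θ is the product of w such terms, which is the source of the Gamma functions in the marginal distributions that follow.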

We now give variations on the basic motif model under different assumptions

and the algorithmic consequences of these assumptions. First, we present the

simplest model where the total number of sites is fixed. Then, we present an

improved model where the total number of sites is allowed to vary. We briefly

discuss extending the model to multiple motifs. Finally, we discuss relaxing the

assumptions of fixed motif abundance and motif width.

2.3 Fixed Number of Sites in A

In the early methods (e.g., Lawrence and Reilly (1990); Cardon and Stormo (1992);

Lawrence et al. (1993)), it was assumed that each sequence must contain one and

only one motif site, which corresponds to assuming that Aij = 0 for all but one

entry in the ith row. Thus, no explicit prior distribution for A is needed if we

suppose that the motif site can be anywhere in the sequence with equal probabil-

ities. These algorithms, as described in Lawrence et al. (1993) and Liu (1994), are

based on the following assumptions:

(a) there is only one type of motif present in the sequence data, and

(b) there is one and only one motif site per sequence.

In this case, the missing indicator array A reduces to a vector a = (a1, . . . , am)

where ai gives the location of the single site within sequence i (out of m total

sequences).

The marginal posterior distribution of interest, p(A | S, θ0,B), can be simulated

by drawing iteratively from the distribution of each ai conditional on the site

locations in the other sequences, A∗,

$$p(a_i \mid A^*, \theta_0, S, B) \propto \prod_{k=1}^{4} \left( \frac{\theta_{0k}^{N_{0k}}}{\theta_{0k}^{N^*_{0k}}} \right) \times \prod_{j=1}^{w} \frac{\Gamma(m + |\beta_j|)}{\Gamma(m - 1 + |\beta_j|)} \frac{\prod_k \Gamma(N_{jk} + \beta_{jk})}{\prod_k \Gamma(N^*_{jk} + \beta_{jk})} \approx \prod_{j=1}^{w} \left( \frac{\hat{\theta}_{j r_j}}{\theta_{0 r_j}} \right) \qquad (2.1)$$

where the site starting at ai is (r1, . . . , rw), N∗ and N∗0 are the motif and background

counts from the (m − 1) sequences besides sequence i, and |βj| is the total

number of prior pseudo-counts added to position j of the motif matrix. $\hat{\theta}_{jk}$ is

the best estimate of the motif frequencies θjk from the (m − 1) sequences besides

sequence i,

$$\hat{\theta}_{jk} = \frac{N^*_{jk} + \beta_{jk}}{m - 1 + |\beta_j|},$$

which is also given in Lawrence et al. (1993). Γ(·) is the Gamma function (Γ(x) =

(x − 1)! for integer x), which results from integrating Θ out of our full conditional

posterior distribution p(A, Θ | S, θ0, B).

Thus, ai can be randomly drawn from all possible starting points in sequence i

with probability proportional to p(ai | A∗, θ0,S,B) given in (2.1), in either exact

or approximate form.
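
The predictive update can be turned into a small sampler. The following is a minimal sketch under stated assumptions, not the published site sampler: it uses the approximate form of (2.1) with a known background θ0, a symmetric pseudo-count `beta` for every base (an assumption), and no phase-shift correction. The planted-motif data are simulated purely for illustration, and all function names are hypothetical.

```python
import random

BASES = "ACGT"
IDX = {b: k for k, b in enumerate(BASES)}


def site_counts(seqs, starts, w, skip):
    """Motif count matrix N* built from every sequence except `skip`."""
    counts = [[0] * 4 for _ in range(w)]
    for i, (seq, a) in enumerate(zip(seqs, starts)):
        if i == skip:
            continue
        for j in range(w):
            counts[j][IDX[seq[a + j]]] += 1
    return counts


def predictive_weights(seq, counts, w, m, beta, theta0):
    """Unnormalized p(a_i = start | other sites): prod_j theta_hat[j][r_j] / theta0[r_j]."""
    beta_total = 4 * beta
    weights = []
    for start in range(len(seq) - w + 1):
        ratio = 1.0
        for j in range(w):
            k = IDX[seq[start + j]]
            theta_hat = (counts[j][k] + beta) / (m - 1 + beta_total)
            ratio *= theta_hat / theta0[k]
        weights.append(ratio)
    return weights


def gibbs_site_sampler(seqs, w, theta0, beta=0.5, sweeps=200, seed=0):
    """One site per sequence: resample each a_i in turn given all other sites."""
    rng = random.Random(seed)
    m = len(seqs)
    starts = [rng.randrange(len(s) - w + 1) for s in seqs]
    for _ in range(sweeps):
        for i, seq in enumerate(seqs):
            counts = site_counts(seqs, starts, w, skip=i)
            weights = predictive_weights(seq, counts, w, m, beta, theta0)
            starts[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts


# Simulated data: a made-up motif planted once in each random sequence.
data_rng = random.Random(1)
motif = "TTGACA"
seqs = []
for _ in range(8):
    left = "".join(data_rng.choice(BASES) for _ in range(data_rng.randrange(5, 15)))
    right = "".join(data_rng.choice(BASES) for _ in range(data_rng.randrange(5, 15)))
    seqs.append(left + motif + right)

starts = gibbs_site_sampler(seqs, w=len(motif), theta0=[0.25] * 4)
print([s[a:a + len(motif)] for s, a in zip(seqs, starts)])
```

Because the sampler is stochastic, different seeds can settle on different alignments, including shifted versions of the planted motif.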

Liu (1994) gives a version of the distribution (2.1) in the case where θ0 is unknown

with prior distribution Dirichlet(β0 = (β01, . . . , β04)),

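$$p(a_i \mid A^*, S, B, \beta_0) \propto \frac{\Gamma(|N_0| + |\beta_0|)}{\Gamma(|N^*_0| + |\beta_0|)} \frac{\prod_k \Gamma(N_{0k} + \beta_{0k})}{\prod_k \Gamma(N^*_{0k} + \beta_{0k})} \times \prod_{j=1}^{w} \frac{\Gamma(m + |\beta_j|)}{\Gamma(m - 1 + |\beta_j|)} \frac{\prod_k \Gamma(N_{jk} + \beta_{jk})}{\prod_k \Gamma(N^*_{jk} + \beta_{jk})} \approx \prod_{j=1}^{w} \left( \frac{\hat{\theta}_{j r_j}}{\hat{\theta}_{0 r_j}} \right) \qquad (2.2)$$

where |N0| is the total number of background counts in all m sequences and |N∗0|

is the total number of background counts in the m − 1 sequences excluding sequence i.

$\hat{\theta}_{0k}$ is the best estimate of the background frequencies θ0k from the m − 1

sequences besides sequence i,

$$\hat{\theta}_{0k} = \frac{N^*_{0k} + \beta_{0k}}{|N^*_0| + |\beta_0|}.$$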

To avoid being trapped in a phase-shift mode, they also included a Metropolis

step to allow for all the motif sites to move to the left or right by a few positions.

2.4 Unrestricted Model for A

As pointed out in Liu et al. (1995), it is often too restrictive an assumption to hold

the total number of unknown sites as fixed and known.

In this unrestricted model, we consider each Aij as an independent random in-

dicator variable with an a priori probability p0 that Aij = 1 (and hence is a motif

start site). This probability p0 is referred to as the motif abundance parameter.

Since each Aij is independent, we allow for the possibility that some sequences

will have multiple motif sites (i.e., several Aij = 1 in sequence i) as well as the possibility

that some sequences may have no motif sites (i.e., all Aij = 0 in sequence

i).

This flexibility to allow some sequences to contain no sites is especially important

when analysing sequences within studies where many sequences in a dataset

could be “false-positives”, as described in Section 1.1.

Under this model, the full posterior distribution of our unknown Θ and A is

$$p(A, \Theta \mid S, \theta_0, p_0, B) \propto p_0^{|A|} (1 - p_0)^{L - |A|} \times \theta_0^{N_0} \times \prod_{j=1}^{w} \theta_j^{N_j + \beta_j - 1} \qquad (2.3)$$

where |A| is the total number of sites, now assumed to be unknown. The quantity

L = N − (w − 1)m, where N is the total number of nucleotides and m is the number

of sequences. L is the total number of possible site positions, since sites are not

allowed to overlap the ends of a sequence.

Integrating out Θ, we have our marginal posterior distribution of interest

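$$p(A \mid S, \theta_0, p_0, B) \propto p_0^{|A|} (1 - p_0)^{L - |A|} \times \theta_0^{N_0} \times \prod_{j=1}^{w} \frac{\prod_k \Gamma(N_{jk} + \beta_{jk})}{\Gamma(|A| + |\beta_j|)} \qquad (2.4)$$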
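
Because (2.4) depends on A only through the counts, it can be evaluated directly for any candidate set of sites. The sketch below computes its logarithm up to an additive constant using log-Gamma functions for numerical stability. It is an illustration with hypothetical names; the counts, p0, and pseudo-counts are made-up values, and the same background counts are passed to both calls only to isolate the effect of the motif term.

```python
import math


def count_possible_positions(lengths, w):
    """L = N - (w - 1) m: start positions that keep a width-w site inside its sequence."""
    return sum(lengths) - (w - 1) * len(lengths)


def log_marginal_posterior(motif_counts, background_counts, theta0, p0, n_positions, beta):
    """Logarithm of (2.4), up to an additive constant; `beta` is a w x 4 pseudo-count matrix."""
    n_sites = sum(motif_counts[0])  # |A|: every motif column sums to the number of sites
    log_p = n_sites * math.log(p0) + (n_positions - n_sites) * math.log(1.0 - p0)
    log_p += sum(n * math.log(t) for n, t in zip(background_counts, theta0))
    for counts_j, beta_j in zip(motif_counts, beta):
        log_p += sum(math.lgamma(n + b) for n, b in zip(counts_j, beta_j))
        log_p -= math.lgamma(n_sites + sum(beta_j))
    return log_p


w = 5
n_positions = count_possible_positions([27, 27, 27], w)
beta = [[0.5] * 4 for _ in range(w)]
theta0 = [0.25] * 4
background = [17, 16, 16, 17]

conserved = [[3, 0, 0, 0], [0, 3, 0, 0], [0, 3, 0, 0], [0, 0, 0, 3], [3, 0, 0, 0]]
diffuse = [[1, 1, 1, 0] for _ in range(w)]

score_conserved = log_marginal_posterior(conserved, background, theta0, 0.01, n_positions, beta)
score_diffuse = log_marginal_posterior(diffuse, background, theta0, 0.01, n_positions, beta)
print(score_conserved, score_diffuse)
```

A set of sites whose columns are conserved receives a higher value than an equally sized set with scattered bases, which is the sense in which this quantity ranks candidate alignments.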

Liu et al. (1995) considered θ0 as unknown in which case the marginal posterior

distribution of interest is

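\[
p(A \mid S, p_0, B, \beta_0) \propto p_0^{|A|} (1 - p_0)^{L - |A|} \times \frac{\prod_{k} \Gamma(N_{0k} + \beta_{0k})}{\Gamma(|N_0| + |\beta_0|)} \times \prod_{j=1}^{w} \frac{\prod_{k} \Gamma(N_{jk} + \beta_{jk})}{\Gamma(|A| + |\beta_j|)} \tag{2.5}
\]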

and, from this distribution, constructed a predictive updating algorithm using

the probability ratio

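\[
\frac{p(A_{ij} = 1 \mid A^*, S, p_0, B, \beta_0)}{p(A_{ij} = 0 \mid A^*, S, p_0, B, \beta_0)} \propto \frac{p_0}{1 - p_0} \times \prod_{j=1}^{w} \left( \frac{\hat\theta_{j r_j}}{\hat\theta_{0 r_j}} \right) \tag{2.6}
\]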

where (r1, . . . , rw) is the site sequence starting at Aij and A∗, θjk, θ0k are the same

as in the previous section.
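The odds in (2.6) can be sketched in a few lines of Python. This is only an illustration, not the implementation of Liu et al. (1995): the function name, the dictionary-based count layout, and the assumption that all counts already exclude the candidate site are hypothetical choices made for the example.

```python
ALPHABET = "ACGT"

def predictive_odds(site, motif_counts, bg_counts, beta, beta0, p0):
    """Odds that `site` is a motif site rather than background, as in (2.6).

    motif_counts[j][k] and bg_counts[k] are assumed to exclude the candidate
    site itself; beta[j][k] and beta0[k] are the pseudo-counts.
    """
    bg_total = sum(bg_counts[k] + beta0[k] for k in ALPHABET)
    odds = p0 / (1.0 - p0)
    for j, r in enumerate(site):
        col_total = sum(motif_counts[j][k] + beta[j][k] for k in ALPHABET)
        theta_j = (motif_counts[j][r] + beta[j][r]) / col_total   # estimate of theta_{j r_j}
        theta_0 = (bg_counts[r] + beta0[r]) / bg_total            # estimate of theta_{0 r_j}
        odds *= theta_j / theta_0
    return odds

# Toy example: a width-3 motif whose 8 current sites are all "AAA".
w = 3
counts = [{"A": 8, "C": 0, "G": 0, "T": 0} for _ in range(w)]
beta = [{k: 1.0 for k in ALPHABET} for _ in range(w)]
bg = {k: 100 for k in ALPHABET}
beta0 = {k: 1.0 for k in ALPHABET}
print(predictive_odds("AAA", counts, bg, beta, beta0, p0=0.01))
```

A candidate matching the conserved columns gets odds well above those of a mismatching candidate such as "CCC", which is the behaviour the predictive update relies on.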

2.5 Dealing with Multiple Motif Types

Although this situation is not the focus of this thesis, it is worth mentioning that

the model (2.3) can be extended to the situation where we suspect that multiple

distinct motif patterns exist in the same set of sequences. The simplest strategy is

to introduce more motif matrices, one for each motif type, and to let the variable

Aij indicate not only the start of a motif site, but also the motif type (Liu et al.,

1995). Another strategy is to mask out the discovered sites of the first motif and

repeat the usual motif-finding procedure (Roth et al., 1998).

As pointed out in Lawrence et al. (1993), searching for several patterns simulta-

neously permits the sharing of information between them to aid in the discovery


of unknown sites of each. They present a multiple-motif version of the multino-

mial sampler, where the multiple motifs are restricted to have the same ordering

(collinearity) between different sequences. Potential modeling of the spacing be-

tween motifs is also mentioned but not implemented.

Liu et al. (1999) mention that this early model for collinearity is computationally

inefficient, and propose that the models for a single motif be combined with a

Hidden Markov Model (HMM) for insertions/deletions between different

motifs. This unified model, called the propagation model, capitalizes on the

collinearity properties inherent to hidden Markov models but does not require the large

number of free parameters that a typical HMM would. There is the additional

model selection issue (Gelman et al. (1995); Kass and Raftery (1995)) for deter-

mining the appropriate total number of different motif patterns.

More recently, Xing et al. (2003) presented LOGOS, a hidden Markov model for

the occurrence of multiple motifs combined with a separate hierarchical Bayesian

Markovian model for each different motif. Frith et al. (2003) introduce software,

Cluster-Buster, which combines the information from known motif patterns to

find dense clusters of motifs in genome-wide searches.

2.6 Extensions of the Bayesian Motif Model

In many situations, very little information is known a priori about either the motif

abundance or the motif width. In these cases, it is preferable to treat both quanti-

ties as random variables instead of fixed and known quantities. As well, we can

consider extending the model beyond the concept of a single block of contiguous

nucleotides.


2.6.1 Variable motif abundance p0

The statistical model summarized by (2.4) assumes known motif site abundance

p0. However, in practice one might not have a very good idea how many motif

sites to expect in a given dataset. Current motif-finding algorithms often use ad

hoc estimates of p0, such as assuming a particular number of sites per sequence.

With our continued focus on full Bayesian modeling, we instead consider p0 as a

random variable with a Beta(a, b) prior distribution. Jensen et al. (2004) demon-

strate, via a simulation study, that treating p0 as a random variable leads to better

performance than using a fixed p0.

If we assume that the motif abundance ratio p0 is unknown with a Beta(a, b) prior

distribution, then the full posterior distribution (2.3) becomes

\[
p(A, \Theta, p_0 \mid S, \theta_0, B) \propto p_0^{|A| + a - 1} (1 - p_0)^{L - |A| + b - 1} \times \theta_0^{N_0} \times \prod_{j=1}^{w} \theta_j^{N_j + \beta_j - 1} \tag{2.7}
\]

A specific prior distribution for p0 would be a Uniform(0, 1) distribution, which

corresponds to a Beta(1, 1) distribution. This prior distribution is non-informative

in the sense that it will have very little influence on the results compared to the

influence of the observed sequence data.

We can then integrate out the random variable p0 as well as the parameters Θ to

get

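\[
p(A \mid S, \theta_0, B) \propto B_{a,b}(|A|, L - |A|) \times \theta_0^{N_0} \times \prod_{j=1}^{w} \frac{\prod_{k} \Gamma(N_{jk} + \beta_{jk})}{\Gamma(|A| + |\beta_j|)} \tag{2.8}
\]

where Ba,b(c, d) is the Beta function

\[
B_{a,b}(c, d) = \frac{\int_0^1 x^{a + c - 1} (1 - x)^{b + d - 1} \, dx}{\int_0^1 x^{a - 1} (1 - x)^{b - 1} \, dx}.
\]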

This marginal posterior distribution can be used to construct a predictive

updating algorithm analogous to the one based on (2.6).
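The term Ba,b(c, d) in (2.8) is a ratio of two Beta integrals, so its logarithm can be evaluated with log-gamma functions. The sketch below uses only `math.lgamma` from the Python standard library; the helper names `log_beta` and `log_B` are hypothetical and chosen for this example.

```python
from math import lgamma, log

def log_beta(x, y):
    """log of the Beta integral: log Gamma(x) + log Gamma(y) - log Gamma(x + y)."""
    return lgamma(x) + lgamma(y) - lgamma(x + y)

def log_B(c, d, a=1.0, b=1.0):
    """log B_{a,b}(c, d) = log Beta(a + c, b + d) - log Beta(a, b)."""
    return log_beta(a + c, b + d) - log_beta(a, b)

# With the Uniform(0, 1) prior (a = b = 1), |A| = 2 sites and L = 100 positions:
print(log_B(2, 98))
```

Working on the log scale avoids underflow, since the Beta integrals become extremely small when L is in the thousands.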


2.6.2 Variable motif width w

Liu et al. (1995) suggest that the assumption of fixed motif width w can be relaxed

somewhat to allow so-called fragmentation of motifs. In a fragmentation model,

only J columns of a motif of width w are selected to form the motif pattern.

This is accomplished by positing additional missing indicator variables for

whether or not each of the w positions of a motif is considered part of a conserved

motif pattern. This new missing data can be incorporated into a larger model

and a Gibbs sampling strategy can again be used for implementation. This frag-

mentation model is useful for correcting the problem that earlier Gibbs sampling

strategies could get stuck in local modes that were phase-shifted versions of the

true signal.

A slightly different approach to correcting this same phase shift problem is to

insert a Metropolis step within the Gibbs sampler that shifts each motif in one

direction or the other (Liu, 1994).

If we view w as an unknown variable and treat it directly, then we face a Bayesian

model selection problem (Gelman et al., 1995) since, for different width w, the

dimensionality of the motif parameter Θ is different. Lawrence et al. (1993) use

an ad hoc information per parameter criterion to select the best motif width.

Noting that Θ can be integrated out from the model to avoid the dimensionality

change, Gupta and Liu (2003) place a prior distribution on w, and use a Metropo-

lis step to update w based on the joint distribution.

If we posit w as a random variable with a prior distribution p(w), then our marginal

posterior distribution (2.8) becomes

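\[
p(A, w \mid S, \theta_0, B) \propto p(w) \times B_{a,b}(|A|, L - |A|) \times \theta_0^{N_0} \times \prod_{j=1}^{w} \left[ \frac{\prod_{k} \Gamma(N_{jk} + \beta_{jk})}{\Gamma(|A| + |\beta_j|)} \cdot \frac{\Gamma(|\beta_j|)}{\prod_{k} \Gamma(\beta_{jk})} \right] \tag{2.9}
\]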

which has both A and w as unknown variables. A possible prior distribution for

w is the Poisson(w0) distribution, with w0 representing an a priori expectation

for the motif width. Other possible prior distributions are the Geometric(w0) or

Exponential(w0).

2.6.3 Two-Block Motifs

We consider a final extension for the possibility that a particular regulatory pro-

tein binds to the DNA strand in two places instead of just one. In this case, the

binding motif can be summarized by two conserved blocks that are separated by

a gap of non-conserved nucleotides that can vary slightly in length, as depicted

in Figure 2.2.

[Figure 2.2: Graphical representation of a two-block motif. Block 1 (width w1) and Block 2 (width w2) are separated by a gap of width g.]

We let Θ1 and Θ2, with width w1 and w2, be the frequency matrices of the two

motif blocks, respectively. If we assume that the nucleotide compositions of the two

blocks are independent of each other, it is not difficult to extend our Bayesian

model to accommodate the two-block motifs.


The only complication is that we must account for the gap between the two

blocks, which can be of different length between different sites. If our current

configuration of A has m sites, the gap lengths of these two-block motif sites are

denoted as G = (g1, . . . , gm). We assume a priori that each gi is independent and

that gi ∼ Uniform(G1, G2). In other words, each gap length can be anywhere from

a minimum of G1 to a maximum of G2, with equal probabilities for each gap in

that range.

Due to the rotation of the DNA double-helix, in many studies G1 and G2 are

typically separated by about 3 nucleotides. We now have a marginal posterior

distribution where A, G and the motif widths w1 and w2 are all allowed to vary,

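\[
p(A, G, w_1, w_2 \mid S, \theta_0, B) \propto p(w_1) \times p(w_2) \times B_{a,b}(|A|, L - |A|) \times \theta_0^{N_0} \times \prod_{j=1}^{w_1 + w_2} \left[ \frac{\prod_{k} \Gamma(N_{jk} + \beta_{jk})}{\Gamma(|A| + |\beta_j|)} \cdot \frac{\Gamma(|\beta_j|)}{\prod_{k} \Gamma(\beta_{jk})} \right] \tag{2.10}
\]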

with the implicit restriction that each gi lies within the interval [G1, G2].


Chapter 3

Scoring Function Optimization

As described in Chapter 1, there are several existing motif-finding programs that

are more or less related to the models presented in Chapter 2. However, each

algorithm does differ in various parameter settings and model assumptions and

in many cases the user does not have the freedom to alter these settings.

As a consequence, the performance of each program will vary between different

sequence datasets and each program could give different sets of predicted sites.

In addition, most of these programs are stochastic algorithms, so independent

runs within the same program on the same dataset can also lead to different sets

of predicted sites.

Practitioners are disconcerted by these differences, since they lack the means by

which to compare the sets of predicted sites from different programs. Thus, our

initial motivation for this research was to provide a simple but principled rule

for deciding, out of a collection of different configurations of A (different sets of

predicted sites), which configuration of A was the “best”.

In this situation where the single “best” answer to a motif-finding problem is

desired (i.e. the “best” set of site predictions or the “best” consensus matrix),


our goal is to find the optimal value of a particular scoring function. Under our

Bayesian formulation, we focus on scoring functions which are values of an ap-

propriate posterior distribution.

This scoring function formulation enables us to quantify the “goodness” of dif-

ferent configurations of A in terms of their fit to our posterior distribution (and

hence our Bayesian probability model).

Because of the need for a speedy algorithm, it is sensible to seek strategies, such

as optimizing a scoring function, instead of a full posterior analysis. In addition,

due to the intrinsic presence of multiple modes in the marginal distribution of A,

summarizing this distribution with a posterior mean or posterior interval can be

misleading. This is because Gibbs sampling chains started from different initial

values can get stuck in different modes, leading to a posterior mean estimate

which might not be in an area of high posterior mass.

Here we examine several scoring functions that have been used in practice to

evaluate a discovered motif, as well as some novel generalizations.

3.1 Bayesian scoring functions

We begin by assuming for now that the motif width w and the abundance ratio p0

are known, in addition to our running assumption that the background parame-

ters θ0 are fixed and known.

For simplicity, we also assume that the number of prior counts in each column

of the motif matrix is constant, i.e., |βj| = |β| for all j. In each scoring function,

we ignore the collection of terms that are constant with respect to the unknown

parameters.


The first scoring function is the exact log-posterior marginal density for A:

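\[
\begin{aligned}
\psi_{\mathrm{exact}}(A) &= \log p(A \mid S, \theta_0, p_0, w, B) \\
&= |A| \, \mathrm{logit}(p_0) + \sum_{k} N_{0k} \log \theta_{0k} + \sum_{j=1}^{w} \log \left[ \frac{\prod_{k} \Gamma(N_{jk} + \beta_{jk})}{\Gamma(|A| + |\beta|)} \right]
\end{aligned}
\tag{3.1}
\]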
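As a rough illustration, the exact score (3.1) can be computed directly from count tables. The sketch below assumes, as in the text, the same pseudo-counts in every column and non-overlapping sites, so that every column's counts sum to |A|. The function name and the dictionary layout are hypothetical.

```python
from math import lgamma, log

ALPHABET = "ACGT"

def psi_exact(motif_counts, bg_counts, theta0, beta, p0):
    """Exact log-posterior score (3.1), up to an additive constant."""
    n_sites = sum(motif_counts[0].values())            # |A|
    beta_total = sum(beta.values())                    # |beta|
    score = n_sites * log(p0 / (1.0 - p0))             # |A| logit(p0)
    score += sum(bg_counts[k] * log(theta0[k]) for k in ALPHABET)
    for column in motif_counts:                        # one term per motif position j
        score += sum(lgamma(column[k] + beta[k]) for k in ALPHABET)
        score -= lgamma(n_sites + beta_total)
    return score

theta0 = {k: 0.25 for k in ALPHABET}
beta = {k: 1.0 for k in ALPHABET}
one_column = [{"A": 2, "C": 0, "G": 0, "T": 0}]
no_background = {k: 0 for k in ALPHABET}
print(psi_exact(one_column, no_background, theta0, beta, p0=0.5))
```

Only differences between scores of competing configurations are meaningful, because terms constant in A have been dropped.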

Although this exact scoring function may not appear very intuitive to the reader,

it is closely related to the following intuitive scoring function through a series of

approximations including Stirling’s formula (Stirling, 1730),

\[
\Gamma(x + 1) = x! \approx x^{x} e^{-x} (2 \pi x)^{1/2} \tag{3.2}
\]

Using Stirling’s formula (3.2), we can approximate ψexact as

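\[
\begin{aligned}
\psi_{\mathrm{stir}}(A) &= |A| \, \mathrm{logit}(p_0) - \tfrac{3}{2} \, w \log(|A| + |\beta| - 1) \\
&\quad + \sum_{j=1}^{w} \sum_{k} \left[ \left( N_{jk} + \beta_{jk} - \tfrac{1}{2} \right) \log \left( \frac{N_{jk} + \beta_{jk} - 1}{|A| + |\beta| - 1} \right) - N_{jk} \log \theta_{0k} \right] \\
&\approx |A| \left[ \mathrm{logit}(p_0) + \sum_{j=1}^{w} \sum_{k} \hat\theta_{jk} \log \left( \frac{\hat\theta_{jk}}{\theta_{0k}} \right) \right] - \tfrac{3}{2} \, w \log(|A| + |\beta| - 1),
\end{aligned}
\tag{3.3}
\]

where θ̂jk = (Njk + βjk)/(|A| + |β|).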

Our empirical results show that the Stirling scoring function ψstir tracks ψexact

very well for realistic values of |A| and Njk.
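The quality of Stirling's formula (3.2) is easy to check numerically. The sketch below compares it with the log-gamma function from the Python standard library; it illustrates only the approximation of a single Γ(·) term, not the empirical comparison of ψstir and ψexact reported in the thesis, and the function name is hypothetical.

```python
from math import lgamma, log, pi

def stirling_lgamma(x):
    """Stirling approximation (3.2) to log Gamma(x), valid for x > 1."""
    n = x - 1.0                      # Gamma(x) = n! with n = x - 1
    return n * log(n) - n + 0.5 * log(2.0 * pi * n)

for x in (3.0, 10.0, 50.0):
    exact = lgamma(x)
    approx = stirling_lgamma(x)
    print(x, exact, approx, exact - approx)
```

The absolute error shrinks as the argument grows, so the approximation is weakest for columns with very small counts Njk + βjk.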

Another scoring function approximation that we can consider is based on the en-

tropy distance between the frequency matrix entries θjk and the fixed background

frequencies θ0k

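\[
\psi_{\mathrm{ent}}(A) = |A| \left[ \mathrm{logit}(p_0) + \sum_{j=1}^{w} \sum_{k} \hat\theta_{jk} \log \left( \frac{\hat\theta_{jk}}{\theta_{0k}} \right) \right] \tag{3.4}
\]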

Compared with this heuristic-based scoring function, ψstir has an additional term,

−(3/2) w log(|A| + |β| − 1), which gives an additional penalty to a large number of motif sites.


The entropy distance is also called the Kullback-Leibler information (for discrete

measures) in the statistics literature (Kullback and Leibler, 1951). A form similar

to the entropy scoring function is mentioned in Lawrence et al. (1993).
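The entropy score (3.4) can be sketched as follows, again assuming identical pseudo-counts in every column and strictly positive pseudo-counts so that no estimated frequency is zero. The function name and data layout are hypothetical.

```python
from math import log

ALPHABET = "ACGT"

def psi_ent(motif_counts, theta0, beta, p0):
    """Entropy score (3.4): |A| * (logit(p0) + summed Kullback-Leibler distance)."""
    n_sites = sum(motif_counts[0].values())            # |A|
    beta_total = sum(beta.values())                    # |beta|
    kl = 0.0
    for column in motif_counts:
        for k in ALPHABET:
            theta_hat = (column[k] + beta[k]) / (n_sites + beta_total)
            kl += theta_hat * log(theta_hat / theta0[k])
    return n_sites * (log(p0 / (1.0 - p0)) + kl)

theta0 = {k: 0.25 for k in ALPHABET}
beta = {k: 1.0 for k in ALPHABET}
conserved = [{"A": 20, "C": 0, "G": 0, "T": 0}]
flat = [{"A": 5, "C": 5, "G": 5, "T": 5}]
print(psi_ent(conserved, theta0, beta, 0.5), psi_ent(flat, theta0, beta, 0.5))
```

A column that matches the background composition contributes nothing to the Kullback-Leibler sum, while a conserved column contributes a positive amount.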

3.2 Non-Bayesian scoring functions

It is interesting to note that scoring functions related to the entropy approxima-

tion (3.4) have arisen in the motif-finding literature outside of the context of a

Bayesian formulation.

In developing their Consensus algorithm, Stormo and Hartzell (1989) introduced

a scoring function very similar to ψent which they call the information content:

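\[
\psi_{\mathrm{info}}(A) = \sum_{j=1}^{w} \sum_{k} \hat\theta_{jk} \log \frac{\hat\theta_{jk}}{\theta_{0k}}, \quad \text{where } \hat\theta_{jk} = \frac{N_{jk}}{m}. \tag{3.5}
\]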

Here m is the number of sequences. This function is equivalent to all the foregoing

scoring functions when the motif width w is assumed known and the

total number of motif sites |A| is assumed to be equal to the number of sequences

m, which was the case in Stormo and Hartzell (1989), Lawrence and Reilly (1990),

and Lawrence et al. (1993), and in the model presented in Section 2.3.

However, when |A| is unknown, function ψinfo cannot be used to find a proper

set of motif sites — it will converge to a set of very few motif sites with high con-

servation and ignore potential sites that are less conserved. A Bayesian remedy

is to give a prior distribution f(A), and then construct

\[
\psi'_{\mathrm{info}}(A) = \log f(A) + |A| \sum_{j=1}^{w} \sum_{k} \hat\theta_{jk} \log \frac{\hat\theta_{jk}}{\theta_{0k}}.
\]

This scoring function is nearly equivalent to the entropy one we have shown

earlier except that a more flexible prior of A is allowed here. A temptation here


is to use a prior on |A| directly, but this overlooks the “entropy number”, i.e., the

number of different A’s that can give rise to the same value of |A|.

Liu et al. (2002) present an algorithm called MDSCAN for motif-finding based not

only on sequence data but also on gene expression information from microarray

experiments. Since the true p0 is rarely known in practice, they propose to opti-

mize the following scoring function:

\[
\psi_{\mathrm{md}}(A) = \frac{\log(|A|)}{w} \sum_{j=1}^{w} \sum_{k} \hat\theta_{jk} \log \frac{\hat\theta_{jk}}{\theta_{0k}}. \tag{3.6}
\]

The functional form again shares some similarities to the entropy approximation

given above. Although function ψmd is not intended as an approximation to the

posterior distribution p(A | θ0, p0,S), it can still be used as a scoring function to

evaluate different configurations of A.

3.3 Optimizing a scoring function

Now that our scoring function formulation gives us a means by which to compare

the quality of different configurations of motif start sites A, we outline

a simple algorithm that searches for the configuration of start sites A with

the locally best possible score.

We accomplish our optimization of one of the scoring functions described above

by using a Metropolis algorithm-based approach (Metropolis and Ulam (1949);

Metropolis et al. (1953)).

In the Metropolis steps, we systematically scan through every element of the

matrix A, and decide whether the indicator variable at this position should be

“changed” to its opposite value. If Aij is a motif start site (Aij = 1), we remove


that site. If Aij = 0, we add a site starting at that position.

If we denote A′ as A with this change made, then we calculate the following

Metropolis ratio:

\[
r = \min\{1, \exp\{(\psi(A') - \psi(A)) / T\}\}
\]

The decision to accept the change or to keep A unchanged is made with proba-

bility r and 1 − r, respectively. The scoring function ψ can be taken as any of the

scores discussed earlier in this chapter. The parameter T is called the temperature

of the algorithm, with low temperatures restricting the algorithm to accept only

small jumps and high temperatures allowing for more freedom to move around

the parameter space.
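The acceptance rule can be sketched as below. The function names are hypothetical; the T = 0 branch implements the limiting behaviour in which a proposal is accepted only when it does not lower the score.

```python
import random
from math import exp

def acceptance_probability(delta, temperature):
    """r = min{1, exp(delta / T)}, where delta = psi(A') - psi(A)."""
    if delta >= 0:
        return 1.0                   # improvements (and ties) are always accepted
    if temperature == 0:
        return 0.0                   # limit T -> 0: never accept a worse score
    return exp(delta / temperature)

def accept(delta, temperature, rng=random):
    """Draw the accept/reject decision with probability r."""
    return rng.random() < acceptance_probability(delta, temperature)

print(acceptance_probability(-1.0, 1.0), acceptance_probability(-1.0, 4.0))
```

Returning 1.0 before evaluating the exponential also avoids overflow when the score improvement is large and the temperature is small.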

We focus on the following optimization strategy. The Temperature=0 strategy

forces the algorithm to accept only changes that immediately improve the score,

since forcing T to approach 0 then forces r to equal 0 if ψ(A′) < ψ(A) or r to equal

1 if ψ(A′) ≥ ψ(A). With this type of deterministic strategy, it is important that

we start the algorithm in an area near the mode of the density, or else our simple

hill-climbing algorithm is guaranteed to get stuck in an inferior local mode.

Therefore, one would first want to run the dataset through a “first-pass”

motif-finding program (e.g., BioProspector, Consensus, AlignAce, or Meme) which

would give a set of predicted sites that are near the area of high posterior density,

and then use these predicted sites as the starting point of a T = 0 optimization

algorithm. In this scenario, our optimization strategy is intended to “clean up”

the output produced by algorithms such as BioProspector, Consensus, AlignAce

or Meme.

Those with experience in the fields of statistical computing or physics will

recognize that our procedure is just a local hill-climbing method and can be

viewed as a special case of simulated annealing (with immediate freezing).

To optimize the scoring functions outlined in the previous section, we developed

a software package called BioOptimizer which is currently available for Unix plat-

forms. BioOptimizer takes as input both the sequence data as well as a starting

set of motif sites, such as those provided by BioProspector, Consensus, AlignAce

or Meme.

BioOptimizer systematically scans through every element of the matrix A, and

changes the indicator variable at the position Aij to its opposite value only if the

value of the scoring function is improved, i.e., it accepts the change from A to A′ only if

\[
\psi(A') - \psi(A) > 0
\]

where ψ is a scoring function from the previous sections.

BioOptimizer therefore introduces only small changes to A and accepts only changes

that immediately improve the score ψ, so if the algorithm is started near an inferior

local mode then it will converge only to that inferior local mode.

The output of BioOptimizer is the set of predicted sites with the locally best

possible score. When the exact scoring function (3.1) is used, this set of predicted

sites is also the locally best possible fit to our model; in that case, BioOptimizer is

essentially trying to improve the fit of the predicted sites to our Bayesian motif

discovery model.
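The scanning procedure described above can be sketched generically. This is not the BioOptimizer source code: `greedy_scan` is a hypothetical helper that takes any scoring function, and the assumption that sweeps are repeated until a full pass accepts no change is ours. The toy score used here merely counts disagreements with a known target so that the behaviour is easy to follow.

```python
def greedy_scan(A, score):
    """Flip the indicators of A one at a time, keeping only strict improvements."""
    A = [row[:] for row in A]                  # work on a copy of the start sites
    current = score(A)
    improved = True
    while improved:                            # repeat sweeps until nothing changes
        improved = False
        for row in A:
            for j in range(len(row)):
                row[j] = 1 - row[j]            # propose A' with this indicator flipped
                proposal = score(A)
                if proposal > current:
                    current = proposal         # accept: psi(A') - psi(A) > 0
                    improved = True
                else:
                    row[j] = 1 - row[j]        # reject: restore the old value
    return A, current

# Toy score (an assumption for illustration): minus the number of mismatches.
target = [[0, 1, 0, 0], [0, 0, 0, 1]]

def toy_score(A):
    return -sum(a != t for ra, rt in zip(A, target) for a, t in zip(ra, rt))

start = [[1, 0, 0, 0], [0, 0, 0, 1]]
best, best_score = greedy_scan(start, toy_score)
print(best, best_score)
```

With a real motif score the surface has many local modes, so the result depends on the starting configuration, which is why a first-pass program is used to supply it.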

If we use the exact scoring function (3.1), then the difference ψexact(A′)−ψexact(A)

that determines whether or not the change is accepted is given by the intuitive

and simple formula (3.7) in the case of adding a motif site, and a corresponding

formula (3.8) in the case of removing a motif site.


If we let A′ be the same as A except for the addition of one site with nucleotides

(r1, r2, . . . , rw), then the difference in exact scores reduces to

\[
\psi_{\mathrm{exact}}(A') - \psi_{\mathrm{exact}}(A) = \log \left( p \prod_{j=1}^{w} \frac{\hat\theta_{j r_j}}{\theta_{0 r_j}} \right) \tag{3.7}
\]

where p = (|A| + 1)/(N − |A|) and N is the total number of potential site locations.

For the site we are potentially adding, we can view the numerator of the product

in (3.7) as the probability of the site being a motif, while the denominator is the

probability of the site being background. Thus, we only accept the addition of the

site if the ratio of the probability of motif to background is greater than 1/p, the

reciprocal of the estimated motif abundance p, which has an intuitive appeal.

Removing a motif site involves a formula analogous to (3.7). If we instead let A′

be the same as A except for the removal of one site with nucleotides (r1, r2, . . . , rw),

then the difference in exact scores reduces to

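\[
\psi_{\mathrm{exact}}(A') - \psi_{\mathrm{exact}}(A) = \log \left( \frac{1}{p} \prod_{j=1}^{w} \frac{\theta_{0 r_j}}{\hat\theta_{j r_j}} \right) \tag{3.8}
\]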

where now p = |A|/(N − |A| + 1) and N is the total number of potential site

locations. Again, we see that accepting the change involves comparing the ratio

of the motif vs background to the estimated motif abundance.
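The two score differences can be sketched as follows. Two assumptions are made for this example and are not stated in the text: the estimated frequencies are computed from counts that exclude the site in question, and removal is treated as the exact reverse of adding the same site. The names `delta_add`, `delta_remove` and `n_positions` (the N of (3.7) and (3.8)) are hypothetical.

```python
from math import log

ALPHABET = "ACGT"

def delta_add(site, motif_counts, beta, theta0, n_sites, n_positions):
    """Score change (3.7) for adding `site` to a configuration with |A| = n_sites."""
    beta_total = sum(beta.values())
    delta = log((n_sites + 1) / (n_positions - n_sites))        # log p
    for j, r in enumerate(site):
        theta_hat = (motif_counts[j][r] + beta[r]) / (n_sites + beta_total)
        delta += log(theta_hat / theta0[r])
    return delta

def delta_remove(site, counts_without_site, beta, theta0, n_sites, n_positions):
    """Score change (3.8) for removing `site`; n_sites counts the site being removed."""
    return -delta_add(site, counts_without_site, beta, theta0,
                      n_sites - 1, n_positions)

theta0 = {k: 0.25 for k in ALPHABET}
beta = {k: 1.0 for k in ALPHABET}
counts = [{"A": 8, "C": 0, "G": 0, "T": 0}]       # width-1 motif with 8 sites
print(delta_add("A", counts, beta, theta0, n_sites=8, n_positions=100))
print(delta_add("C", counts, beta, theta0, n_sites=8, n_positions=100))
```

A positive value means the greedy T = 0 rule accepts the change. Both example values are negative, so neither addition would be accepted: even the matching nucleotide is not enough evidence in a motif only one position wide.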

Although BioOptimizer only does local optimization, it has two basic advan-

tages: (a) it can compare motifs predicted by different motif-finding algorithms

and find the best one among them; and (b) it further improves the motif predic-

tion resulting from any of the current algorithms we have tested, e.g., BioProspec-

tor, Consensus, AlignACE, and Meme. These existing algorithms are proficient

at finding good configurations of A, and each program has its own advantages

in real-data situations. In Chapter 4, we demonstrate that BioOptimizer has im-

proved the motif site predictions in almost all cases.


Briefly considering other Metropolis strategies, the Temperature=1 strategy is equi-

valent to sampling from the posterior distribution, if the score function is the ex-

act log-posterior. But for other types of score functions, this approach imposes a

target density on the parameter space, which may or may not be desirable. One

can run this algorithm over many iterations and analyze the Monte Carlo samples

so obtained. We did not implement this strategy because of an overlap of

the effort with previous approaches such as Gibbs Motif Sampler, AlignAce, and

BioProspector.

A Simulated Annealing (Kirkpatrick et al., 1983) strategy combines deterministic

and stochastic strategies by starting the algorithm at a high temperature such as

T = 4 and then slowly decreasing the temperature down to T = 0 as the al-

gorithm continues through many iterations through all positions of A. For the

current thesis, we restrict ourselves to the goal of the T = 0 strategy, i.e., de-

terministic improvement upon the output from current motif-finding algorithms

such as BioProspector.

3.4 Using Scoring Functions to Extend the Model

In addition to comparing and optimizing sets of predicted motif sites, the scor-

ing function formulation can also be used to extend our model to relax several

assumptions which are necessary for the current motif-finding programs such as

BioProspector, Consensus, AlignAce or Meme to operate. Several of these extensions

were discussed in the previous chapter, and the scoring function formulation

now allows us to implement these more general models.


3.4.1 Overlapping Motif Sites

The posterior distribution of interest (2.4) and the corresponding exact scoring

function (3.1) are based upon the implicit assumption that motifs are not allowed

to overlap each other and so any given nucleotide can contribute at most once to

the motif matrix. A simple modification to the exact scoring function (3.1) can be

made to allow some nucleotides to contribute more than once to the motif matrix

and thus correctly allow for overlapping motifs. The current implementation of

our BioOptimizer software does not have a restriction against overlapping motifs.

3.4.2 Unknown Motif Site Abundance

Most current motif-finding programs have an assumption of known motif abundance,

which is either fixed or can be entered by the user. In Section 2.6.1 we

described a posterior distribution with a motif abundance parameter p0 that was

not fixed and known, but was allowed to vary with a certain prior distribution

p0 ∼ Uniform(0, 1).

We can then integrate the random variable p0 out of our model,

which leaves us with a posterior distribution that no longer depends on a

pre-specified site abundance, and the corresponding scoring function

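\[
\psi'_{\mathrm{exact}}(A) = \log B_{1,1}(|A|, L - |A|) + \sum_{k=1}^{4} N_{0k} \log \theta_{0k} + \sum_{j=1}^{w} \log \left[ \frac{\prod_{k=1}^{4} \Gamma(N_{jk} + \beta_{jk})}{\Gamma(|A| + |\beta_j|)} \right] \tag{3.9}
\]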

where Ba,b(c, d) is the Beta function as in the previous chapter. Here again L =

N − (w − 1)m, where N is the total number of nucleotides and m is the number


of sequences. L is the total number of possible site positions, since sites are not

allowed to overlap the ends of a sequence.

We can again use the Stirling formula (3.2) to approximate the Γ(·) functions as

well as log[B1,1(|A|, L− |A|)] so that we have

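\[
\psi'_{\mathrm{stir}}(A) = |A| \left[ \mathrm{logit}(\hat{p}_0) - 1 + \sum_{j=1}^{w} \sum_{k} \hat\theta_{jk} \log \left( \frac{\hat\theta_{jk}}{\theta_{0k}} \right) \right] - \tfrac{3}{2} \, w \log(|A| + |\beta| - 1) \tag{3.10}
\]

where p̂0 = |A|/L is the estimated motif abundance ratio and θ̂jk is the same as in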

Section 3.1. Removing the same additional penalty term as noted in Section 3.1,

we have our entropy scoring function with a variable motif abundance p0,

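\[
\psi'_{\mathrm{ent}}(A) = |A| \left[ \mathrm{logit}(\hat{p}_0) - 1 + \sum_{j=1}^{w} \sum_{k=1}^{4} \hat\theta_{jk} \log \frac{\hat\theta_{jk}}{\theta_{0k}} \right] \tag{3.11}
\]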

Despite their more complicated mathematical form, these scoring functions are

easy to compute for any A and can be easily implemented in a BioOptimizer

program.

3.4.3 Unknown Motif Width

We can also consider extending our model to allow the width of our unknown

motif to vary. This extension is useful since, in real datasets, there is often very

little known about the motif width a priori. Current motif-finding programs such

as BioProspector, Consensus or AlignAce force the user to input a motif width

that is fixed for the entire run of the program. Meme is the lone exception that

allows the motif width to vary.

We can instead let the motif width w be a random variable that has a prior distri-

bution p(w), which will give us several extra terms in the scoring function (3.9)


for our exact log-posterior density,

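\[
\begin{aligned}
\psi_{\mathrm{exact}}(A, w) &= \log p(w) + \log B_{1,1}(|A|, L - |A|) + \sum_{k} N_{0k} \log \theta_{0k} \\
&\quad - w \log \left[ \frac{\Gamma(|A| + |\beta|)}{\Gamma(|\beta|)} \right] + \sum_{j=1}^{w} \sum_{k} \log \left[ \frac{\Gamma(N_{jk} + \beta_{jk})}{\Gamma(\beta_{jk})} \right]
\end{aligned}
\tag{3.12}
\]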

This exact scoring function also has a corresponding Stirling approximation,

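\[
\begin{aligned}
\psi_{\mathrm{stir}}(A, w) &\approx \log p(w) + |A| \left[ \mathrm{logit}(\hat{p}_0) - 1 + \sum_{j=1}^{w} \sum_{k} \hat\theta_{jk} \log \left( \frac{\hat\theta_{jk}}{\theta_{0k}} \right) \right] \\
&\quad - \sum_{j=1}^{w} \sum_{k} \left( \beta_{jk} - \tfrac{1}{2} \right) \log \left[ \frac{\beta_{jk} - 1}{|\beta| - 1} \right] - \tfrac{3}{2} \, w \log \left[ \frac{|A| + |\beta| - 1}{|\beta| - 1} \right]
\end{aligned}
\tag{3.13}
\]

where again θ̂jk is the same as in Section 3.1.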

A natural prior distribution for w would be the Poisson(w0), where w0 represents

our a priori expectation for the motif width. One could also consider other prior

distributions for w, such as Geometric(w0) or Exponential(w0).

Note that these scores are not only a function of the set of predicted sites A but

also the motif width w, so we can now consider an optimization algorithm for

not only the set of predicted sites, but also the motif width.

In addition to the Metropolis algorithm for optimizing A presented in Section

3.3, we can now also propose small changes to the motif width w and accept

these changes if they improve the score. Specifically, we propose either adding

(w′ = w + 1) or deleting (w′ = w − 1) a position from the current motif and check whether

such a change increases the score, i.e., we accept the change only if

ψ(A, w′) − ψ(A, w) > 0

where again ψ can be any scoring function. Our BioOptimizer software uses

the exact scoring function (3.12). The above procedure and the usual procedure

for optimizing A are iteratively repeated until no further changes to A or w are

accepted, at which point we consider A and w to be “optimized.”
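The width update can be sketched as a one-dimensional hill climb for a fixed A. Several things here are assumptions made for illustration: the callable `score_for_width` stands in for ψ(A, w) with A held fixed, the bounds `w_min` and `w_max` are hypothetical, the better of the two neighbouring widths is tried first, and the choice of which end of the motif gains or loses a position is ignored.

```python
def optimize_width(score_for_width, w, w_min, w_max):
    """Move w by one position at a time while the score strictly improves."""
    current = score_for_width(w)
    while True:
        candidates = [c for c in (w - 1, w + 1) if w_min <= c <= w_max]
        scored = [(score_for_width(c), c) for c in candidates]
        if not scored:
            return w, current
        best_score, best_w = max(scored)
        if best_score <= current:              # accept only if psi(A, w') - psi(A, w) > 0
            return w, current
        w, current = best_w, best_score

# Toy score peaked at width 10, used only to show the mechanics.
def toy(width):
    return -(width - 10) ** 2

print(optimize_width(toy, 6, 5, 20))
```

In the full procedure this step alternates with the scan over A until neither accepts a change.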


3.4.4 Two-Block Motifs

As mentioned in Section 2.6.3, our model can be extended to motifs that consist of

two contiguous blocks separated by a gap of non-conserved nucleotides that can

vary in length. Of the current motif-finding programs introduced in Chapter 1,

only BioProspector has the flexibility to find two-block motifs with variable gap,

but with the usual restriction that the width of each block must be fixed and

known.

In Section 2.6.3, we presented a model that not only allows the gaps to vary, but

also the widths of each block as well. The exact scoring function for the posterior

distribution (2.10) is

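\[
\begin{aligned}
\psi_{\mathrm{exact}}(A, G, w) &= \log p(A, G, w_1, w_2 \mid S, \theta_0, B) \\
&= \log p(w_1) + \log p(w_2) + \log B_{1,1}(|A|, L - |A|) + \sum_{k=1}^{4} N_{0k} \log \theta_{0k} \\
&\quad + \sum_{j=1}^{w_1 + w_2} \log \left[ \frac{\Gamma(|\beta_j|)}{\prod_{k=1}^{4} \Gamma(\beta_{jk})} \cdot \frac{\prod_{k=1}^{4} \Gamma(N_{jk} + \beta_{jk})}{\Gamma(|A| + |\beta_j|)} \right]
\end{aligned}
\tag{3.14}
\]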

with the implicit restriction that each gi lies within the interval [G1, G2].

This exact score (3.14) is a function of the predicted sites A, the gaps G between

the two blocks at each site, and the two block widths w1 and w2. We have im-

plemented a two-block version of BioOptimizer that optimizes not only the set of

predicted sites A, but also the gap lengths G (within pre-specified limits [G1, G2])

as well as the two block widths w1 and w2.


3.5 Detecting Poor Motifs with the Null Score

A disadvantage of motif-finding programs such as BioProspector, Consensus,

AlignAce or Meme is that they will always output a predicted motif, even when

a real motif is not present. As mentioned earlier in the chapter, the scoring

function formulation gives us a principled method by which to compare differ-

ent sets of predicted sites. We can utilize this same benefit to provide a diagnostic

for very poor motif signals found by the usual motif-finding programs.

In addition to comparing different motifs based on their BioOptimizer score, we

can also compare these motifs to the BioOptimizer score generated by a matrix

A containing no predicted sites, i.e., Aij = 0 for all i and j. This “null score”

serves as a minimal criterion for any predicted motif. We should not be confident

about any predicted motifs that have a score lower than the “null score”, since

this essentially means the motif signal has less posterior value than no signal at

all.

It is often observed, such as in our simulation study in Section 4.3, that BioOptimizer

will converge to a null motif with no sites when given a poor starting

point, which implies that the starting point definitely had a lower score than the

null score.

In the case where BioOptimizer does not converge to a motif with no sites, the

null score is still calculated for comparison, since the final BioOptimizer score

may still be less than the null score if the algorithm cannot reach the null

configuration because it is stuck in an inferior local mode. Since BioOptimizer also allows the

motif width to vary, the null score is calculated based on a motif of width w0, the

a priori expected motif width.
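As a rough sketch, the null score can be obtained by evaluating (3.12) at an empty configuration: the motif terms vanish, every nucleotide counts as background, and log B1,1(0, L) reduces to −log(L + 1). The example assumes a Poisson(w0) prior on the width and sequences written over the alphabet A, C, G, T; the function names are hypothetical and this is not the BioOptimizer implementation.

```python
from math import lgamma, log

def null_score(sequences, theta0, w0):
    """Score (3.12) for a configuration with no sites and width w0."""
    n_total = sum(len(s) for s in sequences)               # N, total nucleotides
    L = n_total - (w0 - 1) * len(sequences)                # possible site positions
    log_prior_w = w0 * log(w0) - w0 - lgamma(w0 + 1)       # log Poisson(w0) pmf at w = w0
    log_beta_term = -log(L + 1)                            # log B_{1,1}(0, L)
    background = sum(log(theta0[base]) for s in sequences for base in s)
    return log_prior_w + log_beta_term + background

def beats_null(motif_score, sequences, theta0, w0):
    """Minimal credibility check: the motif must score above the null score."""
    return motif_score > null_score(sequences, theta0, w0)

theta0 = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
print(null_score(["ACGT"], theta0, w0=2))
```

The comparison is only meaningful when the motif score is computed from the same formula with the same constants dropped.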


Chapter 4

Motif Discovery Results

4.1 Simulation Comparison of Scoring Functions

In Sections 3.1 and 3.2, we outlined a few scoring functions that could be used in

a motif-finding algorithm: the exact log-posterior as in (3.1), its Stirling approxi-

mation as in (3.3), its entropy approximation as in (3.4), the scoring function (3.6)

used by MDscan (Liu et al., 2002), and the information-content function (3.5)

used by Consensus. We designed the following simulation study to investigate

the relative ability of each scoring function to find unknown motif sites under

various sequence conditions.

Since ψinfo is only suitable for the case in which the number of sites is known, we

only compared the effectiveness of the first four scoring functions. We include

the MDscan scoring function here since we are interested in evaluating its perfor-

mance against the other scoring functions, though it is not an approximation to a

posterior distribution.

In order to investigate specifically the ability of each scoring function to improve

motif site prediction, our scoring functions for this simulation study assumed a


known motif width fixed at the true value.

Each simulated dataset consisted of 20 sequences of 200 bps each, with each sequence containing exactly one true motif site. Two hundred datasets were generated under each combination of the following conditions:

1. Width of motif: short (8 base pairs) or long (16 base pairs)

2. Degree of motif conservation: high (91%) or low (70%)

High conservation means that each column of the true motif matrix had a dom-

inant nucleotide with 91% probability (all others 3% equally). Low conservation

means that each motif position had a dominant nucleotide with 70% probability

(all others 10% equally). These values of 91% and 70% were chosen somewhat

arbitrarily, but are reasonable when compared to discovered motifs in bacteria

(Sections 4.4-4.5). The number of simulated datasets was limited to 200 due to

the time required by BioProspector to discover motifs in each dataset.
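The simulation design above can be sketched as follows. This is a minimal illustration, not the code used for the thesis: the function names are hypothetical, and the background nucleotides are assumed to be independent and uniformly distributed, which the text does not specify.

```python
import random

NUCLEOTIDES = "ACGT"


def make_motif_matrix(width, conservation, rng):
    """Return one probability row (a dict) per motif position.

    Each position gets a randomly chosen dominant nucleotide with
    probability `conservation`; the remaining mass is split equally
    among the other three nucleotides (e.g. 0.91 and 0.03 each).
    """
    other = (1.0 - conservation) / 3.0
    matrix = []
    for _ in range(width):
        dominant = rng.choice(NUCLEOTIDES)
        matrix.append({n: (conservation if n == dominant else other)
                       for n in NUCLEOTIDES})
    return matrix


def simulate_dataset(width, conservation, n_seqs=20, seq_len=200, seed=0):
    """Simulate sequences that each contain exactly one planted motif site.

    Assumption (not stated in the text): background nucleotides are
    i.i.d. uniform. Returns (sequences, start_positions, motif_matrix).
    """
    rng = random.Random(seed)
    matrix = make_motif_matrix(width, conservation, rng)
    sequences, starts = [], []
    for _ in range(n_seqs):
        letters = [rng.choice(NUCLEOTIDES) for _ in range(seq_len)]
        start = rng.randrange(seq_len - width + 1)
        # Overwrite the background with one site drawn from the motif matrix.
        for j, column in enumerate(matrix):
            weights = [column[n] for n in NUCLEOTIDES]
            letters[start + j] = rng.choices(NUCLEOTIDES, weights=weights)[0]
        sequences.append("".join(letters))
        starts.append(start)
    return sequences, starts, matrix
```

Calling `simulate_dataset(8, 0.91)` would correspond to the short, highly conserved condition, and `simulate_dataset(16, 0.70)` to the long, weakly conserved one.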

We also compared the effects of the prior distribution on Θ by using two different

sizes of pseudo-counts, βjk = 2 vs. βjk = 1.1. This comparison will affect the three

scoring functions derived from our complete Bayesian model, but will not affect

ψmd since no prior distribution was involved in its derivation.

We tested the effect of the scoring function optimization strategy for improv-

ing the results from BioProspector. BioProspector was run on each dataset and

the best motif result was retained. We then applied our optimization algorithm,

based upon each of the four scoring functions mentioned above, to this best Bio-

Prospector result. The motif result from each optimization algorithm was also

retained after the optimization algorithm had converged.


Table 4.1 gives the accuracy of the results from algorithms using each of the four scoring functions. Accuracy is measured by two statistics: the percentage of correct sites found, and how closely the motif consensus found matches the true motif consensus. Accuracy of Predicted Sites is the percentage of true sites found in each simulated dataset, averaged over all simulated datasets, where a predicted site shifted by up to 3 base pairs from the true site is counted as correct. Consensus Match is the percentage of datasets where the consensus found matches the true consensus (up to 2 mismatched/shifted letters allowed when w = 8 and 4 allowed when w = 16). The average number of predicted sites is given in parentheses in the table.
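The site-accuracy statistic can be illustrated with a short sketch. The function name and the (sequence index, start position) representation are assumptions made for this example; in particular, the sketch does not enforce a one-to-one matching between true and predicted sites, which the text does not discuss.

```python
def site_accuracy(true_sites, predicted_sites, max_shift=3):
    """Percentage of true sites matched by a prediction in the same sequence.

    Both arguments are iterables of (sequence_index, start_position).
    A true site counts as found if some predicted start in the same
    sequence lies within `max_shift` positions of it.
    """
    true_sites = list(true_sites)
    if not true_sites:
        raise ValueError("at least one true site is required")
    # Group predicted start positions by sequence for quick lookup.
    predicted_by_seq = {}
    for seq, start in predicted_sites:
        predicted_by_seq.setdefault(seq, []).append(start)
    found = sum(
        any(abs(start - p) <= max_shift for p in predicted_by_seq.get(seq, []))
        for seq, start in true_sites
    )
    return 100.0 * found / len(true_sites)
```

Averaging this value over all simulated datasets for a given condition would give one entry of the kind reported in the table.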

Table 4.1: Simulation comparison of scoring function optimizations

Accuracy of Predicted Sites (Average |A|)

Prior    Motif   Conser-   BioProspector   Scoring Function Optimization
Counts   Width   vation    Results         Exact      Stirling   Entropy    MDscan
1.1      8       91        79 (18)         80 (18)    81 (19)    81 (20)    80 (18)
2        8       91        79 (18)         80 (18)    80 (18)    67 (15)    80 (18)
1.1      8       70         9 (15)          8 (8)     10 (11)     3 (2)     12 (19)
2        8       70         9 (15)          1 (0)      1 (0)      0 (0)     12 (19)
1.1      16      91        85 (17)         91 (19)    91 (20)    91 (23)    80 (16)
2        16      91        84 (17)         91 (20)    91 (20)    91 (24)    80 (16)
1.1      16      70        41 (11)         51 (14)    59 (17)    62 (20)    43 (11)
2        16      70        41 (11)         51 (13)    54 (14)    41 (10)    43 (11)

Consensus Match (Average |A|)

Prior    Motif   Conser-   BioProspector   Scoring Function Optimization
Counts   Width   vation    Results         Exact      Stirling   Entropy    MDscan
1.1      8       91         98 (18)         98 (18)    98 (19)    98 (20)    98 (18)
2        8       91         98 (18)         98 (18)    98 (18)    82 (15)    98 (18)
1.1      8       70         22 (15)         18 (8)     22 (11)    10 (2)     26 (19)
2        8       70         22 (15)          6 (0)      6 (0)      2 (0)     26 (19)
1.1      16      91        100 (17)        100 (19)   100 (20)   100 (23)   100 (16)
2        16      91        100 (17)        100 (20)   100 (20)   100 (24)   100 (16)
1.1      16      70         86 (11)         88 (14)    90 (17)    88 (20)    88 (11)
2        16      70         86 (11)         86 (13)    88 (14)    62 (10)    88 (11)

The first conclusion we can reach is that the optimization strategy seems to im-

prove the accuracy of the predicted sites in comparison with the BioProspector re-


sult. Regardless of motif width or conservation, the “accuracy of predicted sites”

is almost always higher for each scoring function compared to the BioProspector

output, except in the case of a short motif/low conservation, where no method

seems to work. Based on our “sample size” of 200, the simulation standard errors

for the percentages in Table 4.1 range from 1.5% (assuming a true percentage of

95%) up to 3.5% (assuming a true percentage of 50%).
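The quoted standard errors are consistent with the usual binomial formula sqrt(p(1 − p)/n) with n = 200. The short check below is only an illustration of that arithmetic; treating each dataset as a single binomial trial is an assumption of this sketch, and the function name is hypothetical.

```python
import math


def binomial_se_percent(true_percent, n_datasets=200):
    """Standard error, in percentage points, of an estimated percentage.

    Uses sqrt(p * (1 - p) / n) with p = true_percent / 100.
    """
    p = true_percent / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_datasets)


# The two extremes mentioned in the text: true percentages of 95% and 50%.
se_at_95 = binomial_se_percent(95)
se_at_50 = binomial_se_percent(50)
```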

The results are not as dramatic for the consensus match, suggesting that the scor-

ing function optimization is primarily refining the signal that has already been

found by the Gibbs sampling-based BioProspector. Thus, it seems that this op-

timization strategy has accomplished its intended goal of “cleaning up” the Bio-

Prospector output.

In general, the algorithms do not do nearly as well for low conservation as high

conservation, especially in the case of the shorter motif. This is partly due to

the fact that the optimization algorithm is deterministically restricted to stay in

the same local mode that the BioProspector output is stuck in, and so these algo-

rithms do not have the freedom to correct a poor starting point.

For the low conservation datasets, performance is much better for a longer mo-

tif than for a shorter motif, suggesting that a certain threshold of information is

needed for the Gibbs sampling algorithm BioProspector, and consequently our

optimization algorithm, to be successful. If conservation is reduced, one needs

a longer motif for the algorithms to do well. In the case of a short motif and

low conservation, extra information (such as prior information about the motif

locations or Θ) is clearly needed.

The exact, Stirling and entropy scoring functions display similar performance in

most situations, although the entropy scoring function appears to do noticeably


worse in some cases with the larger prior pseudo-counts and is in general most

affected by a change in prior pseudo-counts.

MDscan in general does not perform as well as the three Bayesian scores, except in the case where the signal is very weak (low conservation and short motif). This may be because, when the signal is really weak, the prior distributions used for the Bayesian scores swamp the signal so that it cannot be detected. This is also shown by the slightly improved performance in Table 4.1 when the prior pseudo-counts are smaller. However, in situations where prior information is actually available, the formal use of a prior distribution will allow us to incorporate that information properly.

Overall, these simulation results for the predicted sites suggest that there is al-

most always a benefit associated with using a deterministic optimization algo-

rithm to further improve the output from a stochastic algorithm such as Bio-

Prospector, and that this benefit seems generally to be the greatest when using

the exact scoring function or one of its approximations, in terms of a reason-

able number of predicted sites and the accuracy of those sites. The additional

computational cost of the optimization algorithm is small (≈ 2 minutes for each

simulated dataset).

4.2 Real Data Comparison of Scoring Functions

We examine the performance of our different scoring functions on a dataset consisting of 18 E. coli sequences that contain cyclic-AMP receptor protein (CRP)

binding sites. Each sequence is 105 bps long and each contains at least one 22 bps

motif site that has been experimentally determined via the footprinting method


(Lawrence and Reilly, 1990), for a total of 24 known sites in this dataset.

This dataset has been previously analyzed by Lawrence and Reilly (1990) using

an EM algorithm and Liu (1994) using a Gibbs sampler.

Similar to our strategy with the simulated datasets, we first used the program

BioProspector to find a set of initial motif sites, and then used our T = 0 optimization strategy with one of the four scoring functions to further improve the

BioProspector result. For the first three scoring functions, prior pseudo-counts of

βjk = 1.1 were used.

Table 4.2 shows the results from these optimization algorithms, in terms of the

consensus sequence for the motif, the number of sites predicted, and the number

of predicted sites that corresponded to one of the 24 experimentally established

(“correct”) positions of the CRP binding sites. Nucleotides in the consensus se-

quence are capitalized only if they are over 75% conserved in that position.
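The capitalization rule for consensus sequences can be sketched as below. The function is a hypothetical helper written for illustration; how ties between equally frequent nucleotides are resolved is not stated in the text, so the sketch simply takes whichever nucleotide `Counter.most_common` returns first.

```python
from collections import Counter


def consensus(sites, threshold=0.75):
    """Consensus of equal-length aligned sites.

    The most frequent nucleotide at each position is written in upper
    case when its frequency exceeds `threshold`, otherwise in lower case.
    """
    if not sites:
        raise ValueError("no sites given")
    width = len(sites[0])
    if any(len(s) != width for s in sites):
        raise ValueError("sites must have equal length")
    letters = []
    # zip(*sites) walks through the alignment one column at a time.
    for column in zip(*(s.upper() for s in sites)):
        base, count = Counter(column).most_common(1)[0]
        conserved = count / len(sites) > threshold
        letters.append(base if conserved else base.lower())
    return "".join(letters)
```

For example, with four aligned sites a nucleotide present in exactly three of them (75%) is not "over 75%" conserved and is therefore written in lower case.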

Table 4.2: Comparison of scoring function optimizations on the CRP dataset

Scoring Function   Consensus Sequence        Number of         Number of
                                             Predicted Sites   Correct Sites
BioProspector      ttaTtTgAtcgaggTCACActt     9                 9 / 24
Exact              ttaTgTgAacgagtTCACAttt    15                15 / 24
Stirling           tttTgTGAtcgagcTCACAttt    18                18 / 24
Entropy            taaTgTgAtcgaggTCACAttt    20                17 / 24
MDscan             ttaTgTGAacgaggTCACActt    11                11 / 24

These real data results are similar to the ones from our simulation study. For

each scoring function, the optimization algorithm improved upon the original

BioProspector signal in terms of the number of correct sites predicted.

As shown in Table 4.2, the consensus sequences of the motifs found by using dif-

ferent scoring functions are similar. The three scoring functions (exact, Stirling


and entropy) that are closely related to the complete Bayesian model seem to per-

form noticeably better than the MDscan score, with the Stirling scoring function

performing the best in this example.

As a comparison, the “true” motif based on the alignment of the 24 experimental

sites is displayed in Figure 4.1 in the form of a sequence logo. It is seen that the

differing positions of the 5 consensus sequences in Table 4.2 correspond to the

information-weak or ambiguous positions shown in the sequence logo.

Figure 4.1: Sequence logo of known CRP sites (information content in bits, on a 0 to 2 scale, at each of the 22 motif positions)

4.3 Simulation Comparison of Motif-Finding Programs

The results of the previous section suggest that our scoring function optimiza-

tion procedure improves prediction accuracy in simulated datasets, and that the

best scoring functions in this regard were the ones derived from our exact log-

posterior distribution (2.4) or a close approximation.

However, the results of the previous section were all based upon comparisons with a single motif-finding program, BioProspector, and so an additional simulation study was undertaken to assess whether our optimization program BioOptimizer improves motif site prediction over each of four current motif-finding programs: BioProspector, Consensus, AlignAce and Meme.

Again, since this simulation study was designed to specifically examine only


the motif site prediction of BioOptimizer relative to the other motif-finding pro-

grams, we used a version of BioOptimizer based on the exact scoring function

(3.1) for a known motif width, fixed at the true value.

In this simulation study, two hundred sequence datasets were generated under

each combination of several conditions:

1. Number of sequences: small (20 sequences) or large (100 sequences)

2. Width of motif: short (8 base pairs) or long (16 base pairs)

3. Degree of motif conservation: high or low

In each dataset, a true motif site was placed in each sequence. As in the previous study, high conservation means that each column of the true motif matrix had a dominant nucleotide with 91% probability (all others 3% equally), while low conservation means that each motif position had a dominant nucleotide with 70% probability (all others 10% equally). The number of simulated datasets was again limited to 200 due to the time required by each of the motif-finding programs.

For each simulated dataset, we applied the motif-finding programs BioProspec-

tor, Consensus, AlignAce and Meme, and compared the results in terms of pre-

dicted sites to the true site locations. BioOptimizer was then applied separately

to each set of BioProspector, Consensus, AlignAce, and Meme results, and the

optimized results were also compared to the true site locations.

We compared the performances (Table 4.3) of these algorithms in terms of accu-

racy of predicted sites, which again is the percentage of true sites found (shifting

of up to 3 bps again allowed) in each simulated dataset averaged over all simu-


lated datasets, and total number of predicted sites (|A|).

Table 4.3: Simulation comparison of motif-finding programs

                                                Average % of true sites
Motif   Conser-   True #     First-Pass         found (Average |A|)
Width   vation    of Sites   Program            First-Pass   BioOptimizer
short   high      20         AlignACE           59 (17)      64 (15)
short   high      20         BioProspector      79 (18)      81 (19)
short   high      20         Consensus          79 (17)      81 (18)
short   high      20         MEME               81 (18)      81 (18)
short   high      100        AlignACE           50 (55)      70 (76)
short   high      100        BioProspector      68 (74)      81 (88)
short   high      100        Consensus          17 (24)      27 (30)
short   high      100        MEME               49 (50)      80 (87)
long    high      20         AlignACE           90 (18)      93 (19)
long    high      20         BioProspector      85 (17)      92 (19)
long    high      20         Consensus          90 (18)      92 (19)
long    high      20         MEME               91 (18)      92 (19)
long    high      100        AlignACE           89 (90)      91 (92)
long    high      100        BioProspector      85 (86)      91 (92)
long    high      100        Consensus          50 (50)      91 (92)
long    high      100        MEME               50 (50)      91 (92)
long    low       20         AlignACE           27 (14)      30 (8)
long    low       20         BioProspector      39 (11)      46 (12)
long    low       20         Consensus          37 (9)       44 (11)
long    low       20         MEME               45 (11)      48 (12)
long    low       100        AlignACE           34 (44)      44 (48)
long    low       100        BioProspector      38 (41)      54 (59)
long    low       100        Consensus          45 (48)      54 (59)
long    low       100        MEME               45 (48)      54 (58)

As shown in Table 4.3, this simulation study demonstrates again that BioOptimizer matched or improved the accuracy of motif site prediction relative to AlignAce, BioProspector, Consensus, and Meme alone for every combination of motif length, conservation level, and number of sequences shown in the table. As in Section 4.1, the simulation standard errors for the percentages in Table 4.3 range from around 1.5% up to 3.5%, which are generally quite small compared to the differences in percentage between BioOptimizer and each first-pass program. The number of predicted sites is also generally closer to the truth for BioOptimizer than for any of the motif-finding programs alone.

In addition to a clear gain in accuracy from using BioOptimizer, it is also worth

noting that the accuracy seems to be generally best when using BioProspector or

Meme as a starting point compared to Consensus and AlignAce.

For the cases with short motifs and low conservation, the performance of all motif-finding programs was very poor. In most of these cases, none of the first-pass algorithms was able to detect the true motif signal, and BioOptimizer did not improve upon these results.

In many of these weak signal cases, it was observed that the BioOptimizer algo-

rithm would start from the incorrect signal (based completely on false positive

motif sites) found by a first-pass algorithm and converge to a motif configuration

A with no sites. This may be an added benefit of BioOptimizer over other motif-

finding programs in that BioOptimizer will not tend to give the false impression

of a real motif signal when in fact the correct motif signal has not been found.

4.4 Real Data BioOptimizer Evaluation: One-Block

We examined two sequence datasets, each of which contains a one-block tran-

scription factor binding motif. The first dataset is for the transcription factor

Spo0A in the bacterium B. subtilis. The Spo0A sequence dataset consists of the

200 bp upstream regions of 70 genes that showed preferential hybridization to

the Spo0A protein in Chromatin Immuno-Precipitation experiments (Molle et al.,

2003). We have 20 Spo0A binding sites that have been confirmed experimentally

and can be used to validate our strategy.


There is some prior information about the Spo0A binding motif. The literature

consensus (Strauch et al., 1990) is thought to be a 7-mer, although the true width

of the motif has not been firmly established. Also, it is not known whether or

not the orientation of the bound protein (relative to the gene) is relevant, so we

need to look for sites in both the forward (5′ → 3′) and the reverse complement

strands.

The second dataset is the same CRP dataset used in Section 4.2. This dataset has

been previously analyzed by Stormo and Hartzell (1989), Lawrence and Reilly

(1990) and Liu (1994). Their analyses focused on detecting sites for a motif with a

width of 22 base pairs, which we will use as our prior expectation w0, but we will

let the true width be inferred by the algorithm.

As outlined in Section 3.3, our basic strategy is to use a current motif-finding

program, such as AlignAce, BioProspector, Consensus, or Meme to find a good

configuration of motif start sites A, and then use our optimization program BioOp-

timizer to improve the score of the motif.

When the exact scoring function is used, BioOptimizer is essentially trying to im-

prove the fit of the predicted sites to our Bayesian motif discovery model. Since

the performances of these motif-finding programs vary with datasets, BioOpti-

mizer has the advantage of being able to build upon motif results from all of

these different first-pass programs.

In most cases, the motif width is not known a priori but must be fixed when using

a first-pass program such as BioProspector, Consensus, or AlignAce. Our strategy

is to collect the motif results from each first-pass program using each of several

different motif widths, and then apply our optimization program BioOptimizer

to each result separately. BioOptimizer will then optimize each motif result with


respect to both the predicted sites and the unknown motif width, as well as pro-

viding an optimal score for each motif result that can be used to compare between

motif results. The “best” motif would then be the motif result with the greatest

BioOptimizer score.
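The selection strategy described above can be outlined in a few lines. This is only a schematic sketch: `optimize` is a caller-supplied placeholder standing in for a BioOptimizer run, and neither its signature nor the function name below corresponds to any actual program interface.

```python
def best_optimized_motif(first_pass_results, optimize):
    """Optimize every first-pass motif and return the top-scoring result.

    `first_pass_results` is an iterable of starting motifs (any objects),
    e.g. the top motifs collected at each fixed width.
    `optimize` is a hypothetical function mapping a starting motif to a
    (score, optimized_motif) pair.
    Returns (best_score, best_motif, starting_motif).
    """
    best = None
    for start in first_pass_results:
        score, motif = optimize(start)
        # Keep the result with the greatest score seen so far.
        if best is None or score > best[0]:
            best = (score, motif, start)
    if best is None:
        raise ValueError("no first-pass results supplied")
    return best
```

Because every optimized result carries a score on the same scale, motifs obtained from different first-pass programs and different starting widths can be ranked together.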

For the spo0A dataset, we ran BioProspector separately for motif widths varying

from 7 to 12 bps, each time collecting the top 5 motif predictions. We also ran

Consensus and AlignAce for each of these motif widths and collected the top 5

motif results. For the CRP dataset, we collected the top 5 motif results from Bio-

Prospector, Consensus, and AlignAce for each fixed motif width between 20 and

24 bps.

BioOptimizer was then applied to each of these motif results, giving us a total

of 30 optimized spo0A motifs (6 widths × top 5) and 25 optimized CRP motifs

(5 widths × top 5) for each of our first-pass programs BioProspector, Consensus

and AlignAce. Meme has the built-in capability to try different motif widths, so

we collected the top 5 motifs from Meme directly.

Table 4.4 shows the BioOptimizer results with the best score from each first-pass

program for both datasets. Motif predictions from the first-pass programs (used

as BioOptimizer input) are also shown. In addition to the motif width w, con-

sensus sequence and number of predicted sites |A|, we also provide “% True”,

which is the percentage of experimentally-confirmed sites in each dataset that

was found by each algorithm.

We see from the table that the identical optimal CRP motif resulted from three

different starting configurations in terms of both motif width and actual binding

sites. However, as noted in Section 3.3, different starting points are not

guaranteed to converge to the exact same optimal configuration, as we see in the

Spo0A results where very different starting configurations led to very similar but not identical optimal motifs.

Table 4.4: Comparison of motif predictions for one-block datasets

        # of   Results from First-Pass Program   Best BioOptimizer Result
TF      Seqs   Program         w    |A|  % True   w    |A|  % True   Consensus
spo0A   70     BioProspector   12   40   35       12   50   60       TTTGTCGAAaaa
               Consensus       11   38   50       11   47   50       TTTGTCgAAaa
               AlignAce         9   28   30       12   49   60       tTTGTCGAAaaa
               Meme            15   38   35       14   50   55       TTTGTCGAAaaatg
CRP     18     BioProspector   22   11   43       24   13   57       AtttaTgTGAtcgaggTCACActt
               Consensus       24   13   57       24   13   57       AtttaTgTGAtcgaggTCACActt
               AlignACE        24   10   43       24   13   57       AtttaTgTGAtcgaggTCACActt
               Meme            20   18   70       19   18   70       TGTgAacgagttCACAttt

In general, BioOptimizer leads to more consistent results even when started from

BioProspector, AlignAce, Meme, or Consensus results that differ in both motif

width and consensus sequence. This is a reassuring result, since there are many

cases in practice where little is known a priori about a binding motif, including its

width.

For both datasets, the optimal motif width seems to be longer than our prior

expectations. It also appears that the binding motif of CRP actually consists of

two highly-conserved blocks with a gap of less-conserved nucleotides, so we will

revisit the CRP dataset in Section 4.5 for our two-block motif results.

Upon examination of the Spo0A results, we see that the number of predicted sites

is generally close to the number of sequences, but there are some sequences in

both of these datasets that do not have site predictions. This is a natural conse-

quence of our model, since we might expect several false-positive sequences in

our dataset, as mentioned in Section 2.4.

For both datasets, the use of BioOptimizer increased the proportion of true sites

found compared to the motif results from one of the first-pass programs alone,


suggesting that BioOptimizer has improved the accuracy of the motif results for

both CRP and Spo0A.

4.5 Real Data BioOptimizer Evaluation: Two-Block

We examined datasets for four two-block transcription factors, σE, σF, σH and σK, in the bacterium B. subtilis. Given the results in the previous section, we also re-examined the CRP dataset to see if we could find the CRP binding motif in two short blocks instead of one long block.

Microarray experiments (Eichenberger et al., 2003) comparing wild-type B. subtilis

cells to cells where the gene for σE had been inactivated and to cells where σE

was overexpressed were used to identify 155 transcriptional units (operons) as

direct targets of the σE binding protein. Our σE dataset consisted of the 200 bp

upstream regions from these 155 transcriptional units. Our σF dataset (S. Wang,

P. Eichenberger and R. Losick, personal communication), σH dataset (Britton et al.,

2002), and σK dataset (P. Eichenberger and R. Losick, personal communication),

consisted of 38, 46 and 76 upstream regions respectively, each found by a similar

set of experiments.

Some prior information is available for each of these two-block binding motifs.

Helmann and Moran Jr. (2002) give the consensus of σE as ATa (block 1) and

cATAcanT (block 2) with a gap of 16-18 bps, the consensus of σF binding mo-

tif as GywTA and GgnrAnAnTw with a gap of 15 bps, the consensus of σH as

RnAGGAawWW and RnnGAAT with a gap of 11-12 bps, and the consensus of σK

as AC and CATAnnnT with a gap of 16-18 bps.

Since neither AlignAce, Consensus, nor Meme can be used to find a two-block


motif, we used only BioProspector as a first-pass program.

For each σ dataset, BioProspector was used to find good starting configurations

under a variety of fixed block widths ranging from 5 to 9 base pairs, and several

different gap ranges (11-13 bps, 12-14 bps, 13-15 bps), resulting in 75 predicted

motifs (the top 5 motifs for each of 5 different widths and 3 different gap ranges).

For the CRP dataset, we specified fixed block widths from 5 to 7 bps with shorter

gap ranges (4-6 bps, 5-7 bps, 6-8 bps), so 45 motifs in total were predicted by

BioProspector.
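The counts of 75 and 45 starting motifs follow from enumerating every combination of block width, gap range and motif rank. The sketch below reproduces that bookkeeping only; the function name is hypothetical, and it assumes that both blocks share the same fixed width in a given run, which is how the counts in the text work out.

```python
from itertools import product


def run_grid(block_widths, gap_ranges, top_n=5):
    """List every (block_width, gap_range, rank) combination to collect.

    One first-pass run is made per (block_width, gap_range) pair, and the
    `top_n` best motifs (ranks 1..top_n) are kept from each run.
    """
    ranks = range(1, top_n + 1)
    return list(product(block_widths, gap_ranges, ranks))


# Sigma-factor datasets: block widths 5 to 9 and three gap ranges.
sigma_runs = run_grid(range(5, 10), [(11, 13), (12, 14), (13, 15)])

# CRP dataset: block widths 5 to 7 and three shorter gap ranges.
crp_runs = run_grid(range(5, 8), [(4, 6), (5, 7), (6, 8)])
```

Here `len(sigma_runs)` and `len(crp_runs)` give the number of starting motifs collected per dataset.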

BioOptimizer was used to optimize and choose the best motif among all the BioProspector motif predictions for a dataset. For all five datasets, we used a prior expected width of 7 bps for both blocks in the BioOptimizer runs.

These best BioOptimizer motifs are shown in Table 4.5, along with the BioProspec-

tor motif result that served as their starting point. The consensus and the number

of predicted sites are given along with the dimension attribute “Dim” of the mo-

tif, defined as w1 − (gap range) − w2.

Just as in our one-block motif results, each of these two-block datasets has a cer-

tain number (“# True Sites”) of experimentally verified sites which were used to

validate our motif prediction accuracy. The column “% True” indicates the percentage of experimentally-confirmed sites found by that motif result.

The BioOptimizer motif results for all four σ datasets resemble their prior con-

sensus sequence. In all five datasets, the use of BioOptimizer increased the pro-

portion of confirmed sites found when compared with the BioProspector motif

result alone. This improvement in accuracy is especially dramatic in the larger

datasets of σE and σK as well as the CRP dataset.


Table 4.5: Comparison of motif predictions for two-block datasets

      # of   # True   Results from BioProspector    Best BioOptimizer Result
TF    Seqs   Sites    Dim           |A|   % True    Dim             |A|   % True   Consensus
σE    155    59       8-(11-13)-8   106   46        11-(10-12)-11   145   80       ttgtcaTattt-ttcATAtaatg
σF    38     11       9-(11-13)-9    25   64        7-(10-12)-11     38   91       GtaTaaa-tGgcaAtAcTa
σH    46     19       7-(13-15)-7    39   68        6-(13-15)-8      80   74       aaAGGa-tagaGAAt
σK    76     35       7-(13-15)-7    58   17        5-(14-16)-11     58   57       gcACa-gcATAtgaTaa
CRP   18     23       6-(6-8)-6      17   70        5-(7-9)-7        27   91       tGTcA-CAcattt

It is also worth noting that the accuracy for the CRP dataset was improved in the

two-block analysis relative to our results in Section 4.4.

In addition to this improved accuracy, BioOptimizer also has the important fea-

ture that the motif width is treated as an unknown quantity that can vary. In the

datasets studied here, the optimal motif width found by BioOptimizer was often

substantially different from our a priori expectations.

4.6 Using Different Motif Width Prior Distributions

It is interesting to examine how different prior specifications for motif width w

affect the performance of BioOptimizer. The effect on BioOptimizer of using dif-

ferent prior distributions is a different functional form for the term log p(w) in our

variable-width exact scoring function (3.12).

We consider three different prior distributions, each with E(w) = w0:

1. w ∼ Poisson(w0): $\log p(w) = w \log(w_0) - w_0 - \log \Gamma(w + 1)$

2. w ∼ Exponential(w0): $\log p(w) = -\log(w_0) - w / w_0$

3. w ∼ Geometric(w0): $\log p(w) = -\log(w_0) + (w - 1) \log(1 - w_0^{-1})$
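For concreteness, the three log-prior terms can be written as small functions. This is an illustrative sketch with hypothetical function names; each function follows the corresponding formula in the list above, with w0 the prior mean, and the geometric case requires w0 > 1.

```python
import math


def log_prior_poisson(w, w0):
    """log p(w) for w ~ Poisson with mean w0."""
    return w * math.log(w0) - w0 - math.lgamma(w + 1)


def log_prior_exponential(w, w0):
    """log p(w) for w ~ Exponential with mean w0."""
    return -math.log(w0) - w / w0


def log_prior_geometric(w, w0):
    """log p(w) for w ~ Geometric on {1, 2, ...} with mean w0 (w0 > 1)."""
    return -math.log(w0) + (w - 1) * math.log(1.0 - 1.0 / w0)
```

Any one of these could be supplied as the log p(w) term when scoring a candidate width, with w0 set to the prior expected width.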

BioOptimizer locally optimizes the motif width by proposing changes of the form

w′ = w + 1 and w′ = w − 1, i.e., adding or removing a position from one of the ends

of the motif. We can examine how the use of these different prior distributions

penalizes the addition of columns to our motif matrix.

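$\psi_{\mathrm{Pois}}(w + 1) - \psi_{\mathrm{Pois}}(w) = \log\left(w_0 / (w + 1)\right)$

$\psi_{\mathrm{Exp}}(w + 1) - \psi_{\mathrm{Exp}}(w) = -w_0^{-1}$

$\psi_{\mathrm{Geo}}(w + 1) - \psi_{\mathrm{Geo}}(w) = \log(1 - w_0^{-1})$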

There are several things worth noting based upon these functional forms. The first is that as w0 becomes larger, the exponential and geometric penalty terms become very similar and small, since log(1 − x) ≈ −x as x → 0. Figure 4.2 below shows the behaviour of the exponential and geometric penalties for smaller w0.

The second thing to note is that only the Poisson penalty term involves both the

expected motif width w0 and the current motif width w. Figure 4.2 also shows

the contour plot for the Poisson penalty term as a function of both w and w0. Not

surprisingly, the penalty for increasing w is largest when w is already much larger

than w0.
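A quick numerical check makes both observations concrete. The sketch below is illustrative only, with hypothetical function names: the exponential and geometric penalties use the closed forms from the text, while the Poisson penalty is computed directly as the difference of Poisson log-probabilities at w + 1 and w.

```python
import math


def penalty_exponential(w0):
    """Change in the exponential log-prior when one column is added."""
    return -1.0 / w0


def penalty_geometric(w0):
    """Change in the geometric log-prior when one column is added (w0 > 1)."""
    return math.log(1.0 - 1.0 / w0)


def penalty_poisson(w, w0):
    """log p(w + 1) - log p(w) under the Poisson(w0) prior."""
    def log_p(k):
        return k * math.log(w0) - w0 - math.lgamma(k + 1)
    return log_p(w + 1) - log_p(w)


# The exponential and geometric penalties approach each other as w0 grows.
for w0 in (1.1, 2, 7, 22):
    print(w0, round(penalty_exponential(w0), 3), round(penalty_geometric(w0), 3))

# The Poisson penalty depends on the current width w as well as on w0.
for w in (5, 15, 25):
    print(w, round(penalty_poisson(w, 7), 3))
```

The Poisson term is negative (a penalty) once w exceeds w0 and grows in magnitude as w moves further above w0, whereas the other two penalties are constant in w.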

We examined the performance of BioOptimizer on two one-block motif datasets,

one with a long true motif (CRP) and one with a short true motif (spo0A), using

the three different prior distributions. For each prior distribution, we also exam-

ined the use of three different values of w0. We used our a priori expected motif

widths (w0 = 22 for CRP, w0 = 7 for spo0A), as well as two small values w0 = 2

and w0 = 1.1 that were intended to provide extra penalty to the addition of more

columns.

Figure 4.2: Comparison of different prior width penalty terms (panels: “Exp and Geo Penalty”, plotted against w0, and “Poisson Penalty”, a contour plot over w0 and w)

We used BioOptimizer under each of these conditions (3 different priors × 3 different w0) separately on several BioProspector starting points (25 for CRP, 30 for spo0A), and the average change in the optimal motif width w made by BioOptimizer for each of these conditions is given in Table 4.6.

Table 4.6: Performance of different motif width priors

CRP dataset                           Spo0A dataset
Prior     w0    Final w - Start w     Prior     w0    Final w - Start w
exp       22     0.00                 exp       7      0.20
geo       22     0.00                 geo       7      0.20
poisson   22    -0.03                 poisson   7      0.13
exp       2     -0.33                 exp       2      0.07
geo       2     -0.77                 geo       2      0.03
poisson   2     -2.30                 poisson   2     -0.23
exp       1.1   -0.90                 exp       1.1   -0.03
geo       1.1   -2.30                 geo       1.1   -0.57
poisson   1.1   -2.80                 poisson   1.1   -0.43

When using reasonable, literature-based expected motif widths of w0 = 7 for

spo0A and w0 = 22 for CRP, we observe almost no difference between the per-

formance of BioOptimizer when using the three different prior distributions. The

priors show greater differences in performance when using a small w0, which is


not surprising given the results in Figure 4.2.

It is interesting to note that the Poisson prior generally gives smaller final motif widths than the other two prior distributions when w0 = 2 and w0 = 1.1, which implies that the penalty for adding a column is largest for the Poisson prior at these small values of w0.

4.7 Special Restrictions on A in Real Data

For some applications, special restrictions on the unknown matrix of site posi-

tions A can be built into our motif discovery model (and scoring function for-

mulation) to accommodate different levels of uncertainty about different parts of

a sequence dataset. An example is the search for the SpoIIID motif in Bacillus

subtilis. The consensus sequence for the SpoIIID motif was hypothesized by Halberg and Kroos (1994) to be aaggACAanc, based on 10 experimentally-confirmed

binding sites.

Two different types of microarray experiments were performed to compare gene expression between wild-type B. subtilis and B. subtilis with the SpoIIID protein removed (Eichenberger et al., 2004). In addition to the usual cDNA microarray experiment, which provides a list of genes that are potentially regulated by SpoIIID, another microarray experiment was performed using Chromatin-Immunoprecipitation (ChIP) technology, which provides much more certain evidence that SpoIIID has a binding site near particular genes.

Together, these experiments yield a combined list of 89 genes that are potentially

regulated by SpoIIID, but we are more certain about 40 of these genes due to the

additional ChIP experiments. This extra information needs to be incorporated

into our procedure for finding the SpoIIID motif in the upstream sequences of

these genes.

Our solution is a motif discovery model that is a compromise between the

restricted model of Section 2.3 and the unrestricted model of Section 2.4. Our

model forces some of the sequences (the ones identified by the ChIP experiment)

to contain at least one site, while the other sequences (from the usual microarray

experiment) are unrestricted. This model is implemented using a version of

BioOptimizer that restricts specific rows of the matrix A to always contain at least

one Aij = 1.
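
The row restriction can be read as a validity check on a candidate matrix A. The following is a minimal sketch and not the BioOptimizer implementation; the function name `satisfies_restriction`, the argument `chip_rows`, and the list-of-rows representation of A are assumptions made for illustration.

```python
def satisfies_restriction(A, chip_rows):
    """Return True if every restricted row of A contains at least one site.

    A         -- list of rows; A[i][j] == 1 if a site starts at position j of sequence i
    chip_rows -- indices of the sequences forced to contain a site (hypothetical input)
    """
    return all(any(a == 1 for a in A[i]) for i in chip_rows)


# Toy example with three sequences of four candidate positions each.
A = [
    [0, 1, 0, 0],  # ChIP sequence with one site
    [0, 0, 0, 0],  # expression-only sequence, allowed to be empty
    [1, 0, 0, 1],  # ChIP sequence with two sites
]
print(satisfies_restriction(A, chip_rows=[0, 2]))  # True
print(satisfies_restriction(A, chip_rows=[0, 1]))  # False
```

A search procedure could reject any proposed change to A for which this check fails, leaving the unrestricted rows free.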

However, BioProspector must still be used to find a good starting point for

BioOptimizer. BioProspector cannot be used to fit this compromise model, but can

be used to fit the restricted model where all sequences are forced to contain at

least one site, so BioProspector was used to find an initial motif using only the

ChIP sequences. BioOptimizer was then run on the full dataset using this Bio-

Prospector starting point. This procedure was repeated for a variety of motif

widths (w = 6, 7, . . . , 12), using the top five motifs found by BioProspector for

each width.

Seventeen experimentally-verified sites were available to validate our discovered

motifs, and the BioOptimizer motif with the highest proportion of known sites

predicted (9/17) was an 8 bp motif with consensus sequence gGACAaGc and a

total of 68 predicted sites in our 89 sequences. It should be noted that this motif

was not the motif with the highest BioOptimizer score. The BioOptimizer final

motif result with the highest score was a 12 bp motif with consensus sequence

ataaaAcAaGca, with 100 predicted sites across the 89 total sequences. This

“best” motif correctly predicted fewer experimentally-verified sites (7/17) and is

not as good a match to the consensus sequence postulated by Halberg and

Kroos (1994).

Chapter 5

Bayesian Motif Clustering Model

As mentioned in Chapter 1, applying the motif discovery procedures of Chapters 2-4

across several sequence datasets results in a collection of discovered motifs

with count matrices {N1, . . . , Nn}. Our focus is now to investigate

this collection for similarity between motifs based on their discovered count ma-

trices.

We use a Bayesian hierarchical model to infer common structure, in the form of

clusters, within our collection of motifs. The data for each discovered motif is

a count matrix Ni, which can differ in width and number of counts from the

matrices of other TF motifs. Our clustering will be based on motif matrices with a

fixed width w, so we assume each of these n raw motif matrices contains a

submatrix Yi, i = 1, . . . , n, of dimension w × 4 that will be considered the “central

motif” upon which the clustering will be based.

5.1 Hierarchical Framework

Hierarchical models are useful in a variety of scientific problems when the struc-

ture of the data suggests multiple levels of uncertainty. We want to include com-

ponents for both within-motif and between-motif variability of the nucleotide

counts Yijk where i indexes the motif, j indexes the w columns within each motif,

and k indexes the four possible nucleotides within each column.

Our model for the within-motif variability between different binding sites for a

count matrix Yi is the same product-multinomial model assumed for motif

discovery in Chapter 2. We assume that each position (column) of the count matrix

Yi follows an independent multinomial distribution parameterized by the same

column of an unknown frequency matrix Θi, i.e.

Within-motif level:

$$p(Y_i \mid \Theta_i) = \prod_{j=1}^{w} p(Y_{ij} \mid \theta_{ij})$$

$$Y_{ij} = (Y_{ija}, \ldots, Y_{ijt}) \sim \text{Multinomial}\big(n_i,\ \theta_{ij} = (\theta_{ija}, \ldots, \theta_{ijt})\big)$$
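
As a small numerical illustration of the product-multinomial form, the sketch below evaluates the logarithm of the product of θ raised to the observed counts for one motif. The function name `log_theta_power`, the row-per-position layout, and the toy numbers are assumptions for illustration; the multinomial coefficient, which does not involve Θ, is left out.

```python
import math


def log_theta_power(Y, theta):
    """log of prod_j prod_k theta[j][k] ** Y[j][k] (multinomial coefficient omitted).

    Y     -- list of w positions, each a list of counts ordered (a, c, g, t)
    theta -- list of w positions, each a list of four frequencies summing to one
    """
    total = 0.0
    for y_pos, t_pos in zip(Y, theta):
        for y, t in zip(y_pos, t_pos):
            if y > 0:  # convention: a zero count contributes nothing
                total += y * math.log(t)
    return total


# Two-position toy motif with made-up counts and frequencies.
Y = [[8, 1, 1, 0], [0, 0, 10, 0]]
theta = [[0.7, 0.1, 0.1, 0.1], [0.05, 0.05, 0.85, 0.05]]
print(log_theta_power(Y, theta))
```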

For our between-motif variability, we simply assume that the motif frequency

matrices Θi in our collection share a common but completely unknown

distribution, denoted F(·), i.e.

Between-motif level (frequency matrices, $p(\Theta_i)$):

$$\Theta_i = (\theta_{i1}, \ldots, \theta_{iw}) \sim F(\cdot)$$

where F(·) is an unknown distribution with w dimensions for the columns × 4

dimensions for the nucleotides (constrained to sum to one).

This unknown distribution F(·) represents the common structure between the

different motifs in the dataset. Estimation of this unknown distribution is com-

plicated by the fact that our frequency matrices Θi are unknown, with only the

count matrices Yi being observed.

A common prior for an unknown distribution F(·) is a Dirichlet process D(γ)

with characteristic smooth measure γ. Here, we have a multidimensional F(·),

so we use a Dirichlet process prior D(γ1 × · · · × γw) where each smooth measure

γj is four dimensional but constrained to sum to one. We take a uniform γj =

Dirichlet(α, . . . , α) for each smooth measure j = 1, . . . , w.

5.2 Clustering of Observations

An important consequence of our model is that it enables similar motifs to be

clustered together into groups with identical frequency matrices. Ferguson (1974)

states that if x1, . . . , xn are n observations, taking on K distinct values ζ1, . . . , ζK

drawn from F(·) with prior D(γ), then

$$F(\cdot) \mid \zeta_1, \ldots, \zeta_K \sim D(\gamma^*) = D\Big(\gamma + \sum_{i=1}^{K} \delta_{\zeta_i}\Big)$$

So the distribution of F(·) conditional on K distinct observations is a mixture of

the smooth measure γ and K point masses. This point mass component allows

for the clustering of similar observations.

If we were to draw an additional (n + 1)-th observation x from this distribution

D(γ∗), that new observation would either come from the smooth measure γ, or

would take on a value exactly equal to one of the current ζi’s, say ζk, in which

case ζk and x are defined as being in the same cluster.

The conditional distribution p(ζi|ζ−i) of one current observation ζi, given all other

observations ζ−i, is also a mixture between the smooth measure and K point

masses at each of the ζ−i that represent the unique values within ζ−i. Any obser-

vations ζm and ζn in ζ−i that have the same value are defined as being in the same

cluster.

This conditional distribution allows us to implement our model via a Gibbs sam-

pling algorithm (Geman and Geman, 1984), which is a Markov Chain Monte

Carlo strategy for simulating unknown parameters (or sets of parameters) one

at a time by conditioning on the current values of all the other parameters.

Liu (1996) examines the use of Dirichlet processes as a prior in a binomial hi-

erarchical setting. Green and Richardson (2001) discuss the use of the Dirichlet

process as a flexible model for clustering observations, and present an extended

class of Dirichlet-Multinomial allocations for which the Dirichlet process is a

limiting case. Medvedovic and Sivaganesan (2002) use the clustering properties of

the Dirichlet process prior as part of a hierarchical model for gene expression

profiles from microarray data.

5.3 Gibbs Sampling Implementation

For our motif clustering model, the Gibbs sampler could intuitively be based on

p(Θi|Θ−i). However, since our Θi’s are actually unknown, a more efficient clus-

tering procedure involves drawing values of the clustering indicators directly,

without dealing with drawing a frequency matrix Θi for each motif i at each iter-

ation.

We denote our clustering indicators as zi where zi = k if Θi takes on the same

value as Θk (and hence is in the k-th cluster) or zi = 0 if Θi is drawn from the

smooth measure γ (and hence forms a new cluster).

We would like to sample directly from the conditional distribution of these

clustering indicators, i.e. we want to sample from

p(zi|z−i,Y)

where we again use the notation z−i or Θ−i to mean all the z or Θ parameters

except the i-th one.

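$$\begin{aligned}
p(z_i \mid z_{-i}, Y) &= \int\!\!\int p(z_i, \Theta_i, \Theta_{-i} \mid z_{-i}, Y)\, d\Theta_i\, d\Theta_{-i} \\
&\propto \int\!\!\int p(Y_i \mid \Theta_i, \Theta_{-i}, z, Y_{-i}) \times p(\Theta_i \mid \Theta_{-i}, z, Y_{-i}) \times p(\Theta_{-i} \mid z, Y_{-i}) \times p(z_i \mid z_{-i}, Y_{-i})\, d\Theta_i\, d\Theta_{-i}
\end{aligned}$$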

Now, as we mentioned previously, Yi|Θi follows a product multinomial distri-

bution independent of the other count matrices:

$$p(Y_i \mid \Theta_i, z_i, \Theta_{-i}, z_{-i}, Y_{-i}) = p(Y_i \mid \Theta_i) = \Theta_i^{Y_i}$$

where we again use the notation $\Theta^{Y}$ to mean $\prod_{j=1}^{w} \prod_{k=1}^{4} \theta_{jk}^{Y_{jk}}$.

Also, as we mentioned, the conditional distribution of Θi is a mixture of the

smooth measure γ and K clusters, indexed by zi,

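$$p(\Theta_i \mid z_i = 0) = \left[\frac{\Gamma(4\alpha)}{\Gamma(\alpha)^4}\right]^{w} \Theta_i^{\alpha-1}$$

$$p(\Theta_i \mid z_i = k, \Theta_{-i}, z_{-i}) = \delta(\Theta_i = \Theta_k), \qquad k = 1, \ldots, K$$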

We have the conditional prior distribution of zi, which for the Dirichlet process

prior is

$$p(z_i = 0 \mid z_{-i}) = \frac{c}{c+n-1}, \qquad p(z_i = l \mid z_{-i}) = \frac{n_l}{c+n-1} \tag{5.1}$$

where nl is the number of z’s in z−i which are equal to l. It is evident from

the prior probabilities (5.1) that the probability for joining a particular cluster

increases as the number of observations in that cluster increases, implying that

the Dirichlet process prior favors unequal allocations of observations.

Returning to our posterior probability calculation, if we first consider the case

zi = 0 (ie. forming a new cluster), then we have

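$$\begin{aligned}
p(z_i = 0 \mid z_{-i}, Y) &\propto \int p(Y_i \mid \Theta_i)\, p(\Theta_i \mid z_i = 0)\, p(z_i = 0 \mid z_{-i})\, d\Theta_i \\
&\propto \int \Theta_i^{Y_i} \left[\frac{\Gamma(4\alpha)}{\Gamma(\alpha)^4}\right]^{w} \Theta_i^{\alpha-1}\, \frac{c}{c+n-1}\, d\Theta_i \\
&\propto \frac{c}{c+n-1} \prod_{j=1}^{w} \frac{\prod_k \Gamma(Y_{ijk}+\alpha)}{\Gamma\big(\sum_k Y_{ijk} + 4\alpha\big)} \cdot \frac{\Gamma(4\alpha)}{\Gamma(\alpha)^4}
\end{aligned} \tag{5.2}$$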

Now, for the case where zi = l ≠ 0 (i.e. joining an existing cluster that already has

a count matrix Yl), we have

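$$\begin{aligned}
p(z_i = l \mid z_{-i}, Y) &\propto \int\!\!\int p(Y_i \mid \Theta_i)\, \delta(\Theta_i = \Theta_l)\, p(\Theta_{-i} \mid z_i = l, z_{-i}, Y_{-i}) \times p(z_i = l \mid z_{-i})\, d\Theta_i\, d\Theta_{-i} \\
&\propto \int p(Y_i \mid \Theta_l)\, p(\Theta_l \mid z, Y_{-i})\, p(z_i = l \mid z_{-i})\, d\Theta_l \\
&\propto \int \Theta_l^{Y_i}\, \frac{p(Y_l \mid \Theta_l, z)\, p(\Theta_l \mid z)}{p(Y_l \mid z)}\, \frac{n_l}{c+n-1}\, d\Theta_l \\
&\propto \frac{n_l}{c+n-1} \cdot \frac{\int \Theta_l^{Y_i + Y_l + \alpha - 1}\, d\Theta_l}{\int \Theta_l^{Y_l + \alpha - 1}\, d\Theta_l} \\
&\propto \frac{n_l}{c+n-1} \prod_{j=1}^{w} \frac{\prod_k \Gamma(Y_{ijk} + Y_{ljk} + \alpha)}{\Gamma\big(\sum_k (Y_{ijk} + Y_{ljk}) + 4\alpha\big)} \cdot \frac{\Gamma\big(\sum_k Y_{ljk} + 4\alpha\big)}{\prod_k \Gamma(Y_{ljk} + \alpha)}
\end{aligned} \tag{5.3}$$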

A complete iteration of our Gibbs sampling algorithm results in a complete sam-

ple z of our clustering indicators, which also represents a complete partition of

our motif matrices.
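
The two conditional probabilities (5.2) and (5.3) can be evaluated on the log scale with the log-gamma function. The sketch below is one possible way to do so and is not the implementation used for the results in this thesis; the function names, the dictionary layout of `clusters`, and the use of label 0 for a new cluster are assumptions. Existing clusters are assumed to carry nonzero labels, and their sizes and summed count matrices are assumed to exclude motif i.

```python
import math
import random


def log_marginal(Y, alpha):
    """sum over positions of [sum_k lgamma(Y_jk + alpha) - lgamma(sum_k Y_jk + 4 alpha)]."""
    return sum(
        sum(math.lgamma(y + alpha) for y in pos) - math.lgamma(sum(pos) + 4 * alpha)
        for pos in Y
    )


def add_counts(Ya, Yb):
    """Element-wise sum of two count matrices of the same shape."""
    return [[a + b for a, b in zip(pa, pb)] for pa, pb in zip(Ya, Yb)]


def cluster_log_weights(Yi, clusters, alpha, c):
    """Unnormalised log weights for z_i, following (5.2) and (5.3).

    Yi       -- count matrix of motif i (list of w positions, four counts each)
    clusters -- dict: nonzero label -> (n_l, Y_l), both computed without motif i
    The common factor 1 / (c + n - 1) is dropped because it cancels on normalising.
    """
    w = len(Yi)
    log_norm = w * (math.lgamma(4 * alpha) - 4 * math.lgamma(alpha))
    weights = {0: math.log(c) + log_marginal(Yi, alpha) + log_norm}  # new cluster
    for label, (n_l, Yl) in clusters.items():
        weights[label] = (
            math.log(n_l)
            + log_marginal(add_counts(Yi, Yl), alpha)
            - log_marginal(Yl, alpha)
        )
    return weights


def sample_label(log_weights, rng=random):
    """Draw a label with probability proportional to exp(log weight)."""
    labels = list(log_weights)
    m = max(log_weights.values())  # subtract the maximum for numerical stability
    probs = [math.exp(log_weights[k] - m) for k in labels]
    return rng.choices(labels, weights=probs, k=1)[0]
```

One Gibbs iteration would apply `cluster_log_weights` and `sample_label` to each motif in turn, removing the motif from its current cluster before computing the weights.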

5.4 Motif Alignment

A further unknown in the analysis is that we do not necessarily

know which “central motif” Yi of length w to use within the raw alignment

matrix of length ni > w for motif i. For example, if our clustering algorithm is

based on a fixed width of w = 6 and our i-th raw motif matrix Ni has 8 positions,

then we have three possible choices for our central motif: Yi = columns 1 to 6 of

Ni, Yi = columns 2 to 7 of Ni, or Yi = columns 3 to 8 of Ni.
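
The candidate central motifs are simply the contiguous windows of width w within the raw matrix. A minimal sketch follows, assuming each raw matrix is stored as a list of positions with four counts each; the name `central_windows` and the toy counts are hypothetical.

```python
def central_windows(N, w):
    """All contiguous width-w submatrices of a raw count matrix N (one row per position)."""
    if len(N) < w:
        return []  # motif is narrower than the clustering width
    return [N[s:s + w] for s in range(len(N) - w + 1)]


# Raw motif with 8 positions; each row holds made-up (a, c, g, t) counts.
N = [[p, 0, 0, 10 - p] for p in range(8)]
windows = central_windows(N, w=6)
print(len(windows))  # 3
```

The empty result for a motif narrower than w corresponds to such motifs being excluded from a clustering at that width.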

Our hierarchical clustering model assumes that the Yi for each motif is known,

so we need an additional step where, for each raw data matrix, the best location

of the central motif Yi is drawn conditional on the other motifs Y−i and clustering

indicators z−i for the other motifs.

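$$\begin{aligned}
p(Y_i \mid z_{-i}, Y_{-i}) &= \int p(Y_i \mid \Theta_i)\, p(\Theta_i \mid z_{-i}, Y_{-i})\, d\Theta_i \\
&= \int \Theta_i^{Y_i}\, \frac{c}{c+n-1} \left[\frac{\Gamma(4\alpha)}{\Gamma(\alpha)^4}\right]^{w} \Theta_i^{\alpha-1}\, d\Theta_i
 + \sum_{l=1}^{K} \int \Theta_i^{Y_i}\, \frac{n_l}{c+n-1}\, \delta(\Theta_i = \Theta_l)\, \frac{p(Y_l, \Theta_l \mid z_{-i})}{p(Y_l \mid z_{-i})}\, d\Theta_i \\
&= \frac{c}{c+n-1} \left[\frac{\Gamma(4\alpha)}{\Gamma(\alpha)^4}\right]^{w} \int \Theta_i^{Y_i + \alpha - 1}\, d\Theta_i
 + \sum_{l=1}^{K} \frac{n_l}{c+n-1} \cdot \frac{\int \Theta_l^{Y_i + Y_l + \alpha - 1}\, d\Theta_l}{\int \Theta_l^{Y_l + \alpha - 1}\, d\Theta_l} \\
&= \frac{c}{c+n-1} \left[\frac{\Gamma(4\alpha)}{\Gamma(\alpha)^4}\right]^{w} \prod_{j=1}^{w} \frac{\prod_k \Gamma(Y_{ijk} + \alpha)}{\Gamma\big(\sum_k Y_{ijk} + 4\alpha\big)} \\
&\quad + \sum_{l=1}^{K} \frac{n_l}{c+n-1} \prod_{j=1}^{w} \frac{\prod_k \Gamma(Y_{ijk} + Y_{ljk} + \alpha)}{\Gamma\big(\sum_k (Y_{ijk} + Y_{ljk}) + 4\alpha\big)} \cdot \frac{\Gamma\big(\sum_k Y_{ljk} + 4\alpha\big)}{\prod_k \Gamma(Y_{ljk} + \alpha)}
\end{aligned} \tag{5.4}$$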

This alignment procedure is performed every tenth iteration of the marginal Gibbs

sampler described in the previous section.

5.5 Clustering of Two-Block Motifs

As described in the motif discovery Chapters 2-4, our motif discovery strategy

may focus not only on single-block motifs but also on two-block motifs with a

variable length gap. A subsequent question is how to use the two-block

information in our clustering model. We propose two different strategies for clustering

two-block motifs.

The first strategy is to separate the two-block motif into two independent single

block motifs, and cluster these new single block motifs together with any original

one-block motifs. The disadvantage of this strategy is that we are ignoring the

linkage between the two blocks, with the advantage being that we are able to

cluster both two-block and one-block motif results together.

The alternative strategy is to cluster the two-block motif as a single entity, while still

allowing separate alignment steps within each block. This strategy

acknowledges the inherent link between the two blocks, but does not allow us to cluster

the two-block motifs together with the one-block motif results.

We are interested in examining the similarities and differences in the clustering

results under these two different strategies. Utilizing both strategies gives us

additional power to detect motifs that are similar based on combinations of one

or two blocks.

This type of situation can occur in practice, such as in Table 4.5 where the two-

block motifs for TFs σE and σK seem to share similar second blocks but quite

different first blocks. The combination of these two strategies will play a critical

role when we combine these clustering methods with our motif discovery meth-

ods to predict co-regulated gene clusters in Chapter 7.

5.6 Advantages of our Clustering Model

This model has several advantages over the traditional clustering methods

briefly discussed in Chapter 1. First of all, our hierarchical framework lets us

account for uncertainty in the count matrices that represent each TF motif by

assuming a product multinomial distribution. Most clustering methods, such as

hierarchical tree clustering or K-means clustering, assume the count matrices

are fixed and known without error.

The second advantage is that our clustering strategy does not need to use any ad

hoc distance measures in order to compare motifs. At each iteration of the Gibbs

sampling algorithm, the decision to cluster a particular observation is determined

by the conditional distribution of zi given all other information (z−i, Y).

Thus, our distance metric is exactly equal to the conditional posterior distribution

under our full Bayesian clustering model, which is analogous to our motif discov-

ery strategy (Chapter 3) of using a scoring function based on the exact posterior

distribution instead of an ad hoc scoring function.

A third advantage is that our clustering model allows not only the clusters

themselves to vary (in terms of which motifs are members of which clusters) but also

the number of clusters to vary. This is a key improvement over a

clustering technique that requires the number of clusters to be fixed (such as K-means

clustering) in this situation, since we have very little idea a priori about how many

motifs we might expect to be similar to each other.

Another advantage of our Bayesian formulation and stochastic implementation is

that it allows us to summarize the model with posterior sampling which gives us

an idea of the variability of our clustering results, whereas traditional clustering

methods typically give only a point estimate.

In Chapter 6 when we discuss strategies for analyzing our clustering results, we

will also focus primarily on point estimates, such as the posterior mode, but we

also will address the variability of our results.

In Chapters 2-4, we extended the usual motif discovery models to the case where

motif width w is allowed to vary. Although our clustering model assumes a

known motif width w, our additional motif alignment steps allow the central

matrix Yi within each raw data matrix Ni to vary, which effectively means that

our clustering results can be based on the “most-conserved” portions of each mo-

tif count matrix, regardless of the differences in width between each discovered

motif.

We also extended motif discovery models to motifs that consist of two conserved

blocks separated by a gap of variable length. Our strategy of combining two

different clustering procedures, independent-block and joint-block, allows

information to be shared between one and two-block motifs, while still acknowledging

the natural linkage between each block of a two-block motif.

5.7 Comparison with Other Clustering Priors

As mentioned in Section 5.3, the Dirichlet Process prior favors unequal allocation

of observations, meaning that each new observation has a greater prior probabil-

ity of being placed in a cluster that already has many observations. If we already

have n observations divided into L clusters z with n1, . . . , nL members, then

$$\begin{aligned}
P(z_{n+1} = l \mid z, \text{DP}) &= \frac{n_l}{c+n}, \qquad l = 1, \ldots, L \\
P(z_{n+1} = L+1 \mid z, \text{DP}) &= \frac{c}{c+n}
\end{aligned} \tag{5.5}$$

An alternative is a uniform clustering prior which favors equal allocations of

observations, i.e. the prior probability that a new observation is placed in any one of

the existing clusters is uniform,

$$\begin{aligned}
P(z_{n+1} = l \mid z, \text{Unif}) &= \frac{1}{c+L}, \qquad l = 1, \ldots, L \\
P(z_{n+1} = L+1 \mid z, \text{Unif}) &= \frac{c}{c+L}
\end{aligned} \tag{5.6}$$

In fact, we can consider both the Dirichlet process and uniform clustering specifi-

cations as particular cases of a more general clustering prior distribution, where

$$\begin{aligned}
P(z_{n+1} = l \mid z) &\propto f(n_l), \qquad l = 1, \ldots, L \\
P(z_{n+1} = L+1 \mid z) &\propto c
\end{aligned} \tag{5.7}$$

This general clustering model reduces to the Dirichlet process when f(nl) = nl

and the uniform clustering prior when f(nl) = 1, but more general functions may

be desirable in particular situations.
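
To make the relationship between (5.5), (5.6) and (5.7) concrete, the following sketch normalises the general weights for a given choice of f. The function name `allocation_probs` and the example cluster sizes are assumptions for illustration.

```python
def allocation_probs(sizes, c, f):
    """Prior probabilities for the next observation under the general prior (5.7).

    sizes -- current cluster sizes n_1, ..., n_L
    c     -- weight for opening a new cluster
    f     -- function of cluster size: f(n) = n gives (5.5), f(n) = 1 gives (5.6)
    Returns L + 1 probabilities; the last entry is the new-cluster probability.
    """
    weights = [f(n) for n in sizes] + [c]
    total = sum(weights)
    return [wt / total for wt in weights]


sizes = [5, 2, 1]  # n = 8 observations in L = 3 clusters
dp = allocation_probs(sizes, c=1.0, f=lambda n: n)
unif = allocation_probs(sizes, c=1.0, f=lambda n: 1)
print(dp)    # proportional to 5, 2, 1, 1 with normalising constant c + n = 9
print(unif)  # every entry equals 1 / (c + L) = 0.25
```

With these sizes the Dirichlet process puts more than half of the prior mass on the largest cluster, whereas the uniform prior treats all existing clusters alike.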

The prior density of a partition z under either our Dirichlet process or uniform

clustering model can be calculated recursively using either formula (5.5) or (5.6).

For the Dirichlet process, for the first cluster with members (z1, z2, . . . , zn1), we

have

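$$\begin{aligned}
p(z_1, \ldots, z_{n_1}) &= p(z_1)\, p(z_2 \mid z_1) \cdots p(z_{n_1} \mid z_{n_1 - 1}) \\
&= \frac{1}{c+1} \cdot \frac{2}{c+2} \cdots \frac{n_1 - 1}{c + n_1 - 1} = \frac{c \cdot (n_1 - 1)!}{\prod_{i=1}^{n_1} (c + i - 1)}
\end{aligned}$$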

Continuing this process through all L clusters in the partition z, we finally have

$$p(z \mid \text{DP}) = \frac{c^{L} \cdot \prod_{l=1}^{L} (n_l - 1)!}{\prod_{i=1}^{n} (c + i - 1)} \tag{5.8}$$

It is worth noting that the prior density (5.8) under the Dirichlet process model

does not depend on the ordering of our recursive calculations, i.e. the ordering

in which our conditional probabilities were calculated. This means that different

partitions with the same cluster sizes are exchangeable under the Dirichlet process

model.

For the uniform clustering model, starting from the first cluster with members

(z1, z2, . . . , zn1), we have

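$$\begin{aligned}
p(z_1, \ldots, z_{n_1}) &= p(z_1)\, p(z_2 \mid z_1) \cdots p(z_{n_1} \mid z_{n_1 - 1}) \\
&= \frac{c}{c} \cdot \frac{1}{c+1} \cdot \frac{1}{c+1} \cdots \frac{1}{c+1} = \frac{c}{c\,(c+1)^{n_1 - 1}}
\end{aligned}$$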

Continuing these recursive calculations through all L clusters in the partition z,

we have

$$p(z \mid \text{Unif}) = \frac{c^{L-1} \cdot (c + L)}{\prod_{l=1}^{L} (c + l)^{n_l}} \tag{5.9}$$

Examining the prior density (5.9), we see that the denominator does depend on

the ordering in which our recursive calculations were performed. Thus, we will

get different values of (5.9) for different orderings of unequally-sized clusters,

which should actually be exchangeable.
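
The dependence on ordering can be checked numerically by building a partition one observation at a time. The sketch below is illustrative only; the function name `log_prior_sequential` and the toy partitions are assumptions. It evaluates the same four-observation partition (one cluster of size 3 and one singleton) under two different processing orders.

```python
import math


def log_prior_sequential(z, c, f):
    """log p(z) obtained by allocating observations one at a time in the order given.

    z -- list of cluster labels; the first occurrence of a label opens a new cluster
    f -- size function of the general prior: f(n) = n (DP) or f(n) = 1 (uniform)
    """
    sizes = {}
    logp = 0.0
    for label in z:
        total = c + sum(f(n) for n in sizes.values())
        if label in sizes:
            logp += math.log(f(sizes[label]) / total)  # join an existing cluster
            sizes[label] += 1
        else:
            logp += math.log(c / total)                # open a new cluster
            sizes[label] = 1
    return logp


big_first = [1, 1, 1, 2]    # cluster of size 3 filled first, then the singleton
small_first = [2, 1, 1, 1]  # same cluster sizes, singleton opened first


def dp(n):
    return n


def unif(n):
    return 1


print(log_prior_sequential(big_first, 1.0, dp), log_prior_sequential(small_first, 1.0, dp))
print(log_prior_sequential(big_first, 1.0, unif), log_prior_sequential(small_first, 1.0, unif))
```

With c = 1 the two orderings give the same value, log(1/12), under the Dirichlet process, while the uniform prior gives log(1/8) for the first ordering and log(1/18) for the second, in agreement with (5.8) and (5.9).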

As suggested by Green and Richardson (2001), to ensure exchangeability of our

uniform clustering model, we make our prior density p(z|Unif) a function of a

“signature” of the partition that is identical for exchangeable partitions. For

example, if we let p(z|Unif) = k · p(z′|Unif) where z′ is z with the zi’s arranged in order

from the largest cluster to the smallest, then the calculation of (5.9) for z′ will be

the same for all exchangeable values of z. All of these complications are avoided

in the Dirichlet process model which automatically gives the same prior density

value for exchangeable partitions.

We can also compare the behaviour of these two clustering prior specifications

with a simple simulation study, where 1000 complete partitions z = (z1, . . . , zn)

with n = 1000 and c = 1 were generated under both sets of probabilities (5.5)

and (5.6) above. In Figure 5.1, we see the distributions of both the number of

clusters as well as the size of the multiple-member (nl > 1) clusters over all of our

simulated partitions.

[Figure 5.1 shows four histograms, each with vertical axis “Frequency”: Number of Clusters (DP), Number of Clusters (Uniform), Size of Clusters (DP), and Size of Clusters (Uniform).]

Figure 5.1: Comparison of clustering statistics between DP and Uniform priors

As expected, the number of clusters (with multiple members) is much larger

under the uniform prior, and some clusters from the Dirichlet process are

much larger than any generated from the uniform prior specification.
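
A simulation of this kind can be sketched in a few lines. The code below is not the code used to produce Figure 5.1; the function name `simulate_partition`, the random seed, and the reduced settings (n = 200 and 50 replicates, chosen only to keep the run short) are assumptions. It also counts all clusters, not only the multiple-member ones.

```python
import random


def simulate_partition(n, c, f, rng):
    """Cluster sizes for n observations allocated one at a time under prior (5.7)."""
    sizes = []
    for _ in range(n):
        weights = [f(m) for m in sizes] + [c]  # last weight: open a new cluster
        k = rng.choices(range(len(weights)), weights=weights, k=1)[0]
        if k == len(sizes):
            sizes.append(1)
        else:
            sizes[k] += 1
    return sizes


rng = random.Random(0)
n, c, reps = 200, 1.0, 50
for name, f in [("DP", lambda m: m), ("Uniform", lambda m: 1)]:
    draws = [simulate_partition(n, c, f, rng) for _ in range(reps)]
    mean_clusters = sum(len(s) for s in draws) / reps
    mean_largest = sum(max(s) for s in draws) / reps
    print(name, round(mean_clusters, 1), round(mean_largest, 1))
```

Comparing the average number of clusters and the average size of the largest cluster between the two priors gives a quick check of the qualitative pattern described above.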

Chapter 6

Analyzing Motif Clustering Results

In this chapter, we will apply our clustering model to a dataset of 116 different

transcription factors and discuss different strategies for visualizing and analyz-

ing our clustering results. The raw data was provided (C. Lawrence, personal

communication) in the form of 116 nucleotide-count matrices that differed

substantially in appearance, number of counts, and motif width. Between different

motifs, the number of counts varies from fewer than 10 to 185 counts.

The motifs were generally short, with an average motif width of approximately

11 bps, so only a one-block clustering strategy was used. We focus on a central

motif of width 8, though we will also examine the clustering for a central motif

of width 6 and 10. In Chapter 7, we present an application that combines both

one-block and two-block motif clustering in combination with motif discovery.

For this 116 motif dataset, we also have the extra information that the TF for

each motif has been classified a priori into a particular “protein family” based on

the common physical structure of their DNA-binding domains. For example, one

family of transcription factors is the helix-loop-helix family, which has two DNA-

binding helix domains that bind directly to the DNA strand and are joined by a

loop domain. Table 6.1 shows each protein family in the dataset, along with the

number of motifs, the average number of sites (|A|) and the average width (w).

Table 6.1: Protein Families in Dataset

Family          Number    Average |A|    Average w
TEA                  1             12           12
MADS                 5             64           11
TATA-BOX             1             54           16
RUNT                 1             38            9
bHLH                 6             31           10
FORKHEAD             8             25           13
NUCLEAR             16             26           13
HOMEO-ZIP            1             25            8
T-BOX                1             40           11
ZN-FINGER           24             28            9
PAIRED               3             29           14
bZIP                 9             26           10
REL                  6             19           10
ETS                  7             31            8
HOMEO                6             36            9
TRP-CLUSTER          5             36           11
HMG                  6             31           10
bHLH-ZIP             4             25            8
CAAT-BOX             1            116           16
PAIRED-HOMEO         1             21           30
IPT/TIG              1             10           16
P53                  1             17           20
AP2                  1            185            9
UNKNOWN              1             10            8

We will use this extra “protein family” information in order to validate the results

produced by our clustering model. For our Dirichlet process prior described in

Sections 5.1-5.3 above, we set both the prior parameter α and the prior weight c

equal to 1.

As described in Section 5.3, our Bayesian hierarchical clustering model was im-

plemented using a Gibbs sampling algorithm. Each iteration of the Gibbs sam-

pler produces two vectors, one giving the alignment of each central matrix Yi

within the raw motif matrix Ni for each motif, and the other giving the clustering

indicator zi for each motif.

Since our clustering model is implemented by using a Markov-chain Monte-

Carlo algorithm, it is important to evaluate whether or not our Gibbs sampling

iterations have converged to our desired posterior distribution.

Following the recommendation of Gelman and Rubin (1992), we started separate

chains of our Gibbs sampling algorithm from several different starting points, i.e.

different initial partitions $z^0$. Examples of our starting partitions are the

“each-in-own” partition, $z_i^0 = i$ for i = 1, . . . , n; the “all-in-one” partition,

$z_i^0 = 1$ for i = 1, . . . , n; and a “random” partition, where the clustering

indicators were randomly drawn from a discrete uniform distribution.

Each Gibbs sampling chain was run for 500 iterations. The within-chain versus

between-chain variance measure R (Gelman and Rubin, 1992) was calculated for

two functions of the clusters: the average cluster size and the number of clusters.

R was less than 1.1 for both of these quantities after 500 iterations, leading us to

conclude that our MCMC algorithm had converged.
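
For reference, a basic form of this convergence measure can be computed as follows. This is a simplified sketch: it omits the additional degrees-of-freedom correction described by Gelman and Rubin (1992), and the function name `potential_scale_reduction` and the toy chains are assumptions.

```python
import math


def potential_scale_reduction(chains):
    """Basic Gelman-Rubin R for one scalar summary.

    chains -- list of m >= 2 lists, each holding n >= 2 draws of the summary
              (for example, the number of clusters at each iteration)
    """
    m = len(chains)
    n = len(chains[0])
    means = [sum(ch) / n for ch in chains]
    grand = sum(means) / m
    # B: between-chain variance; W: average within-chain variance
    B = n / (m - 1) * sum((mu - grand) ** 2 for mu in means)
    W = sum(
        sum((x - mu) ** 2 for x in ch) / (n - 1) for ch, mu in zip(chains, means)
    ) / m
    var_plus = (n - 1) / n * W + B / n  # pooled estimate of the target variance
    return math.sqrt(var_plus / W)


# Two short made-up chains of a summary statistic.
chains = [[1, 2, 3, 4], [2, 3, 4, 5]]
print(potential_scale_reduction(chains))
```

Values close to 1 indicate that the between-chain variability is small relative to the within-chain variability.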

We now discuss several strategies for analyzing the results from our Gibbs sam-

pling implementation.

6.1 Clustering Trees

An intuitive means for examining our overall clustering results is the posterior

probability pij that a particular pair of motifs i and j are in the same cluster. The

value of pij for any two motifs i and j can be estimated by the proportion of

iterations that have motifs i and j in the same cluster. This quantity is a Monte

Carlo estimate of the posterior mean of the indicator variable for motifs i and j

being in the same cluster.

Based on these pairwise clustering probabilities pij, a pairwise distance measure

can be calculated between each pair of motifs in the dataset, dij = 1 − pij. The

distance matrix for an entire dataset can then be analyzed by a single-linkage,

average-linkage or complete-linkage hierarchical tree algorithm. The result of

this procedure is a tree structure, which visualizes the clustering pattern for the

entire dataset.
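
The pairwise probabilities and distances can be computed directly from the sampled partitions. A minimal sketch follows, assuming the Gibbs output is stored as a list of label vectors; the function name `coclustering_distances` and the toy samples are hypothetical. The resulting matrix could then be passed to any standard hierarchical tree routine.

```python
def coclustering_distances(samples):
    """Pairwise distances d_ij = 1 - p_ij from sampled partitions.

    samples -- list of partitions, each a list of n cluster labels (one per motif)
    """
    n = len(samples[0])
    T = len(samples)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            together = sum(1 for z in samples if z[i] == z[j])
            d[i][j] = d[j][i] = 1.0 - together / T
    return d


# Four made-up sampled partitions of three motifs.
samples = [[1, 1, 2], [1, 1, 1], [1, 2, 2], [1, 1, 2]]
d = coclustering_distances(samples)
print(d[0][1], d[0][2], d[1][2])  # 0.25 0.75 0.5
```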

The clustering tree, based on the average-linkage hierarchical tree algorithm for

our dataset with central motif width of 8 bps is given in Figure 6.1. With the

restriction of an 8 bp central motif width, our dataset was reduced to 90 valid

motifs (i.e. motifs with width greater than or equal to 8 bps).

The motifs are labeled by both the motif “name”, which is of the form MAxxxx

where xxxx is a number, as well as the protein family to which that motif belongs.

Any motifs that did not cluster with any other motifs are not shown. The length of

each tree “branch” shared by a group of motifs is proportional to the probability

that the group of motifs are in the same cluster.

We can see several interesting relationships from this clustering tree. There are

several groups of motifs that always cluster together but do not cluster with any

other motifs, such as the (ETS-MA0062, ETS-MA0028, ETS-MA0076) group in the

middle of the tree. The clustering tree also allows us to see weaker, more variable

clustering relationships between motifs, such as bZIP-MA0102 in the middle of

the tree, which has a low but non-zero probability of being grouped with the

much tighter pair of motifs bZIP-MA0025 and bZIP-MA0043.

[Figure 6.1 is a clustering tree. Leaf labels, left to right: NUCLEAR-MA0074, NUCLEAR-MA0116, NUCLEAR-MA0071, NUCLEAR-MA0072, NUCLEAR-MA0117, NUCLEAR-MA0066, NUCLEAR-MA0111, REL-MA0061, REL-MA0105, bHLH-ZIP-MA0058, bHLH-ZIP-MA0059, TRP-CLUSTER-MA0050, TRP-CLUSTER-MA0051, FORKHEAD-MA0041, FORKHEAD-MA0047, bHLH-MA0055, bHLH-MA0048, ETS-MA0062, ETS-MA0028, ETS-MA0076, HOMEO-MA0027, HMG-MA0044, bZIP-MA0102, bZIP-MA0025, bZIP-MA0043, REL-MA0107, REL-MA0023, REL-MA0101, bZIP-MA0018, bZIP-MA0097, FORKHEAD-MA0042, FORKHEAD-MA0040, NUCLEAR-MA0007, NUCLEAR-MA0109, ZN-FINGER-MA0012, ZN-FINGER-MA0010, HMG-MA0084, ZN-FINGER-MA0013, FORKHEAD-MA0030, FORKHEAD-MA0033, FORKHEAD-MA0032, FORKHEAD-MA0031.]

Figure 6.1: Clustering tree for dataset based on a motif width of 8 bps

Looking at the protein family information, it is clear that most of the

high-probability clusters consist of motifs that all belong to the same family, providing a strong

indication that TFs in the same protein family can have very similar motifs. There

are interesting exceptions, such as HMG-MA0084, which has a high probability of

clustering with ZN-FINGER-MA0010 and ZN-FINGER-MA0012. Also, it seems

that ZN-FINGER-MA0013 has a moderately high probability of clustering with

the FORKHEAD cluster consisting of motifs MA0030-MA0033. Finally, these two

larger groupings, both shown on the right in Figure 6.1, have a low probability of

clustering with each other, as indicated by the short common branch at the top of

the figure. These relationships may merit further examination to see if there is a

biologically significant reason behind the similarity of these groups of motifs.

6.2 Best Clustering Partition

Although it allows us to examine the clustering structure of the entire dataset,

the clustering tree is not ideal for deducing the “best partition” or best set of

clusters in the dataset, since the clustering tree represents a posterior mean across

many different partitions. This is the same problem that was mentioned for the

technique of hierarchical tree clustering in Chapter 1. One could “cut the tree” at

any number of different threshold distances and thereby produce any number of

possible partitions, but a less arbitrary alternative is to take our best estimate of

the posterior mode of our clusters.

We estimate this posterior mode by calculating the posterior value of the partition

z at the end of each iteration of our sampler, and retaining the partition z with the

highest posterior value as our best estimate of the mode.

The posterior value p(z|Y) of z is calculated as the product of the likelihood value

p(Y|z) and the prior value p(z|α). If our partition z has L clusters, each with nl

members and count matrix Yl (the sum of all w × 4 count matrices in cluster l),

then the likelihood value is

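$$p(Y \mid z) \propto \prod_{l=1}^{L} \int p(Y_l \mid \Theta_l)\, p(\Theta_l \mid z)\, d\Theta_l \propto \prod_{l=1}^{L} \int \Theta_l^{Y_l}\, \Theta_l^{\alpha-1}\, d\Theta_l \propto \prod_{l=1}^{L} \prod_{j=1}^{w} \frac{\prod_k \Gamma(Y_{ljk} + \alpha)}{\Gamma\big(\sum_k Y_{ljk} + 4\alpha\big)}$$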

The prior value of a partition z (conditional on our Dirichlet process prior with

measure α and prior weight c) was calculated in Section 5.7 to be

$$p(z \mid \alpha) = \frac{c^{L} \cdot \prod_{l=1}^{L} (n_l - 1)!}{\prod_{i=1}^{n} (c + i - 1)}$$

So, our posterior value for a particular partition z with L clusters, each with nl

members and count matrices Yl, is

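$$p(z \mid Y) \propto \prod_{l=1}^{L} \prod_{j=1}^{w} \frac{\prod_k \Gamma(Y_{ljk} + \alpha)}{\Gamma\big(\sum_k Y_{ljk} + 4\alpha\big)} \times \frac{c^{L} \cdot \prod_{l=1}^{L} (n_l - 1)!}{\prod_{i=1}^{n} (c + i - 1)} \tag{6.1}$$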
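
The posterior value (6.1) is most conveniently evaluated on the log scale. The sketch below follows (6.1) term by term, so any constant omitted there is omitted here as well; the function name `log_posterior` and the data layout are assumptions for illustration, not the implementation used for the results in this chapter.

```python
import math


def log_posterior(z, Y, alpha, c):
    """Unnormalised log p(z | Y) following (6.1).

    z -- list of cluster labels, one per motif
    Y -- list of count matrices (w positions, four counts each), aligned with z
    """
    n = len(z)
    clusters = {}  # label -> [n_l, summed count matrix Y_l]
    for label, Yi in zip(z, Y):
        if label not in clusters:
            clusters[label] = [1, [pos[:] for pos in Yi]]  # copy so Y is not modified
        else:
            clusters[label][0] += 1
            for tot, pos in zip(clusters[label][1], Yi):
                for k, y in enumerate(pos):
                    tot[k] += y
    log_lik = 0.0
    log_prior = len(clusters) * math.log(c)
    for n_l, Yl in clusters.values():
        for pos in Yl:
            log_lik += sum(math.lgamma(y + alpha) for y in pos)
            log_lik -= math.lgamma(sum(pos) + 4 * alpha)
        log_prior += math.lgamma(n_l)  # log (n_l - 1)!
    log_prior -= sum(math.log(c + i) for i in range(n))  # prod_{i=1}^{n} (c + i - 1)
    return log_lik + log_prior


# Three one-position toy motifs: the first two favour "a", the third favours "t".
Y = [[[3, 0, 0, 0]], [[4, 0, 0, 0]], [[0, 0, 0, 5]]]
print(log_posterior([1, 1, 2], Y, alpha=1.0, c=1.0))
print(log_posterior([1, 2, 1], Y, alpha=1.0, c=1.0))
```

Evaluating this function at the end of each Gibbs iteration and keeping the partition with the largest value gives the estimate of the posterior mode described above. In the toy example, grouping the two motifs that favour the same nucleotide gives the higher value.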

For our dataset, the partition z with the highest posterior value consisted of 16

multiple-member clusters containing 42 out of 90 total motifs. These 16 clusters

are listed in Table 6.2, along with the cluster size, cluster strength, total number

of sites in the cluster (|A|) and the consensus sequence for the cluster. The clus-

ter strength statistic will be explained in Section 6.3. The consensus sequence is

a representation of the total count matrix for the cluster, giving the nucleotide

with the highest count in each position. A nucleotide is only capitalized if its

nucleotide frequency is greater than 0.75 in that position.
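The consensus rule described above can be sketched in a few lines. The column order A, C, G, T and the tie-breaking rule (the first maximal nucleotide wins) are assumptions made for this illustration only; the text does not specify either.

```python
def consensus(counts, threshold=0.75, alphabet="ACGT"):
    """Consensus string for a w x 4 count matrix.

    The most frequent nucleotide is reported at each position and is
    capitalised only if its frequency is strictly greater than `threshold`.
    """
    letters = []
    for row in counts:
        total = sum(row)
        k = max(range(4), key=lambda idx: row[idx])  # ties: first maximum (assumption)
        freq = row[k] / total if total > 0 else 0.0
        base = alphabet[k]
        letters.append(base if freq > threshold else base.lower())
    return "".join(letters)
```

For example, a position with counts (0, 0, 3, 1) has a G frequency of exactly 0.75 and is therefore written in lower case under this reading of the rule.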

Also given in Table 6.2 are motifs contained in each cluster, and the proportion

of each protein family present in that cluster. As suggested by the clustering tree

in Section 6.1, most of our “best” clusters contain motifs from within a single TF

protein family. Three exceptions are: cluster 2 which is mostly FORKHEAD mo-

tifs but also contains a ZN-FINGER motif MA0013, cluster 7 which contains two

ZN-FINGER motifs and one HMG motif MA0084, and cluster 16, which contains

a HOMEO motif and a HMG motif.

Table 6.2: Best partition of clusters for dataset

Clus  Size  Strength  |A|  Consensus  Families           Motifs
1     5     187.7     145  gTAGGTCA   NUCLEAR (5/5)      MA0066 MA0071 MA0072 MA0117 MA0111
2     5     140.9     93   gTAAACAa   FORKHEAD (4/5)     MA0030 MA0033 MA0032 MA0031
                                      ZN-FINGER (1/5)    MA0013
3     3     72.0      44   GgaTTTCC   REL (3/3)          MA0023 MA0101 MA0107
4     3     70.9      55   aCCGGAAg   ETS (3/3)          MA0028 MA0062 MA0076
5     2     46.0      32   AAgcGAAA   TRP-CLUSTER (2/2)  MA0050 MA0051
6     2     45.6      48   taaGaACa   NUCLEAR (2/2)      MA0007 MA0109
7     3     45.6      49   taaACAAt   ZN-FINGER (2/3)    MA0010 MA0012
                                      HMG (1/3)          MA0084
8     2     43.6      38   acCACGTG   bHLH-ZIP (2/2)     MA0058 MA0059
9     2     38.6      49   TGTTTaTt   FORKHEAD (2/2)     MA0042 MA0040
10    3     38.1      59   TTacGtAA   bZIP (3/3)         MA0025 MA0043 MA0102
11    2     37.8      20   cGaGTTCA   NUCLEAR (2/2)      MA0074 MA0116
12    2     34.8      64   TgTTtgtT   FORKHEAD (2/2)     MA0041 MA0047
13    2     30.7      56   GGGgatTc   REL (2/2)          MA0061 MA0105
14    2     27.4      49   gTGACGTG   bZIP (2/2)         MA0018 MA0097
15    2     20.2      70   CAGCTGcg   bHLH (2/2)         MA0055 MA0048
16    2     13.2      23   gTtGTact   HOMEO (1/2)        MA0027
                                      HMG (1/2)          MA0044

Although this best partition has reduced our dataset to a list of interesting clusters, we have lost information about the variability of these clusters by focusing on a point estimate. The first two clusters mentioned as exceptions in the previous paragraph are also discussed in Section 6.1, but with the additional information that MA0084 is very strongly linked to the other members of its cluster while MA0013 seems to have a somewhat lower probability of being included in its cluster. In the next section, we discuss characteristics that allow us to summarize some of the variability present within our best partition.

6.3 Strength of Clusters

We can also examine cluster-level and observation-level clustering characteristics

within this best partition of our motif matrices. We can measure the strength of

each cluster by calculating the Bayes factor (Kass and Raftery, 1995) for the current

cluster $l$, with members $z = (z_1, z_2, \ldots, z_{n_l})$, versus each member of the cluster forming its own cluster,

$$\text{Strength}(\text{Cluster } l) = \log\left[\frac{P(z \text{ all same} \mid Y)}{P(z \text{ all different} \mid Y)}\right] = \log\left[\frac{P(Y \mid z \text{ all same})}{P(Y \mid z \text{ all different})} \times \frac{P(z \text{ all same})}{P(z \text{ all different})}\right]$$

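For a cluster of motifs $(Y_1, \ldots, Y_m)$ and clustering indicators $z = (z_1, \ldots, z_m)$,

$$\text{Strength} = \log\left[\frac{\int \Theta^{Y+\alpha-1}\, d\Theta}{\prod_{i=1}^{m} \int \Theta_i^{Y_i+\alpha-1}\, d\Theta_i} \times \frac{(m-1)!}{c^{m-1}}\right] = \log\left[\frac{\prod_{j=1}^{w} \dfrac{\prod_{k} \Gamma(Y_{jk}+\alpha)}{\Gamma\left(\sum_{k} Y_{jk}+4\alpha\right)}}{\prod_{i=1}^{m} \prod_{j=1}^{w} \dfrac{\prod_{k} \Gamma(Y_{ijk}+\alpha)}{\Gamma\left(\sum_{k} Y_{ijk}+4\alpha\right)}} \times \frac{(m-1)!}{c^{m-1}}\right]$$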

where Y and Θ again denote the count and frequency matrices for the entire

cluster together.
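As an illustration, the cluster strength can be computed directly from the count matrices of the cluster members. This is a sketch that follows the expression as written above, without adding any constant factors that the expression does not show; the function names and the w x 4 list layout are assumptions made for the example. The prior term uses the ratio of the Section 5.7 prior for one cluster of size m, c(m-1)!, to that for m singleton clusters, c^m, which gives (m-1)!/c^(m-1).

```python
import math

def log_marginal(counts, alpha):
    """log of prod_j [prod_k Gamma(Y_jk + alpha)] / Gamma(sum_k Y_jk + 4*alpha)."""
    total = 0.0
    for row in counts:
        total += sum(math.lgamma(y + alpha) for y in row)
        total -= math.lgamma(sum(row) + 4 * alpha)
    return total

def cluster_strength(motifs, alpha, c):
    """Log Bayes factor: all motifs in one cluster versus each motif alone.

    motifs : list of m count matrices (each w rows x 4 columns)
    alpha  : Dirichlet pseudo-count (assumed scalar)
    c      : prior weight of the Dirichlet process
    """
    m = len(motifs)
    w = len(motifs[0])
    # Count matrix Y for the whole cluster: element-wise sum over members.
    pooled = [[sum(mat[j][k] for mat in motifs) for k in range(4)] for j in range(w)]
    log_lik_ratio = log_marginal(pooled, alpha) - sum(log_marginal(mat, alpha) for mat in motifs)
    log_prior_ratio = math.lgamma(m) - (m - 1) * math.log(c)  # log[(m-1)! / c^(m-1)]
    return log_lik_ratio + log_prior_ratio
```

With this form, two members that agree at a position contribute more to the strength than two members that disagree, and a cluster with a single member has strength zero.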

The clusters within our best partition can then be ranked by this measure of clus-

ter strength, giving us an extra measure of confidence/uncertainty about infer-

ence based upon a specific cluster. In Table 6.2 above, the 16 clusters from our

best partition are ranked from strongest to weakest. It is clear from the table that

this measure of cluster strength is quite dependent upon the size of the cluster:

larger clusters tend to have a higher value of cluster strength.

We can also measure clustering strength at the level of individual motifs within

our best partition by calculating, for each motif, the posterior probability that it

should belong to that cluster, as opposed to any of the other existing clusters or

being its own cluster. For each motif i, this posterior probability p(zi|z−i,Y) is

the same calculation that is performed during each iteration of our Gibbs sam-

pling algorithm, but in this case we are conditioning on the best partition i.e.,

p(zi|z−i,Y).

For most of the motifs in Table 6.2, the individual clustering probabilities are very

close to 1. The two motifs MA0084 and MA0013 mentioned in the previous section

have individual clustering probabilities of 1.000 and 0.983 respectively, indicating

some variability but otherwise an overwhelming tendency to be in their assigned

cluster.

There are also a few motifs that show a much higher variability for being in their

particular cluster. MA0025 and MA0102 have probabilities of 0.575 and 0.071 of

being in Cluster 10, while MA0027 and MA0044 both have probabilities of 0.240

of being in Cluster 16.

Given that both motifs MA0027 and MA0044 have low individual clustering prob-

abilities, it is not surprising that Cluster 16 is also the weakest cluster on our

Cluster Strength measure. In many large clustering datasets, such as the ones we

will encounter in Chapter 7, it may be advisable to eliminate these weaker motifs

from the best partition.

6.4 Examining Particular Clusters in Detail

We can examine our best partition in detail by looking at the sequence logos for

individual clusters. Figure 6.2 shows the sequence logos for cluster 1 (containing

only NUCLEAR motifs) and cluster 2 (containing mostly FORKHEAD motifs)

from Table 6.2 along with the sequence logos across the entire NUCLEAR and

FORKHEAD families.

Figure 6.2: Sequence logos for clusters 1 and 2, with families. [Panels: Cluster 1; Entire NUCLEAR family; Cluster 2; Entire FORKHEAD family.]

Not surprisingly, clusters 1 and 2 show much higher motif conservation than the motifs that represent the entire NUCLEAR and FORKHEAD families, respectively. The application of our clustering model has allowed us to identify highly-conserved subgroups within several of the protein families present in our dataset. This subgroup information can be used to further improve motif discovery for additional motifs belonging to this protein family, since motifs based on these clustered subgroups should be easier to detect in large sequence databases than the weaker motif based on the entire protein family.

6.5 Effect of Prior Specification on Clustering Results

In Section 5.7, we observed dramatic differences in terms of number of clusters

and average cluster size between partitions z generated directly from the Dirich-

let process prior compared to the Uniform clustering prior. Even though the two

priors seem quite different, we should also examine the posterior clustering re-

sults for the TF motif data between models with the Dirichlet Process prior and

the Uniform prior.

The distribution (over all partitions produced by the Gibbs sampler) of the num-

ber of multiple-member clusters and the average size of these clusters is given in

Figure 6.3.

Figure 6.3: Clustering statistics between Uniform and DP models. [Four histograms, y-axis Frequency: Number of Clusters (DP); Number of Clusters (Unif); Average Cluster Size (DP); Average Cluster Size (Unif).]

The uniform clustering model tends to produce somewhat larger numbers of

clusters with a somewhat smaller average cluster size. However, although some

difference is evident between the two models in terms of these cluster character-

istics, the results are not nearly as dramatic when compared to the prior simula-

tions in Section 5.7.

We also examined the differences between our Dirichlet process and uniform

clustering models in terms of the clustering trees (Section 6.1) and best parti-

tions (Section 6.2). Figure 6.4 gives the clustering trees produced under both the

Dirichlet process and uniform clustering models.

The clustering trees in Figure 6.4 are nearly identical except for a couple of arbitrary

differences in the ordering of the branches. The best partitions found with the

Uniform and Dirichlet process models are identical, both in terms of the clusters

themselves and their ranking by strength.

Figure 6.4: Comparison of clustering trees between Uniform and DP models. [Panels: Clustering Tree (DP); Clustering Tree (Unif).]

The Dirichlet process prior and Uniform prior give dramatically different clus-

tering results based upon prior simulation alone, but show very slight differ-

ences in the posterior clustering results of our TF motif application. However,

other datasets may show a larger influence of the prior specification on the pos-

terior clustering results. Green and Richardson (2001) demonstrate with several

datasets that the unequal allocations favored by the Dirichlet process priors can

persist in the posterior distribution.

6.6 Effect of w on Clustering Results

A natural question that arises with our clustering model is whether the clustering

results would be dramatically different if the width of the central motif w was

different. Some of the differences resulting from using different motif widths

should be negated by the motif alignment steps (Section 5.4) which can vary the

central motif within each raw motif matrix, but some effects of using different

motif widths will still persist.

In order to examine the effect of motif width, our model was used to cluster

our 116 motif dataset using a width of 6 bps and 10 bps, in addition to the 8

bps model studied thus far. Any motifs that were shorter than the central motif

width were excluded from the clustering procedure. Using a motif width of more

than 10 bps would exclude a substantial portion of our 116 motifs, as can be

seen from Figure 6.5 which gives motif width distribution in the dataset. The

obvious trend in Figure 6.6 is that lower motif width leads to a higher number

of clusters and more motifs included in clusters. This same trend can be seen in

the best partitions for each width. The best partition for the w = 6 model has

29 clusters containing 80 motifs, the best partition for the w = 8 model has 16

clusters containing 42 motifs, and the best partition for the w = 10 model has 11

clusters containing 30 motifs.

Figure 6.5: Distribution of motif widths in dataset. [Histogram; x-axis: Motif Width, 4 to 30 bps.]

As mentioned in Section 6.1, using an 8-bp central motif includes 90 out of 116

motifs in our dataset, compared with 111 out of 116 motifs when using a width

of 6 bps and 72 out of 116 motifs when using a width of 10 bps. Figure 6.6 shows

the clustering trees for our motif dataset with central motif widths of 6, 8 and 10

bps.

This trend is a result of two separate factors. The first is that a smaller motif

width allows more motifs to be included in the clustering procedure since we

only exclude motifs that have width less than our central motif width. The second

factor is that a smaller motif width essentially relaxes the criteria for two motifs

to cluster together, since the number of motif positions that need to be similar

between the two motifs is reduced.

Figure 6.6: Comparison of clustering trees using different motif widths. [Panels: Clustering Tree with w = 6; Clustering Tree with w = 8; Clustering Tree with w = 10.]

We see the effect of both of these factors when examining the best partitions for

each width in detail. Table 6.3 shows the five strongest clusters in the best parti-

tions for each of the three motif widths.

Table 6.3: Top five clusters for all three motif widths

w   Clus  Size  Strength  Consensus   Families           Motifs
6   1     10    341.3     aGGTCA      NUCLEAR (10/10)    MA0016 MA0066 MA0071 MA0072 MA0074 MA0117 MA0111 MA0113 MA0114 MA0115
6   2     5     179.5     CACGTG      bHLH-ZIP (4/5)     MA0058 MA0059 MA0093 MA0104
                                      bHLH (1/5)         MA0004
6   3     4     117.9     TAAACA      FORKHEAD (4/4)     MA0030 MA0033 MA0032 MA0031
6   4     4     98.9      CGGAAg      ETS (4/4)          MA0026 MA0028 MA0062 MA0076
6   5     3     89.0      TGACGT      bZIP (3/3)         MA0018 MA0096 MA0097
8   1     5     187.7     gTAGGTCA    NUCLEAR (5/5)      MA0066 MA0071 MA0072 MA0111 MA0117
8   2     5     140.9     gTAAACAa    FORKHEAD (4/5)     MA0030 MA0033 MA0032 MA0031
                                      ZN-FINGER (1/5)    MA0013
8   3     3     72.0      GgaTTTCC    REL (3/3)          MA0023 MA0101 MA0107
8   4     3     70.9      aCCGGAAg    ETS (3/3)          MA0028 MA0062 MA0076
8   5     2     46.0      AAgcGAAA    TRP-CLUSTER (2/2)  MA0050 MA0051
10  1     6     183.9     GGGgaTTtCC  REL (6/6)          MA0022 MA0023 MA0061 MA0101 MA0105 MA0107
10  2     4     102.6     acgTAAAcAa  FORKHEAD (3/4)     MA0030 MA0033 MA0032
                                      ZN-FINGER (1/4)    MA0013
10  3     3     65.8      aTgTTTgtTT  FORKHEAD (3/3)     MA0042 MA0041 MA0047
10  4     2     60.6      gTaGGTCAcg  NUCLEAR (2/2)      MA0066 MA0117
10  5     2     55.0      gAAAgcGAAA  TRP-CLUSTER (2/2)  MA0050 MA0051

The top 5 clusters share many common elements between the different partitions corresponding to the different width models. The strongest cluster in the w = 6 results contains the NUCLEAR motifs that are present in the strongest cluster in the w = 8 results as well as the fourth strongest cluster in the w = 10 results, but as noted from the clustering trees, the w = 6 NUCLEAR cluster contains a larger number of motifs. This is partly due to our first factor since two of these motifs are short enough (MA0114 - 9 bps, MA0115 - 7 bps) to be excluded by one or two of the clustering procedures. The second factor is present in this case as well, since the extra motif MA0016 has a consensus sequence of (GGGGTCACGg), which has a central matrix that matches the other NUCLEAR motifs well enough in the w = 6 case, but not if the central motif is expanded to w = 8 or w = 10.

The FORKHEAD cluster seen in Table 6.3 is also common to all three motif width

partitions, though again some differences are observed. Almost all the other

clusters appearing in Table 6.3 share some common motifs with clusters in the

other best partitions, but in several cases these clusters were not among the five

strongest. The one exception is the ETS cluster, which is fourth strongest for both

w = 6 and w = 8, but not among the w = 10 clusters.

The w = 8 clustering model has been the focus of our investigation because it

serves as a compromise between the w = 6 model and the w = 10 model. The

w = 6 model, though allowing for these extra motifs to be included in the cluster-

ing procedure, has the potential disadvantage that it may not be specific enough

to pick out biologically interesting clusters of motifs, i.e., many spurious clusters

may result from making the motif width too short. On the other hand, the w = 10

model will have a greatly reduced chance of spurious clusters, but may be too

restrictive for the clustering of similar motifs, in addition to removing a large

proportion (44 out of 116) of short motifs in our dataset from the clustering procedure.

Chapter 7

Prediction of Co-Regulated Genes

In this final application, we combine the statistical methods for motif discovery

presented in Chapters 2-4 and motif clustering, presented in Chapters 5-6, to predict

sets of co-regulated genes in our target organism, the bacterium Bacillus subtilis.

This combined procedure relies solely on publicly available genomic sequence

information, and thus avoids the limitations of gene expression microarray data

that were mentioned in Chapter 1.

In Figure 7.1, a flowchart of our sequence-based technique is presented along

with the contrasting steps of a typical microarray procedure. Note that the ordering

of the final two steps is reversed between the two procedures, indicating

that our sequence-based strategy will cluster genes based upon discovered mo-

tifs, whereas the usual microarray strategy clusters genes together before motif

discovery.

In Section 7.1, we present our procedure for forming sets of orthologous genes

between bacterial species related to B. subtilis and our focus on both a Studyset

dataset as well as a whole genome dataset. The application of our statistical

model for motif discovery to the datasets is presented in Section 7.2. In Section 7.3, we discuss the application of our statistical model for motif clustering to the motifs discovered in Section 7.2. The results of our clustering procedures are analysed and validated in Section 7.4 for our Studyset and Section 7.6 for our whole genome datasets.

Figure 7.1: Microarray and sequence-based gene clustering procedures. [Flowchart with two columns. Microarray Experiment: Cellular mRNA, then (Microarray Experiment) Gene Expression, then (Clustering) Co-regulated Genes, then (Motif Discovery) Motifs. Sequence-only Strategy: Whole Genome, then (Orthologue Detection) Orthologous Genes, then (Motif Discovery) Motifs, then (Clustering) Co-regulated Genes.]

7.1 Collection of Orthologous Gene Sets

The complete genome sequences and gene annotations for our target organism,

Bacillus subtilis, and an additional 6 bacterial species (summarized in Table 7.1)

were downloaded from the National Center for Biotechnology Information website (www.ncbi.nlm.nih.gov).

Table 7.1: Bacterial species included in the study

Species                     Reference               Genome Size  Number of Genes
Bacillus anthracis          Read et al. (2003)      5.2 Mb       5738
Bacillus halodurans         Takami et al. (2000)    4.2 Mb       4066
Bacillus subtilis           Kunst et al. (1997)     4.2 Mb       4103
Clostridium acetobutylicum  Nolling et al. (2001)   3.9 Mb       3739
Clostridium perfringens     Shimizu et al. (2002)   3.0 Mb       2660
Listeria innocua            Glaser et al. (2001)    3.0 Mb       2989
Oceanobacillus ihenysis     Takami et al. (2002)    3.6 Mb       3496

The complete genome was also available for Listeria monocytogenes (Glaser et al.,

2001), but this species was excluded from the study since it was determined to be

almost identical to Listeria innocua, and therefore would contribute little extra

information.

Orthologous genes between B.subtilis and each of the other species listed in Ta-

ble 7.1 were identified using a reciprocal BLAST best-hit procedure (Remm et al.,

2001) consisting of three basic steps:

1. For each gene in B.subtilis, the gene in the other species that had the most

significant protein sequence similarity was found by using the program

BLASTP (Altschul et al., 1990).

2. For each gene in the other species, the gene in B.subtilis that had the most

significant protein sequence similarity was found, again using the program

BLASTP.

3. Any B.subtilis genes that were matched with the same gene from the other

species in both step 1 and step 2 were classified, along with their matched

gene from the other species, as an “orthologous gene pair”.
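The three steps above can be sketched as follows, assuming the BLASTP results have already been reduced to (query, subject, E-value) tuples. The tuple layout, the function names, the gene identifiers in the usage note and the use of the E-value as the significance measure are assumptions made for this illustration; they are not taken from the procedure of Remm et al. (2001) or from BLASTP itself.

```python
def best_hits(hits, max_evalue=1e-10):
    """Map each query gene to its most significant subject gene.

    hits : iterable of (query, subject, evalue) tuples (hypothetical layout)
    Hits with an E-value above `max_evalue` are ignored.
    """
    best = {}
    for query, subject, evalue in hits:
        if evalue > max_evalue:
            continue
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return {query: subject for query, (subject, _) in best.items()}

def reciprocal_best_hits(forward_hits, reverse_hits, max_evalue=1e-10):
    """Return sorted (B.subtilis gene, other-species gene) orthologous pairs.

    forward_hits : B.subtilis genes searched against the other species (step 1)
    reverse_hits : other-species genes searched against B.subtilis (step 2)
    A pair is kept only if each gene is the other's best hit (step 3).
    """
    forward = best_hits(forward_hits, max_evalue)
    reverse = best_hits(reverse_hits, max_evalue)
    return sorted(
        (gene, partner)
        for gene, partner in forward.items()
        if reverse.get(partner) == gene
    )
```

For instance, if a hypothetical gene `bsuB` has best hit `x2` but the best hit of `x2` is `bsuA`, then `bsuB` and `x2` are not reported as an orthologous pair.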

For each of the BLASTP matching procedures, an arbitrary threshold for signif-

icance must be specified. We used a very conservative significance threshold of

$10^{-10}$. The seven species in our study are also summarized by the phylogenetic tree in Figure 7.2, which gives an indication of which species are more related to our target organism B.subtilis. To construct this phylogenetic tree, 530 sets of orthologous genes that contained all seven species were globally aligned by ClustalW (Thompson et al., 1994) using the amino-acid sequence of each gene. Next, a phylogenetic tree for each gene was inferred from protein alignments using a parsimony optimality criterion with the software program PHYLIP v3.573 (Felsenstein, 1993). Finally, a majority-rule consensus tree was constructed based on the 530 separate gene phylogenies using PHYLIP (Felsenstein, 1993).

Figure 7.2: Phylogenetic tree of seven related bacterial species. [Tree whose leaves are the seven species of Table 7.1.]

Table 7.2 gives the number and proportion of orthologous genes between B.subtilis

and each species.

Table 7.2: Orthologous gene pairs with B.subtilis

Species compared with B.subtilis  Number of Orthologous Genes  Proportion of Orthologous Genes
Bacillus anthracis                1128                         0.20
Bacillus halodurans               1022                         0.25
Clostridium acetobutylicum        531                          0.14
Clostridium perfringens           482                          0.18
Listeria innocua                  689                          0.23
Oceanobacillus ihenysis           962                          0.28

The orthologous gene pairs for each gene in B.subtilis were collected across all six

other species into 1516 “orthologous gene sets”.

We collected the regulatory region sequence for each gene in our orthologous

gene sets, which we defined as the 500-bp sequence located immediately upstream

of the translation start site for each gene. However, this regulatory region

sequence was not allowed to overlap with the coding sequence of the previous

gene in the genome. The rationale behind this restriction is that the coding regions

of genes rarely contain binding sites for regulatory proteins,

and even when they do, the coding regions will show a generally high degree of

conservation across species, making the discovery of conserved elements more

difficult.

In the case where the end of the coding sequence of the previous gene was within

500 basepairs of the translation start site, the regulatory region sequence would

be restricted to only the sequence between the two coding regions (i.e., the intergenic

region). In bacterial genomes, genes are often organized into operons, which

consist of several genes that are transcribed together under the control of a single

regulatory region. To avoid including sequences contained within an operon, any

sequences which had an intergenic length of less than 50 basepairs were excluded

from the study.
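The regulatory region rules above can be summarized in a short sketch. It assumes 0-based coordinates on the forward strand of a linear sequence, with `prev_end` marking the first position after the previous gene's coding sequence; reverse-strand genes and circular genomes are not handled. The function name and arguments are hypothetical.

```python
def upstream_region(genome, start, prev_end, max_len=500, min_len=50):
    """Regulatory region immediately upstream of a translation start site.

    genome   : genome sequence as a string
    start    : index of the first base of the gene's coding sequence
    prev_end : index just past the last coding base of the previous gene
    Returns None when fewer than `min_len` bases are available, which is
    how a gene inside an operon is excluded in this sketch.
    """
    # Take up to max_len bases, but never reach into the previous coding region.
    left = max(start - max_len, prev_end, 0)
    if start - left < min_len:
        return None
    return genome[left:start]
```

With the default settings, a gene whose upstream neighbour ends 200 bps before its start site yields a 200-bp region, while one whose neighbour ends 40 bps before it yields no region at all.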

The number of sequences in each orthologous gene set (OGS) varied due to both

these length restrictions as well as the fact that orthologous genes were found in

some species but not others. At least two sequences were required to form an

orthologous gene set, a B.subtilis sequence and at least one sequence in another

species.

A second dataset considered in this investigation was a subset of 172 ortholo-

gous gene sets for which verified TF binding sites in B.subtilis were known. This

“Studyset” was used to validate and fine-tune our methods of motif discovery

and motif clustering before applying these methods to the full OGS dataset.

The distributions of the number of sequences in the full or “whole genome” OGS

dataset and the “Studyset” OGS dataset are given in Table 7.3 below.

Table 7.3: Sequence distributions for each dataset

      Whole Genome                            Studyset
X     Number with X seqs  Proportion with X seqs  Number with X seqs  Proportion with X seqs
2     163                 0.11                    14                  0.08
3     405                 0.27                    40                  0.23
4     388                 0.26                    47                  0.27
5     256                 0.17                    23                  0.13
6     171                 0.11                    28                  0.16
7     133                 0.09                    20                  0.12
Total 1516                1.00                    172                 1.00

7.2 Motif Discovery

The upstream regulatory regions for each orthologous gene set enumerated in

Table 7.3 form a small (2-7 sequences with a maximum length of 500 bps each)

sequence dataset which we hypothesize contains TF binding motifs that have

been conserved by evolution.

We will apply the motif discovery strategies outlined in Chapters 2-4 to find these

conserved motifs. Specifically, our motif discovery procedure involves the motif-

finding program BioProspector (Section 1.4) and our scoring function optimiza-

tion algorithm, BioOptimizer (Chapter 3). BioProspector was used as the motif-

finding program because it has the capability to find both one and two-block

motifs.

However, our motif discovery techniques (Chapters 2-4) have previously focused

on the situation where a single conserved motif is present in a sequence dataset,

whereas each of these OGS sequence datasets could contain multiple different

conserved motifs. We will adapt our single-motif methods to this multiple mo-

tif situation by using an iterative-masking strategy. The best single motif will be

found by our single-motif methods, and then this motif will be removed from

the dataset and our single-motif methods reapplied to the modified dataset. This

process can be repeated several times in order to find multiple motif signals. This

same strategy is applied in the program AlignAce (Roth et al., 1998) and is men-

tioned in Section 2.5 on multiple motif strategies.

For a particular OGS sequence dataset, the following motif discovery procedure

was applied to find conserved one-block motifs.

1. The motif-finding program BioProspector (Liu et al., 2001) was used to find

the top five one-block motifs. Since the motif width w must be pre-specified

for BioProspector, the program was run separately for 12 different widths

(8, 10, . . . , 30). For each width, the top five motifs were collected.

2. Since BioProspector is a stochastic motif-finding algorithm, independent

runs of the program may give different results. To account for this fact, we

repeated Step 1 three times for each width, resulting in a total of 3×5×12 =

180 BioProspector motifs, many of which might be identical or close to iden-

tical.

3. Each of these 180 motifs was separately scored and optimized using BioOp-

timizer. BioOptimizer also allows the motif width to vary in order to find

the best possible motif signal. The motif with the highest BioOptimizer

score was retained as the “best motif”.

4. BioOptimizer also calculates a “null score” (Section 3.5) which is the score

that a motif with no sites would have in the given sequence dataset. If the

best motif had a score less than the null score, then it was removed from

consideration and the motif discovery procedure for that OGS sequence

dataset was stopped.

5. If the best motif had a score greater than the null score, it was retained for

the motif clustering procedure to follow. This best motif was then “masked

out” of the sequence dataset by replacing all the motif binding sites with

characters that are ignored by the BioProspector and BioOptimizer pro-

grams.

6. With this new “masked” sequence dataset, the entire motif-finding procedure (steps 1-5) is repeated until no motifs are discovered that have a BioOptimizer score greater than the null score.
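The six steps above amount to a find, score, mask loop. The following Python sketch illustrates only that control flow and does not call BioProspector or BioOptimizer. The names `iterative_masking`, `mask_sites` and `toy_finder`, the `N` mask character, and the treatment of a score exactly equal to the null score as a stop are assumptions made for illustration; the toy finder simply picks the most frequent exact 4-mer as a stand-in for a real motif search.

```python
from typing import Callable, List, Optional, Tuple

Site = Tuple[int, int]                 # (sequence index, start position)
Motif = Tuple[float, int, List[Site]]  # (score, width, predicted sites)


def mask_sites(seqs: List[str], sites: List[Site], width: int,
               mask_char: str = "N") -> List[str]:
    """Replace every predicted site with a character the finder ignores."""
    masked = [list(s) for s in seqs]
    for seq_idx, start in sites:
        for pos in range(start, min(start + width, len(masked[seq_idx]))):
            masked[seq_idx][pos] = mask_char
    return ["".join(chars) for chars in masked]


def iterative_masking(seqs: List[str],
                      find_best_motif: Callable[[List[str]], Optional[Motif]],
                      null_score: Callable[[List[str]], float],
                      max_rounds: int = 20) -> List[Motif]:
    """Repeat find -> compare with null score -> mask until nothing beats the null."""
    found: List[Motif] = []
    current = list(seqs)
    for _ in range(max_rounds):
        best = find_best_motif(current)
        # Assumption: a score equal to the null score also stops the search.
        if best is None or best[0] <= null_score(current):
            break
        found.append(best)
        _, width, sites = best
        current = mask_sites(current, sites, width)
    return found


def toy_finder(seqs: List[str], k: int = 4) -> Optional[Motif]:
    """Hypothetical stand-in finder: the most frequent unmasked exact k-mer."""
    occurrences = {}
    for seq_idx, seq in enumerate(seqs):
        for start in range(len(seq) - k + 1):
            kmer = seq[start:start + k]
            if "N" not in kmer:
                occurrences.setdefault(kmer, []).append((seq_idx, start))
    if not occurrences:
        return None
    best = max(occurrences, key=lambda km: (len(occurrences[km]), km))
    return float(len(occurrences[best])), k, occurrences[best]
```

Calling `iterative_masking(seqs, toy_finder, lambda s: 1.0)` on a few short sequences returns the motifs in the order they were found; in a real pipeline the two callables would wrap the motif finder and the scoring function.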

Since each discovered motif is iteratively masked out of the sequence dataset, we

avoid re-discovering the same strong motif signal over and over again. The null

score cut-off criterion helps to avoid the discovery of weak motif signals that are

not biologically relevant. Applying this iterative-masking one-block motif dis-

covery strategy to each OGS sequence dataset separately results in several dis-

covered one-block motifs (summarized as count matrices) associated with each

orthologous gene set. Our entire motif discovery procedure is summarized in

Figure 7.3. Since our final goal is to cluster B. subtilis genes using similarity of discovered motifs, we require that any discovered motif contain at least one predicted B. subtilis site. This criterion is necessary because neither BioProspector nor BioOptimizer is restricted to finding sites in every sequence, and we do not want to include any discovered motifs that are not present in B. subtilis.

An iterative-masking motif discovery procedure was also performed to find two-

block motifs with variable gaps. The two-block procedure was virtually identical

to the one-block procedure, except that in the first step, the top five two-block

motifs were found by BioProspector separately for 7 different two-block widths

(8−8, 10−10, . . . , 20−20) with a gap range of 12-15 bps, for a total of 3×5×7 = 105

two-block motifs that were subsequently optimized by BioOptimizer. The gap

range of 12-15 bps was used because that length roughly corresponds to a single

rotation of the DNA double helix, so that the two blocks of our motif are on the

same edge of the DNA double helix.
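The candidate counts quoted for the one-block and two-block searches follow directly from the width grids. A minimal sketch of that arithmetic is given below; the variable and function names are illustrative only.

```python
# Width settings described in the text.
one_block_widths = list(range(8, 31, 2))              # 8, 10, ..., 30
two_block_widths = [(w, w) for w in range(8, 21, 2)]  # 8-8, 10-10, ..., 20-20

REPEATS = 3      # independent runs per width setting
TOP_MOTIFS = 5   # motifs kept per run


def n_candidates(n_width_settings: int,
                 repeats: int = REPEATS,
                 top: int = TOP_MOTIFS) -> int:
    """Number of candidate motifs handed to the optimization step."""
    return repeats * top * n_width_settings
```

With these settings, `n_candidates(len(one_block_widths))` gives the 180 one-block candidates and `n_candidates(len(two_block_widths))` the 105 two-block candidates.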


Figure 7.3: Flowchart for motif discovery procedure

7.3 Clustering Genes Based on Discovered Motifs

The one-block and two-block motifs found within each OGS sequence dataset

were combined to form a large collection of motifs, which will be clustered using

the motif clustering model described in Chapter 5.

For the 172 OGSs in the Studyset dataset, we found 81 one-block motifs and

168 two-block motifs that met our criterion (BioOptimizer score greater than null

score) for inclusion in the clustering procedures. For the 1516 OGSs in the whole


genome dataset, we found 1025 one-block motifs and 1416 two-block motifs that

met our criterion for inclusion in the clustering procedures. As mentioned in

Section 5.5, we have two clustering strategies for handling the two-block motifs.

In the Independent-Block strategy, we treat each of the two-block motifs as two sep-

arate and independent single block motifs and cluster these single block motifs

together with the discovered one-block motifs, for a total of 417 independent-block

motifs in the Studyset dataset, and 3466 independent-block motifs in the whole

genome dataset. Our one-block independent motifs were clustered based on a “central motif”, though each central motif is allowed to shift within the raw motif matrices, as outlined in Section 5.4.

In the Joint-Block strategy, we treat each of the two blocks of a two-block motif to-

gether as one joined motif, and cluster these two-block motifs separately from the

discovered one-block motifs, for a total of 168 joint-block motifs in the Studyset,

and 1416 joint-block motifs in the whole genome dataset. Our joint-block motifs were clustered based on a “central motif” in each of the two blocks, with shifting again allowed as described in Section 5.4.

Following the guidelines mentioned in Section 6.6 for our TF dataset, our indepen-

dent-block clustering model was implemented with a central motif of 8 bps. A

longer motif width might be too restrictive, especially when considering that the

small size of each OGS sequence dataset may lead to generally weaker motif sig-

nals. However, a shorter motif width might not be restrictive enough, result-

ing in too many spurious clusters that are not biologically relevant. In the joint-

block clustering model, both blocks contribute to the clustering probabilities, so

a shorter central motif of 6 bps was used in either block of the two-block motif

matrices.


We used the same implementation procedure for our clustering model as de-

scribed in Chapter 6, except that this application has two different datasets (study-

set and whole genome), each with two sets of motifs (independent-block and

joint-block). Multiple Gibbs sampler chains were run from several different ini-

tial configurations described in Chapter 6, and convergence was evaluated using

the R statistic as described at the beginning of Chapter 6.

Our results for both clustering procedures are described in Sections 7.4-7.5 for

the studyset dataset, and Sections 7.6-7.7 for the whole genome dataset. Our

predicted clusters are subjected to detailed examination, and are also evaluated

using four external validation measures described in the following section.

7.3.1 Validation of Gene Clusters

In order to evaluate our motif discovery and clustering procedures in terms of

ability to predict co-regulated gene clusters, we constructed several validation

measures based upon external information:

1. Functional Category Over-Representation

2. Known TF Over-Representation

3. Gene Expression: Median Within-Cluster Correlation

4. Gene Expression: Average Within-Cluster Variance

The first validation measure is to examine whether or not the predicted clus-

ters tend to contain genes with the same function. Kunst et al. (1997) classified

B.subtilis genes using a set of functional categories, which are available on the


Subtilist website (Moszer et al., 1995). The functional categories can be tabulated

across all genes in each cluster, and the program GeneMerge (Castillo-Davis and

Hartl, 2003) can be used to calculate a p-value for over-representation of a partic-

ular functional category in a given cluster. This p-value is calculated under the

assumption of a Hypergeometric distribution with a Bonferroni correction.

If our clustering procedure is effective, we expect to find over-representation of

functional categories in our predicted clusters. This measure is limited by the

granularity and imperfections in the functional classifications.
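The over-representation p-value described above can be sketched as a hypergeometric upper tail followed by a Bonferroni correction. This is a minimal illustration and not GeneMerge itself: the function names, the use of the upper tail P(X ≥ k), and the choice of the number of tested categories as the Bonferroni multiplier are assumptions.

```python
from math import comb


def hypergeom_upper_tail(k: int, n: int, K: int, N: int) -> float:
    """P(X >= k) when n genes are drawn without replacement from N genes,
    K of which carry the category of interest."""
    top = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, top + 1))
    return favourable / comb(N, n)


def bonferroni(p_value: float, num_tests: int) -> float:
    """Bonferroni-corrected p-value, capped at 1."""
    return min(1.0, p_value * num_tests)
```

For instance, with a hypothetical population of 10 genes of which 3 carry a category, a cluster of 2 genes that both carry it has an uncorrected p-value of `hypergeom_upper_tail(2, 2, 3, 10)`, i.e. 3/45.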

The second validation measure is to examine whether or not the predicted clus-

ters tend to contain genes that are known to be controlled by the same transcrip-

tion factor (TF) protein. A list of 650 known TF-gene interactions from the DBTBS

database was provided (Makita et al., 2004) and used to tabulate known TF inter-

actions for genes within each predicted cluster.

Again, the program GeneMerge (Castillo-Davis and Hartl, 2003) can be used to

calculate a p-value for over-representation of interactions with a particular TF

protein in a given cluster. If our clustering procedure has been effective, we ex-

pect to find that a large proportion of genes in a predicted cluster will have an

interaction with a known TF. This measure is limited, however, by the small size of our interaction list, which presumably catalogues only a minuscule fraction of the true gene-TF interactions.

Another validation measure is to examine gene expression patterns within pre-

dicted clusters to see if genes within particular clusters are co-expressed across

a variety of conditions. Our expression dataset consists of ratios of differential

expression on cDNA microarrays from seven different experimental conditions

in B.subtilis (Conlon et al., 2004). For a particular cluster, we considered two different measures of microarray co-expression.

The first expression measure was to calculate the median pairwise correlation S

within a cluster. The Pearson correlation was calculated between each possible pair of genes in a particular cluster, and then the median value of these correlations was taken to be the median within-cluster correlation, S. Since two genes

in the same cluster might be regulated by the same TF but in opposite ways (one

repressed while the other is promoted), we use the absolute value of the correla-

tion.
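The median within-cluster correlation S can be sketched as follows. The function names are illustrative, each gene is assumed to be represented by a list of expression ratios (one per experimental condition), and profiles with zero variance are not handled.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, median
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length expression profiles.
    Raises ZeroDivisionError if either profile is constant."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def median_within_cluster_correlation(profiles: List[Sequence[float]]) -> float:
    """S: median of |Pearson correlation| over all pairs of genes in a cluster."""
    return median(abs(pearson(x, y)) for x, y in combinations(profiles, 2))
```

Because the absolute value is taken, a repressed gene and an activated gene with mirror-image profiles contribute a correlation of 1 rather than -1.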

The second expression measure is the average within-cluster variance, which is

calculated for a cluster with n members as

\[
T = \frac{1}{7} \sum_{i=1}^{7} \left[ \frac{1}{n} \sum_{j=1}^{n} \left( x_{ij} - \bar{x}_{i} \right)^{2} \right] \tag{7.1}
\]

where $x_{ij}$ is the differential expression ratio for gene $j$ in experimental condition $i$, and $\bar{x}_{i}$ is the differential expression ratio in experimental condition $i$ averaged over all genes in the cluster.
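Equation 7.1 can be computed directly from the expression profiles of the genes in a cluster. The sketch below assumes each gene is given as a list of ratios, one per condition (seven in this dataset, but the code accepts any number); the function name is illustrative.

```python
from typing import List, Sequence


def average_within_cluster_variance(profiles: List[Sequence[float]]) -> float:
    """T: for each condition, the variance (divisor n) of the ratios across
    the n genes in the cluster, averaged over all conditions."""
    n_genes = len(profiles)
    n_conditions = len(profiles[0])
    total = 0.0
    for i in range(n_conditions):
        column = [gene[i] for gene in profiles]       # ratios in condition i
        centre = sum(column) / n_genes                # mean over genes
        total += sum((v - centre) ** 2 for v in column) / n_genes
    return total / n_conditions
```

Genes with identical profiles give T = 0, and T grows as the profiles diverge within each condition.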

If our motif discovery and clustering procedure has effectively parsed our genes

into co-regulated clusters, then we would expect low values of the average within-

cluster variance, T , and high values of the median within-cluster correlation, S.

We can estimate p-values for S and T for our predicted clusters by simulation, i.e., by comparing our observed S and T to many values of S and T calculated for

randomly-generated clusters. Since S and T will depend on the size of the pre-

dicted cluster, we simulated many random values of S and T for each possible

cluster size.
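The simulation described above can be sketched as a comparison against randomly drawn clusters of the same size. The function name, the default number of simulations, the fixed seed, and the use of a plain proportion as the p-value are assumptions; `statistic` stands for any function that maps a list of gene profiles to a number, such as S or T.

```python
import random
from typing import Callable, List, Sequence


def simulated_p_value(observed: float,
                      statistic: Callable[[List[Sequence[float]]], float],
                      population: List[Sequence[float]],
                      cluster_size: int,
                      n_sim: int = 1000,
                      larger_is_extreme: bool = True,
                      seed: int = 0) -> float:
    """Fraction of random clusters whose statistic is at least as extreme
    as the observed one. Use larger_is_extreme=True for S (high is good)
    and False for T (low is good)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sim):
        cluster = rng.sample(population, cluster_size)  # random genes, no repeats
        value = statistic(cluster)
        if larger_is_extreme:
            hits += value >= observed
        else:
            hits += value <= observed
    return hits / n_sim
```

Since the null distribution depends on cluster size, the function would be called once per distinct cluster size, or its simulated values cached per size.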

The main weaknesses of these expression-based measures are the limited number of experimental conditions present in our dataset, as well as the inherent sources of noise present in microarray data (reviewed by Tseng et al. (2001)).

Despite the fact that each measure has particular limitations, the use of several

measures simultaneously should give us a good idea of the effectiveness of our

motif discovery and clustering procedure.

7.4 Studyset Clustering Results

The overall studyset clustering results can be examined in the form of a clustering

tree (Section 6.1). As an example, the clustering tree for our 168 joint-block motifs

is given in Figure 7.4. Motifs are labelled in Figure 7.4 as genename-motifnum

since more than one motif was found in the upstream region of the OGS for many

of the B.subtilis genes.

The length of the branches is equal to 1 − pij where pij is the pairwise posterior

clustering probability of motifs i and j. For example, the two motifs bsaA-m1 and ywaC-m1 in the lower-left corner of the plot have a probability pij ≈ 1 of clustering together, but have pij ≈ 0 of clustering with any other motifs in the dataset.
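The branch lengths can be illustrated with a short sketch that turns a set of sampled cluster assignments into pairwise probabilities and then into distances 1 − pij. Estimating pij as the fraction of sampler draws in which two motifs share a cluster label is an assumption about how the probabilities are obtained; the function names are illustrative.

```python
from itertools import combinations
from typing import Dict, List, Sequence, Tuple

Pair = Tuple[int, int]


def pairwise_clustering_probabilities(samples: List[Sequence[int]]) -> Dict[Pair, float]:
    """samples[s][m] is the cluster label of motif m in sampler draw s.
    Returns, for each pair (i, j), the fraction of draws with equal labels."""
    n_motifs = len(samples[0])
    probs = {}
    for i, j in combinations(range(n_motifs), 2):
        together = sum(1 for labels in samples if labels[i] == labels[j])
        probs[(i, j)] = together / len(samples)
    return probs


def branch_lengths(probs: Dict[Pair, float]) -> Dict[Pair, float]:
    """Tree distance 1 - p_ij for every pair of motifs."""
    return {pair: 1.0 - p for pair, p in probs.items()}
```

The resulting distances could be passed to any standard hierarchical clustering routine to draw a tree of the kind shown in Figure 7.4.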

The large number of motifs under consideration in this project limits the amount

of information that one can gain from a tree over all the motifs. The clustering

tree for our 417 independent-block motifs is far too dense to be a useful visual

summary of the clustering results.

Figure 7.4: Clustering tree for studyset joint-block motifs

The best partition for each strategy, calculated according to Section 6.2, resulted in 125 independent block clusters (containing 374 out of 417 independent block motifs) and 44 joint block clusters (containing 112 out of 168 joint block motifs). Our best clustering partitions were then filtered to remove any motifs that had individual clustering probabilities (Section 6.3) that were less than 0.75. As well, the independent block clusters were filtered to remove any “redundant” motifs, i.e., two motifs from the same gene in the same cluster, which may have arisen

from one of the blocks of a two-block motif found in a particular OGS being

similar enough to cluster to a one-block motif found in that same OGS (since the

motifs of two separate one-block and two-block motif-finding procedures were

combined). This redundancy was not an issue in the joint block clustering, since

only the two-block discovered motifs were used for this dataset, and the motif

discovery procedure was designed to avoid redundant motifs.
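The two filtering rules can be sketched as follows. Several details are assumptions made for illustration and are not stated in the text: motif labels are taken to follow the genename-mN pattern used in Figure 7.4, the member with the highest individual clustering probability is kept when a gene appears twice in a cluster, and clusters left with fewer than two motifs are dropped.

```python
from typing import Dict, List


def filter_clusters(clusters: List[List[str]],
                    prob: Dict[str, float],
                    threshold: float = 0.75,
                    min_size: int = 2) -> List[List[str]]:
    """Drop low-probability motifs and redundant motifs from the same gene."""
    filtered = []
    for members in clusters:
        best_per_gene: Dict[str, str] = {}
        for motif in members:
            if prob[motif] < threshold:
                continue                       # low individual clustering probability
            gene = motif.rsplit("-m", 1)[0]    # "purE-m2" -> "purE"
            current = best_per_gene.get(gene)
            # Assumption: keep the most confidently clustered motif per gene.
            if current is None or prob[motif] > prob[current]:
                best_per_gene[gene] = motif
        kept = sorted(best_per_gene.values())
        if len(kept) >= min_size:              # assumption: singletons are discarded
            filtered.append(kept)
    return filtered
```

For the joint-block partition only the probability threshold would matter, since redundancy does not arise there.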


After filtering out redundant motifs and motifs with low individual clustering

probabilities, our independent block partition was reduced to 97 clusters con-

taining 271 motifs, while the joint block partition was reduced to 40 clusters con-

taining 99 motifs. A graphical representation of our clustering procedure for the

studyset dataset is given in Figure 7.5.

Figure 7.5: Flowchart for studyset motif clustering procedure

Flowchart stages: 172 orthologous genes; motif discovery yields 81 one-block motifs and 168 two-block motifs.

Independent blocks: 417 motifs; clustering gives 125 clusters; filtering leaves 97 clusters; evaluation finds 40% significant on at least one measure; further evaluation finds 10% significant on multiple measures.

Joined blocks: 168 motifs; clustering gives 44 clusters; filtering leaves 40 clusters; evaluation finds 45% significant on at least one measure; further evaluation finds 10% significant on multiple measures.

Figure 7.6 gives the distribution of cluster sizes for both the independent-block

and joint-block best partitions. We see that the joint-block clustering tends to


produce a higher proportion of small clusters, especially clusters that have only

two motif members. This is also reflected in the average cluster size, which is 2.8

motifs for the independent-block best partition and 2.5 motifs for the joint-block

best partition.

Figure 7.6: Distribution of cluster sizes for studyset best partitions

Panels: “Cluster Sizes − Best Partition − Studyset Indblock − DP” and “Cluster Sizes − Best Partition − Studyset Jointblock − DP”; horizontal axis: cluster size (2 to 8 motifs); vertical axis: count.

Both the independent block and joint block clusters were examined using the four

validation measures introduced in Section 7.3.1. All predicted clusters that were

significant (at an α = 0.1 level) on any of the four validation measures are shown in Table 7.4. The independent block clusters are given first, followed by the joint block clusters, and the list of clusters is ordered by the cluster strength statistic

described in Section 6.3, which is also given.


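Table 7.4: Significant studyset predicted clusters

clus | size | str | S | p | T | p | func | num | p | TF | num | p | mult
Ind 2 | 5 | 133.7 | - | - | - | - | Metabolism-nucs | 3/5 | 0.002 | PurR | 5/5 | 0.000 | ***
Ind 3 | 5 | 127.5 | - | - | - | - | Metabolism-nucs | 3/5 | 0.002 | PurR | 5/5 | 0.000 | ***
Ind 4 | 4 | 123.8 | - | - | - | - | - | - | - | Zur | 2/4 | 0.002 | -
Ind 5 | 6 | 113.4 | - | - | - | - | Sporulation | 4/6 | 0.029 | - | - | - | -
Ind 12 | 4 | 69.8 | - | - | 0.04 | 0.056 | - | - | - | - | - | - | -
Ind 13 | 3 | 67.1 | 0.88 | 0.053 | - | - | - | - | - | RocR | 2/3 | 0.006 | ***
Ind 16 | 4 | 58.8 | 0.94 | 0.011 | - | - | - | - | - | - | - | - | -
Ind 18 | 3 | 52.0 | - | - | 0.02 | 0.026 | - | - | - | - | - | - | -
Ind 20 | 3 | 51.7 | - | - | 0.02 | 0.046 | - | - | - | - | - | - | -
Ind 21 | 3 | 51.2 | 0.94 | 0.028 | - | - | - | - | - | - | - | - | -
Ind 22 | 3 | 48.3 | 0.84 | 0.076 | - | - | Sporulation | 3/3 | 0.013 | - | - | - | ***
Ind 23 | 3 | 47.8 | - | - | - | - | - | - | - | PurR | 2/3 | 0.031 | -
Ind 26 | 3 | 45.0 | 1.00 | 0.002 | - | - | - | - | - | SigE | 3/3 | 0.022 | ***
Ind 31 | 3 | 42.9 | 1.00 | 0.002 | - | - | - | - | - | - | - | - | -
Ind 32 | 3 | 42.9 | 0.88 | 0.054 | - | - | - | - | - | - | - | - | -
Ind 33 | 3 | 41.9 | - | - | - | - | - | - | - | SigW | 2/3 | 0.062 | -
Ind 36 | 3 | 41.3 | 0.84 | 0.076 | - | - | - | - | - | - | - | - | -
Ind 39 | 3 | 40.2 | - | - | - | - | - | - | - | SigW | 2/3 | 0.062 | -
Ind 41 | 3 | 39.2 | 0.89 | 0.047 | 0.02 | 0.04 | - | - | - | SigW | 2/3 | 0.062 | ***
Ind 43 | 2 | 37.6 | - | - | 0.01 | 0.095 | - | - | - | DinR | 2/2 | 0.000 | ***
Ind 47 | 3 | 33.9 | - | - | - | - | Transport/bindi | 2/3 | 0.047 | - | - | - | -
Ind 49 | 2 | 30.2 | - | - | - | - | - | - | - | SigA | 2/2 | 0.065 | -
 | | | | | | | | | | TnrA | 2/2 | 0.000 |
Ind 50 | 2 | 29.9 | - | - | - | - | Adaptation | 2/2 | 0.001 | CtsR | 2/2 | 0.000 | ***
Ind 54 | 2 | 25.2 | - | - | - | - | - | - | - | CcpA | 2/2 | 0.050 | -
Ind 55 | 2 | 25.1 | - | - | - | - | - | - | - | SigB | 2/2 | 0.017 | -
Ind 56 | 2 | 25.0 | - | - | - | - | - | - | - | PurR | 2/2 | 0.004 | -
Ind 58 | 2 | 24.4 | - | - | - | - | Sporulation | 2/2 | 0.055 | - | - | - | -
Ind 63 | 2 | 23.1 | - | - | - | - | Sporulation | 2/2 | 0.055 | - | - | - | -
Ind 70 | 2 | 21.5 | - | - | 0.00 | 0.031 | - | - | - | AbrB | 2/2 | 0.014 | ***
Ind 71 | 2 | 21.5 | - | - | - | - | RNA-synthesis | 2/2 | 0.020 | - | - | - | -
Ind 77 | 2 | 20.4 | 0.98 | 0.040 | - | - | - | - | - | - | - | - | -
Ind 83 | 2 | 19.2 | - | - | 0.00 | 0.011 | - | - | - | - | - | - | -
Ind 84 | 2 | 19.1 | - | - | - | - | - | - | - | SigB | 2/2 | 0.017 | -
Ind 85 | 2 | 19.0 | 0.96 | 0.066 | - | - | - | - | - | - | - | - | -
Ind 87 | 2 | 18.9 | - | - | - | - | - | - | - | ComK | 2/2 | 0.007 | -
Ind 89 | 2 | 18.8 | - | - | - | - | Transport/bindi | 2/2 | 0.008 | YqhN | 2/2 | 0.000 | ***
Ind 90 | 2 | 18.7 | - | - | 0.00 | 0.037 | - | - | - | - | - | - | -
Ind 94 | 2 | 16.6 | 0.93 | 0.081 | - | - | - | - | - | - | - | - | -
Ind 96 | 2 | 14.9 | 0.92 | 0.083 | - | - | - | - | - | - | - | - | -
Joint 1 | 5 | 186.0 | - | - | - | - | Metabolism-nucs | 3/5 | 0.002 | PurR | 5/5 | 0.000 | ***
Joint 3 | 4 | 95.3 | - | - | 0.02 | 0.014 | - | - | - | - | - | - | -
Joint 4 | 4 | 92.9 | 0.87 | 0.026 | - | - | - | - | - | - | - | - | -
Joint 5 | 3 | 67.1 | - | - | - | - | Sporulation | 3/3 | 0.012 | - | - | - | -
Joint 6 | 3 | 65.4 | - | - | - | - | Cell-Wall | 2/3 | 0.021 | - | - | - | -
Joint 7 | 3 | 65.3 | 0.95 | 0.024 | - | - | - | - | - | - | - | - | -
Joint 10 | 3 | 63.8 | - | - | - | - | Transport/bindi | 2/3 | 0.042 | - | - | - | -
Joint 13 | 3 | 58.1 | - | - | 0.02 | 0.029 | - | - | - | - | - | - | -
Joint 15 | 2 | 40.7 | - | - | - | - | Transport/bindi | 2/2 | 0.007 | Zur | 2/2 | 0.000 | ***
Joint 17 | 2 | 35.4 | 0.99 | 0.031 | - | - | Sporulation | 2/2 | 0.054 | SigE | 2/2 | 0.062 | ***
Joint 20 | 2 | 33.3 | 1.00 | 0.022 | - | - | Sporulation | 2/2 | 0.054 | SigE | 2/2 | 0.062 | ***
Joint 21 | 2 | 32.9 | 0.96 | 0.066 | - | - | - | - | - | - | - | - | -
Joint 27 | 2 | 31.2 | - | - | 0.01 | 0.089 | - | - | - | - | - | - | -
Joint 28 | 2 | 30.3 | - | - | - | - | - | - | - | CcpA | 2/2 | 0.069 | -
Joint 31 | 2 | 28.9 | 0.96 | 0.071 | - | - | - | - | - | - | - | - | -
Joint 32 | 2 | 28.9 | - | - | - | - | RNA-synthesis | 2/2 | 0.021 | - | - | - | -
Joint 33 | 2 | 28.4 | - | - | 0.00 | 0.014 | - | - | - | - | - | - | -
Joint 39 | 2 | 26.1 | - | - | - | - | - | - | - | SigD | 2/2 | 0.001 | -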

In addition to cluster size, strength, consensus sequence, and number of sites

(|A|) in each significant cluster, the measure on which the cluster is significant is

also given. If either of the expression measures T or S is significant, then the value

of T or S is given, along with the p-value calculated as described in Section 7.3.1.

If the functional category over-representation is significant for a predicted cluster,

that functional category and the proportion of genes with that category are given,


along with the p-value. If the TF over-representation is significant, then the TF is

given, along with the proportion of genes in the cluster that are regulated by that

TF, and the p-value for the over-representation, as described in Section 7.3.1.

Of the independent block predicted clusters, 39 out of 97 (40 %) were significant

on at least one of the validation measures. The proportion of significant joint

block clusters was slightly higher, with 18 out of 40 (45 %) predicted clusters

being significant on at least one measure.

Several clusters are significant on multiple measures, in which case we are even

more confident that these clusters are biologically relevant. There were 10 out of

97 (10 %) of independent block clusters that were significant on multiple mea-

sures, and 4 out of 40 (10 %) of joint block clusters were significant on multiple

measures. These are indicated by a “***” symbol in Table 7.4.

It could perhaps be argued that since we are using multiple validation measures,

each with a significance threshold of α = 0.1, we might expect to find a high

number (up to 40%) of clusters to be significant simply by chance. However,

although the chance of a cluster being significant on at least one measure is high,

the chance of being significant on more than one measure is quite low.

Assuming independence between measures, the probability of being significant on any particular pair of measures is only 1%, and is only 0.1% for any particular set of three measures. As given above, we observe a much higher rate of significant clusters on

multiple measures than would be expected by chance, in both the independent

and joint block clustering partitions.
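The chance rates in this argument can be made concrete with a binomial calculation. The sketch below assumes four independent measures, each with a false positive rate of α = 0.1 under the null; the function name is illustrative, and real measures are unlikely to be exactly independent.

```python
from math import comb


def prob_at_least(k: int, m: int, alpha: float) -> float:
    """Probability that at least k of m independent tests are significant
    by chance when each has false positive rate alpha."""
    return sum(comb(m, j) * alpha ** j * (1 - alpha) ** (m - j)
               for j in range(k, m + 1))
```

Under these assumptions, `prob_at_least(1, 4, 0.1)` is about 0.34 and `prob_at_least(2, 4, 0.1)` is about 0.05, so the chance of significance on at least one measure is indeed high, while the observed 10% of clusters significant on multiple measures is roughly twice the chance rate.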

Examining Table 7.4, there does not appear to be a very strong relationship be-

tween the cluster Strength measure and the significance of the cluster. For both


the joint and independent block clusters, it seems that both low-ranking and

high-ranking clusters (in terms of Strength) appear in the table of significant clus-

ters.

7.5 Detailed Examination of Studyset Clusters

In Figure 7.7, we present a graphical representation of all clusters that share at

least two genes between the independent-block and joint-block results. This

graph was created using the GraphViz software package (Gansner and North,

1999).

Each independent-block cluster is represented by elliptical nodes for each gene

connected by a dark line to a diamond which represents the TF that regulates

that gene cluster. Each joint-block cluster is represented similarly, except that

light lines are used and the TF is represented by a rectangle. The TF nodes are

labelled with an “i” in the case of independent-block clusters, and “j” in the

case of joint-block clusters. For each TF node label, the consensus sequence for

that cluster is also given. If one of the clusters in the graph was significant on at least one of the validation measures, that node was given a double-lined border instead of a single-line border. The TF nodes are numbered in the same order as their cluster strength, e.g., iTF1 is the independent-block cluster with the highest

strength.

There are several types of interesting relationships summarized in Figure 7.7. We

see cases (e.g., jTF31 and iTF85, jTF40 and iTF72) where identical clusters

were predicted by both the independent and joint-block procedures. The first

of these cases is significant on the correlation-based gene expression measure S.

Figure 7.7: Graph of connected studyset clusters

Cluster node labels in the graph, with consensus sequences: iTF1 aAaaGGgG; iTF2 AatgTTCG; iTF3 CGaAcaTT; iTF4 AATcATTA; iTF5 caccTcCt; iTF10 TttCttca; iTF11 cttTTtTC; iTF18 ATtATAca; iTF19 tcctcaGC; iTF20 acTTttTT; iTF21 gtAaGgAG; iTF72 AATagtat; iTF76 TtAcaTga; iTF85 tagacgTT; iTF97 GccTagaC; jTF1 CGaAca--tgTtCG; jTF2 taatAa--AAAGGg; jTF3 AtgTcA--TtAcaT; jTF4 TttcTt--AaGgag; jTF5 ctcCTt--aaGgag; jTF7 ttTtTC--Aaaaac; jTF13 TTttTT--AtaCTt; jTF15 CGTAAT--tAttat; jTF19 ACcTcC--TcCatt; jTF21 GTaaAa--ATtATA; jTF24 aCgtTt--GtCtag; jTF29 ctgagC--gCaGaa; jTF31 ccgCta--gacgTT; jTF40 AATagt--aaAggg.

An additional pair of clusters (iTF18 and jTF21) are identical except for the

additional sigH gene in iTF18 and these two clusters are also significant on the

correlation-based gene expression measure S.

A particularly interesting result is that the genes in the strongest cluster (jTF1)

in the joint block set of clusters are identical with the set of genes in the second

(iTF2) and third (iTF3) strongest independent block clusters. In this case, the

same genes that clustered together based on a two-block upstream motif also


clustered together based on both blocks of this upstream motif separately. These

identical clusters were significant on multiple measures (over-representation of

functional categories and over-representation of a particular TF), providing strong

evidence that this cluster has biological relevance. The over-represented TF is

PurR, which was examined in Saxild et al. (2001) and found to bind each of the

genes in this cluster (purR, purE, ytiP, yumD, purA) and to have a two-block

motif with a highly conserved CGAA segment in the first block and TTCG in the

second block, which is verified by the consensus sequence for our joint-block

cluster: CGaAca--tgTtCG.
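As a small illustrative check, the two blocks of this consensus can be compared under plain Watson-Crick complementarity, ignoring the upper and lower case coding of conservation. The helper name `reverse_complement` is illustrative.

```python
# Translation table for the four DNA bases (upper case only).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA consensus string, returned in upper case."""
    return seq.upper().translate(_COMPLEMENT)[::-1]
```

Under this check the first block `CGaAca` has reverse complement `TGTTCG`, which is the second block read in upper case, and the iTF3 label `CGaAcaTT` in Figure 7.7 has reverse complement `AATGTTCG`, which is the iTF2 label `AatgTTCG` in upper case. This is consistent with the two blocks of the PurR motif being mirror images of each other on opposite strands.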

Not surprisingly, PurR is also known to be involved in the purine biosynthetic pathway in Bacillus subtilis (Saxild et al., 2001), which confirms the over-represented functional category, Nucleotide Metabolism. One of the PurR genes,

ytiP, is also clustered with the gene spoIIR in two identical clusters (iTF97 and

jTF24), although neither of these clusters is significant on any of the validation

measures.

According to the DBTBS database (Makita et al., 2004), spoIIR is regulated by the transcription factor σF, but it is unknown whether σF also regulates the gene ytiP. The

consensus sequence for σF is given in Table 4.5 to be GtaTaaa--tGgcaAtAcTa,

which does not match closely the motifs for either iTF97 or jTF24.

In several other cases, similar clusters were predicted by both the independent

and joint-block procedures, but some genes are only found in either the independent or joint block clusters. Examples of this are the relationships between iTF19 and jTF29 and between iTF4 and jTF15, where several more genes are present in the independent block cluster. jTF3 and iTF76 are an example where more genes are present in the joint block cluster.


This type of relationship might indicate that the additional independent block

genes are bound by a TF that has a motif which resembles a portion of the joint-

block motif but not the entire joint-block motif. For example, Table 4.5 gives the

consensus sequence of the σE motif as ttgtcaTattt--ttcATAtaatg and the

σK motif as gcACa--gcATAtgaTaa, which share a similar second block but a

very different first block.

Another explanation for this behaviour could be that the joint-block motif in

some of these cases is not a true two-block binding motif, but rather consists

of binding sites for two single-block motifs that occur in close proximity to one

another in each of the genes in the joint-block cluster. In this case, the additional

independent-block motifs would represent genes that are bound by only one of

those TFs, but not the other, and so only are included in the one-block indepen-

dent clustering but not the joint clustering.

Further case-by-case evidence would be needed to confirm these or other theo-

ries. In the case of the clusters iTF4 (containing the genes ycdH, yciC, yqkL and

dhbA) and jTF15 (containing just the genes ycdH and yciC), we have the addi-

tional validation that the two common genes are bound by the TF protein Zur.

Gaballa et al. (2002) analyze the Zur regulon and demonstrate that the genes yciC

and ycdH are bound by Zur, and are in fact the genes with the highest differential

expression ratios from gene-knockout microarray experiments. They describe the

Zur protein as a regulator of genes involved in zinc uptake, which confirms the

over-representation of Transport/Binding proteins in the joint-block cluster.

Gaballa et al. (2002) also present a 28-bp long consensus sequence for the Zur

binding motif AAttTAAATCGTAATcATTacGaTTTAa which was based on four

genes. They note that the central region of this consensus sequence TAATnATTA


is shared by two other transcription factors, PerR and Fur. Examining the consensus sequence of our iTF4 cluster, we see this same consensus sequence AATcATTA. This case seems to support the first of the two theories, i.e., the additional independent block genes (yqkL and dhbA) have binding motifs resembling

the central region of the Zur motif, but do not have the entire Zur motif. Accord-

ing to the DBTBS database (Makita et al., 2004), yqkL is bound by PerR and dhbA

is bound by Fur, which further confirms this theory.
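The overlap between the short consensus sequences and the 28-bp Zur consensus can be checked with a simple case-insensitive scan. The sketch treats `n` in the shorter pattern as a wildcard and ignores the conservation information carried by letter case; the function name is illustrative.

```python
from typing import List


def find_consensus(core: str, full: str, wildcard: str = "N") -> List[int]:
    """Return the 0-based offsets at which `core` matches inside `full`,
    comparing case-insensitively and letting the wildcard in `core`
    match any base."""
    core, full = core.upper(), full.upper()
    hits = []
    for offset in range(len(full) - len(core) + 1):
        window = full[offset:offset + len(core)]
        if all(c == wildcard or c == w for c, w in zip(core, window)):
            hits.append(offset)
    return hits
```

With the Zur consensus quoted above, the central region TAATnATTA matches at offset 11 and the one-block consensus AATcATTA matches at offset 12, i.e. it sits entirely inside that central region.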

Most of the common clusters in Figure 7.7 are not interconnected with each other,

but a single large sub-graph is present, connecting 9 different independent and

joint block clusters. Two of the TFs in this subgroup (iTF5 and jTF5) have signif-

icant over-representation of genes with Sporulation functions. Each of these two

clusters contains genes that are bound by either the σK TF or the σE TF, though neither of the clusters has significant over-representation of these TFs. The consensus sequence of iTF5 (caccTcCt) matches the first of the two blocks of the motif

for jTF5 (ctcCTt--aaGgag). Although these blocks do not match the known

motifs for σK TF or σE (given above), the second block of jTF5 does match the

known motif for the Ribosomal Binding Site, also known as the Shine-Dalgarno

sequence (Shine and Dalgarno, 1974), which is known to bind close to the binding

sites of σ TFs. It could be that the joint block motif jTF5 is actually a combination

of the second block of a two-block σ TF binding along with the ribosomal binding

site. Several of the other clusters in this large subgroup are significant on either

the variance-based or correlation-based expression measures, but do not show

over-representation of a particular function or TF.


7.6 Whole Genome Clustering Results

The resulting clustering trees, from both the independent-block and joint-block

strategies for the whole genome OGS dataset, were far too dense to be a useful

visual summary of the clustering results.

The best partition for each strategy, calculated according to Section 6.2, resulted

in 798 independent block clusters (containing 3369 out of 3466 independent block

motifs) and 407 joint block clusters (containing 1214 out of 1416 joint block mo-

tifs). Again, both best partitions were filtered to remove any “redundant” motifs

or motifs with individual clustering probabilities less than 0.75. clustering strate-

gies are given in the supplementary materials. A graphical representation of our

clustering procedure for the genome dataset is given in Figure 7.8.

After filtering the independent block best partition was reduced to 692 predicted

clusters containing 2480 motifs, while the joint block best partition was reduced

to 376 clusters containing 1097 motifs.

Figure 7.9 gives the distribution of cluster sizes for both the independent-block

and joint-block best partitions. Similar to our Studyset results, we see that the

joint-block clustering tends to produce a higher proportion of small clusters. The

average cluster size is 3.6 motifs for the independent-block best partition and 2.9

motifs for the joint-block best partition.

Both the independent block and joint block clusters were examined using the

four validation measures introduced in Section 7.3.1. Of the independent block

predicted clusters, 196 out of 692 (28 %) were significant on at least one of the

validation measures. The proportion of significant joint block clusters was slightly lower, with 104 out of 376 (28 %) predicted clusters being significant on at least one measure.

Figure 7.8: Flowchart for genome motif clustering procedure

Flowchart stages: 1516 orthologous genes; motif discovery yields 771 one-block motifs and 1443 two-block motifs.

Independent blocks: 3466 motifs; clustering gives 798 clusters; filtering leaves 692 clusters; evaluation finds 28% significant on at least one measure; further evaluation finds 6% significant on multiple measures.

Joined blocks: 1443 motifs; clustering gives 407 clusters; filtering leaves 376 clusters; evaluation finds 28% significant on at least one measure; further evaluation finds 5% significant on multiple measures.

Many of these clusters are significant on multiple measures, in which case we are
even more confident that these clusters are biologically relevant. There were 41
out of 692 (6 %) independent block clusters that were significant on multiple
measures, and 17 out of 376 (5 %) joint block clusters. Just as in the Studyset
results (Section 7.4), these multiple significance figures are much higher than would
be expected by chance. All clusters that were significant on multiple measures
are shown in Table 7.5.

The independent block clusters are given first, followed by the joint block clusters,
and each set is ordered by cluster strength. In addition to cluster size, strength,
consensus sequence, and number of sites (|A|) in each significant cluster, the
measure on which the cluster is significant is also given.

Figure 7.9: Distribution of cluster sizes for whole genome partition
[Two histogram panels, "Cluster Sizes, Best Partition, Genome Indblock" and "Cluster Sizes, Best Partition, Genome Jointblock", each with cluster sizes 2 to 16 on the horizontal axis.]

7.7 Detailed Examination of Whole Genome Clusters

All genes in our whole genome dataset were examined for relationships with

other genes within either the independent block or joint block clustering results,

as well as common relationships within both.


Table 7.5: Genome clusters significant on multiple measures

clus       size  str    S     p      T     p      func             num   p      TF    num   p
Ind 2      15    409.5  0.58  0.036  -     -      -                -     -      SigE  2/15  0.068
Ind 13     7     224.2  0.60  0.090  -     -      Transport/bindi  5/7   0.001  Fur   4/7   0.000
Ind 14     8     213.9  -     -      -     -      Metabolism-nucs  3/8   0.004  PurR  5/8   0.000
                                                                                DinR  2/8   0.001
Ind 19     9     193.8  -     -      -     -      Metabolism-carb  3/9   0.090  CcpA  2/9   0.061
                                                                                SigA  3/9   0.032
Ind 22     7     191.4  -     -      -     -      Metabolism-nucs  3/7   0.002  PurR  6/7   0.000
Ind 25     8     182.1  -     -      0.06  0.064  -                -     -      CcpA  3/8   0.001
Ind 27     6     179.8  0.70  0.033  -     -      Transport/bindi  3/6   0.067  SigA  2/6   0.055
                                                                                Fur   4/6   0.000
Ind 38     7     158.3  -     -      0.04  0.037  Protein-synthes  4/7   0.000  -     -     -
Ind 54     7     133.9  0.63  0.062  -     -      Detoxification   2/7   0.077  -     -     -
Ind 64     5     126.3  -     -      -     -      RNA-synthesis    3/5   0.013  RocR  2/5   0.000
Ind 86     4     113.1  -     -      -     -      Adaptation       2/4   0.007  CtsR  2/4   0.000
Ind 116    5     98.1   0.65  0.094  -     -      Membrane-bioene  2/5   0.021  ResD  2/5   0.001
Ind 117    5     97.9   0.69  0.060  -     -      Metabolism-aa    2/5   0.098  -     -     -
Ind 128    4     95.5   -     -      -     -      Metabolism-coen  2/4   0.009  SigA  2/4   0.023
Ind 130    5     95.1   -     -      -     -      Transport/bindi  3/5   0.027  SigG  2/5   0.004
Ind 147    5     89.7   -     -      0.02  0.036  Protein-synthes  2/5   0.031  -     -     -
Ind 157    4     85.2   -     -      0.03  0.097  Protein-synthes  2/4   0.019  -     -     -
Ind 166    4     82.2   -     -      0.01  0.006  RNA-synthesis    2/4   0.067  -     -     -
Ind 177    5     78.8   -     -      -     -      Sporulation      3/5   0.008  SigE  2/5   0.008
Ind 196    4     73.4   -     -      -     -      Sporulation      2/4   0.071  SigE  2/4   0.009
Ind 213    4     70.0   0.68  0.087  -     -      Sporulation      2/4   0.071  SigE  2/4   0.005
Ind 216    4     69.9   0.71  0.073  -     -      SimilartoBsub    3/4   0.019  -     -     -
Ind 237    4     66.9   0.79  0.025  0.02  0.042  -                -     -      -     -     -
Ind 239    4     66.8   -     -      0.02  0.037  RNA-synthesis    2/4   0.067  -     -     -
Ind 247    3     66.0   -     -      0.01  0.038  Metabolism-lipi  3/3   0.000  -     -     -
Ind 248    3     65.7   -     -      -     -      Metabolism-coen  2/3   0.003  SigA  2/3   0.012
Ind 276    3     60.5   0.99  0.001  -     -      Similartoother   2/3   0.080  -     -     -
Ind 287    4     56.4   0.67  0.089  0.02  0.021  -                -     -      -     -     -
Ind 290    3     55.2   -     -      -     -      Membrane-bioene  2/3   0.003  ResD  2/3   0.000
Ind 295    3     54.0   0.86  0.033  0.00  0      Protein-synthes  2/3   0.006  -     -     -
Ind 301    3     52.0   -     -      -     -      Transport/bindi  2/3   0.058  PurR  2/3   0.000
Ind 326    3     48.6   -     -      0.02  0.093  Similartoother   2/3   0.040  -     -     -
Ind 372    3     44.4   -     -      0.02  0.084  Transport/bindi  2/3   0.058  -     -     -
Ind 394    3     42.8   0.76  0.097  -     -      DNA-modificatio  2/3   0.001  -     -     -
Ind 407    3     42.1   0.78  0.087  -     -      NoSimilarity     2/3   0.012  -     -     -
Ind 424    3     39.9   -     -      0.02  0.074  Similartoother   2/3   0.080  -     -     -
Ind 435    2     38.1   0.92  0.057  0.00  0.015  -                -     -      DinR  2/2   0.000
Ind 439    3     37.7   -     -      0.01  0.022  Similartoother   2/3   0.080  -     -     -
Ind 456    2     28.9   0.93  0.048  -     -      Similartoother   2/2   0.015  -     -     -
Ind 480    2     26.0   -     -      0.00  0.04   Metabolism-lipi  2/2   0.001  -     -     -
Ind 621    2     20.5   0.89  0.080  0.01  0.069  -                -     -      -     -     -
Joint 3    8     300.9  -     -      -     -      Metabolism-nucs  3/8   0.003  PurR  6/8   0.000
Joint 42   4     109.4  -     -      -     -      Sporulation      2/4   0.071  SigE  2/4   0.005
Joint 85   3     82.2   -     -      -     -      Metabolism-carb  3/3   0.000  CcpA  2/3   0.003
                                                                                SigA  2/3   0.019
Joint 86   3     82.0   0.77  0.071  -     -      Metabolism-lipi  2/3   0.004  -     -     -
Joint 89   3     78.4   0.94  0.008  0.01  0.018  -                -     -      -     -     -
Joint 96   3     71.9   -     -      -     -      Transport/bindi  2/3   0.055  Fur   2/3   0.000
Joint 135  3     62.8   0.95  0.005  0.01  0.018  RNA-synthesis    2/3   0.036  -     -     -
Joint 147  3     60.3   0.75  0.083  0.02  0.078  -                -     -      -     -     -
Joint 154  2     58.4   -     -      -     -      Metabolism-coen  2/2   0.001  SigA  2/2   0.002
Joint 177  2     51.1   -     -      -     -      Membrane-bioene  2/2   0.001  ResD  2/2   0.000
Joint 182  2     41.4   -     -      0.01  0.095  Protein-synthes  2/2   0.001  -     -     -
Joint 191  2     37.6   -     -      0.01  0.059  RNA-synthesis    2/2   0.006  -     -     -
Joint 195  2     36.8   1.00  0.005  -     -      -                -     -      SigE  2/2   0.002
Joint 251  2     32.2   0.92  0.048  0.01  0.1    -                -     -      -     -     -
Joint 296  2     29.9   0.93  0.044  0.01  0.039  -                -     -      -     -     -
Joint 321  2     28.4   -     -      0.00  0.017  SimilartoBsub    2/2   0.018  -     -     -
Joint 341  2     27.2   -     -      0.01  0.035  Transport/bindi  2/2   0.010  -     -     -
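Entries such as "Fur 4/7" in the num columns of Table 7.5 record how many genes in a cluster share a functional category or known TF. The validation measures themselves are defined in Section 7.3.1 and are not restated here. Purely as an illustration of how an over-representation p-value of this kind can arise, the Python sketch below computes a hypergeometric tail probability. The choice of test, the background size N and the annotated count K are hypothetical assumptions, not values taken from this analysis.

```python
from math import comb


def hypergeom_tail(k, n, K, N):
    """P(X >= k) when n genes are drawn without replacement from N genes,
    K of which carry the annotation of interest."""
    upper = min(n, K)
    total = comb(N, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total


# Hypothetical example: a cluster of 7 genes with 4 annotated genes,
# drawn from a background of 1516 genes of which 20 carry the annotation.
p = hypergeom_tail(k=4, n=7, K=20, N=1516)
print(p)
```

With a rare annotation, even a handful of shared genes in a small cluster gives a very small tail probability, which is the general pattern seen in the TF columns of the table.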

A graph of every independent-block and joint-block cluster (with at least two common
genes) in the whole genome partition was too dense to be informative, so we
restricted ourselves to only those clusters that were significant on at least one of the
validation measures. Even with this restriction, the graph must be split into two
figures. Figure 7.10 gives a large set of interconnected clusters within the whole
genome best partition. The graph characteristics are the same as in Figure 7.7.

Figure 7.10: Graph of connected and significant whole genome clusters, part 1
[Cluster nodes: iTF3 (cccTCCtt), iTF13 (ATTaTCAt), iTF19 (AAggtGaa), iTF25 (TattaTaa), iTF27 (AtAATGAT), iTF86 (CTTTGACT), iTF116 (ttttcAcA), iTF290 (ttataTtT), jTF4 (CtcCtt--TtTTaT), jTF85 (TAtTaT--AggtGg), jTF96 (AATgAT--gAtaat), jTF177 (ttCACA--ataTtT), jTF188 (TTTGAC--AaaaTa).]
[Gene nodes: rpoA, ycgA, nasF, ydiG, yetN, ygaI, hemE, yhfP, yhjR, yrrS, yrrM, yutC, yvbU, katX, yfiY, yfiZ, yfhC, ykvW, dhbA, yxeB, fhuB, kbaA, adaA, yhxB, ykrU, acsA, acuA, ywnE, ywfM, ctsR, ydeC, ylaN, yrzC, ytkK, dra, yclN, ykuN, yrdQ, clpE, ykvI, dnaJ, yjbH, ctaA, ctaB, ylmC, vpr, yflL, yjbK, ysfA, yvqJ, ywjG, yluB, ycdH.]

Several of the other central clusters in this group (iTF19, iTF25, and jTF85)
show significant over-representation of genes bound by the transcription factor
CcpA, and two of these clusters (iTF19 and jTF85) are also over-represented for
genes bound by the SigA transcription factor. CcpA is involved in the catabolite
repression pathway (Kim and Chambliss, 1997), which was noted by Weickert and
Chambliss (1990) to also be linked to the Sporulation process in B. subtilis. SigA
encodes the primary sigma factor of RNA polymerase and so is a necessary
protein for any cell growth.

iTF19 has a consensus sequence of AAggtGaa, 2 out of 9 genes known to be

under the control of CcpA, and 3 out of 9 genes known to be under the control

of SigA. iTF25 has a consensus sequence of TattaTaa and 3 out of 8 genes

known to be under the control of CcpA. jTF85 has a consensus sequence of

TAtTaT--AggtGg, 2 out of 3 genes under the control of CcpA, and 2 out of 3

genes under the control of SigA.

The literature consensus sequence for SigA (Helmann, 1995) is TTGACA--TATAAT,
while the binding motif for CcpA, known as cre (catabolite response element), is
given by Weickert and Chambliss (1990) as TGTAAGCGTTAACA. Although
iTF25 (TattaTaa) and the first block of jTF85 (TAtTaT) seem to be a reasonable
match to the second block of the SigA motif (TATAAT), it is unclear whether iTF19
or the second block of jTF85 match the motifs for either SigA or CcpA, though these
two blocks certainly match each other. Since many other genes are included in the
iTF19 cluster, a possible explanation could be that iTF19 and the second block of jTF85
actually represent the ribosomal binding site, which normally would be in close
proximity to the second block of a Sigma factor motif. This same phenomenon
was postulated in the Studyset results (Section 7.4).
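The informal comparison of consensus sequences above can be made concrete with a simple ungapped sliding match. The Python sketch below is only an illustration under stated assumptions: the function name is hypothetical, and upper and lower case letters (which mark conservation strength in our consensus sequences) are treated as equivalent, which is a simplification and not the matching criterion used in this thesis.

```python
def best_ungapped_match(query, target):
    """Slide the shorter sequence along the longer one, ignoring case,
    and return (best number of matching positions, offset in the longer sequence)."""
    q, t = query.upper(), target.upper()
    short, long_ = (q, t) if len(q) <= len(t) else (t, q)
    best = (0, 0)
    for offset in range(len(long_) - len(short) + 1):
        matches = sum(a == b for a, b in zip(short, long_[offset:]))
        if matches > best[0]:
            best = (matches, offset)
    return best


# iTF25 against the second block of the SigA literature consensus.
print(best_ungapped_match("TattaTaa", "TATAAT"))
# iTF19 against the second block of jTF85.
print(best_ungapped_match("AAggtGaa", "AggtGg"))
```

Under this crude measure, iTF25 agrees with TATAAT at five of six positions, and iTF19 agrees with the second block of jTF85 at five of six positions, consistent with the observation that those two blocks match each other.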

Another notable group in Figure 7.10 is the set of three clusters iTF116, iTF290 and
jTF177 on the right side of the figure, all of which are over-represented for the
transcription factor ResD. ResD is a transcription factor that, along with ResE,
forms a signal-transduction system with an important role in cellular respiration
(Sun et al., 1996), which is consistent with the functional category Membrane bioenergetics
that is also significantly over-represented in these three clusters. Several genes
are present in this group of clusters (ydiG, ylmC, yjbH, vpr) which seem to have
one of the single block motifs but not the other.

It is also worth noting that the group of three clusters iTF13, iTF27 and jTF96

at the top of Figure 7.10 are all over-represented for the transcription factor Fur

and the iTF27 cluster is also over-represented for the TFs Zur and SigA. This

group of clusters is analogous to the group of Zur/Fur clusters found in the

Studyset, and shares several genes in common. In addition, we again see over-

representation of the Transport/Binding functional category, for all three of these

clusters.

The remainder of the connected and significant whole genome clusters not in-

cluded in Figure 7.10 are presented in Figure 7.11. Again, the graph characteris-

tics are the same as in Figure 7.7.

Examining the clusters of Figure 7.11, we again see several of the characteristics

noted in the Studyset analysis: many clusters have several genes in common, but

also several genes that only seem to have either the joint-block or independent-

block motif, but not both. For example, iTF128, iTF248, and jTF154 (near the

middle of the graph) are connected clusters that are all over-represented for SigA,
which is also mentioned in the Studyset results. Both independent-block motifs
resemble either block of the joint-block motif, but each also contains genes (yqfZ,

spoVB, ylxM) that have one of the single-block motifs, but not the other.

The most notable feature of Figure 7.11 is the PurR controlled clusters (iTF14,

iTF22, jTF3) at the bottom of the figure, which have the same set of genes (purR,

purE, ytiP, yumD, purA) present that was also found within the Studyset dataset

(Section 7.4), but now additional genes are also included in this subgraph. The
gene yebB is common across all three of these whole genome clusters, but was not
present in the Studyset clusters. There are also several genes that are included in
one of the PurR clusters but not the others: ydiA, yitG, recA, uvrB, hom, and
abrB.

Figure 7.11: Graph of connected and significant whole genome clusters, part 2
[Cluster nodes: iTF14 (CGAAcatt), iTF22 (AatgTTCG), iTF38 (AgGGaGga), iTF128 (CTTGaCat), iTF148 (gtgataac), iTF184 (AacTctCc), iTF206 (ctCCtttT), iTF247 (TTAgtAcC), iTF248 (TGTCaaGA), iTF254 (tCtctTTT), iTF435 (CatatGTT), jTF3 (CGaAcA--tgTtCG), jTF15 (CtccTT--AaaaaA), jTF19 (TCtcct--AaggGa), jTF53 (TGTTcg--tAtact), jTF86 (ctaaat--TTAgTA), jTF111 (CCTttt--tttgaA), jTF154 (TCTTGa--tCAAGA), jTF180 (CtCtTT--cgtttt), jTF182 (AGGGag--CCcTtt).]
[Gene nodes: recA, uvrB, purR, yebB, purE, yumD, hom, purA, abrB, ytiP, ileS, yfhJ, alaS, pheS, ytzA, tyrS, hag, spoVB, ylxM, nadB, nifS, yfhF, yfhG, yjbQ, yqkD, ywqE, yteI, ytdI, leuS, ald, spoVG, yjbC, ylbP, rho, yhfB, yjaX, yjbW, yqfZ, yheI, yqxD, phoA, rpsD, lexA, yneA, ydiA, yitG, yhxA, csaA, comEA, ysdC, adk, ydbE, yfiO, ypjB, malS, ytkP, msmR, sigW.]

Two of these genes, recA and uvrB, are also bound by the transcription factor
DinR (Makita et al., 2004), and in fact iTF14, which contains these two genes,
is over-represented in terms of genes controlled by both PurR and DinR. DinR is
involved (along with recA) in the regulation of the SOS response to DNA damage
in Bacillus subtilis (Winterling et al., 1997). DinR is also over-represented in the
connected clusters iTF435 and jTF53. Winterling et al. (1998) present
the DinR binding sequence as CGAACRNRYGTTYC, which seems to contain parts
of the motif from the iTF435 and jTF53 group (tGTT) as well as the iTF14 and
jTF3 clusters (CGAACA).
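The DinR binding sequence above uses IUPAC degenerate codes (R for A or G, Y for C or T, N for any base), so checking whether a fragment is "contained" in it means checking compatibility position by position. The Python sketch below illustrates this under stated assumptions: the function name is hypothetical, the code table covers the standard IUPAC nucleotide symbols, and case in the fragment is ignored.

```python
# Standard IUPAC nucleotide codes mapped to the bases they allow.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def compatible_offsets(fragment, consensus):
    """Return every offset at which the concrete fragment is compatible
    with the degenerate IUPAC consensus."""
    frag = fragment.upper()
    hits = []
    for offset in range(len(consensus) - len(frag) + 1):
        window = consensus[offset:offset + len(frag)]
        if all(base in IUPAC[code] for base, code in zip(frag, window)):
            hits.append(offset)
    return hits


dinr = "CGAACRNRYGTTYC"
print(compatible_offsets("CGAACA", dinr))  # fragment shared by iTF14 and jTF3
print(compatible_offsets("tGTT", dinr))    # fragment shared by iTF435 and jTF53
```

Both fragments are compatible with the published sequence: CGAACA at the start of the site, and tGTT in its second half, where the Y position allows a T.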


Chapter 8

Discussion and Future Work

Motif discovery is an important problem in computational biology because the

binding of transcription factors to upstream region motifs is crucial to the mech-

anism of gene regulation. In Chapter 2, we have presented various techniques

used in the past for motif discovery, a set of Bayesian models useful for devel-

oping motif-finding tools, and generalizations of these models that allow for un-

known motif width w and unknown motif abundance ratio p0. We have also dis-

cussed the use of scoring functions for motif finding. Viewing Bayesian models

in terms of scoring functions has provided insight into the similarities between the

full Bayesian model-based approaches and some non-Bayesian methods, such as

Consensus (Stormo and Hartzell, 1989).

We have introduced a scoring function formulation in Chapter 3, implemented

in the software BioOptimizer, designed to improve the prediction of regulatory

binding motifs. The advantage of scoring functions is that they give us an in-

tuitive means by which to compare different possible configurations of motif lo-

cations and can serve as a framework for the comparative use of several motif-

finding programs, thereby benefiting from the advantages that different motif-

finding programs may offer in different situations. This general approach of using
multiple methods to obtain different estimates of an unknown quantity that

are subsequently compared and improved can be useful beyond models for motif

discovery.

The usefulness of BioOptimizer was demonstrated in Chapter 4 by the uniformly
increased accuracy of predicted sites compared to BioProspector,
Consensus, MEME, and AlignACE. Although BioOptimizer is not guaranteed to

find a global best fit to our model, there is still a significant gain resulting from

its use with very little extra computational time. The best improvements were

obtained from the scoring functions that most closely approximated the posterior

distribution under our full Bayesian model.

BioOptimizer also allows for unknown motif abundance, unknown motif width,

and two-block motifs with variable-length gaps between the blocks. Allowing

the motif width to be inferred from the data has led to non-conventional results
when applied to datasets for the spo0A binding motif in B. subtilis and the CRP
binding motif in E. coli. The two-block version of BioOptimizer provided interesting
results when applied to the search for binding motifs for several σ-factors
in B. subtilis as well as the CRP binding motif. The optimal motif
width found by BioOptimizer was often substantially different from our a priori
expectations.

There are still many interesting open problems in this field. The vast majority

of motif-finding research has assumed that all information about the interaction

between transcription factors and their DNA binding motifs can be summarized

just by looking at the one-dimensional nucleotide sequence. Benos et al. (2002b)

and Benos et al. (2002a) discuss one-dimensional nucleotide models and conclude

that although their fit is not perfect, they do provide a very good approximation


to the true nature of protein-DNA interactions. However, in actuality this inter-

action is occurring in three-dimensional space, so ideally motif models should

incorporate characteristics of DNA morphology.

Keles et al. (2003) propose a supervised motif detection method, COMODE, that

takes into account structural information about the DNA-binding protein by con-

straining the motif search to be similar to previously known information content

profiles. As an example, in eukaryotic organisms, DNA is stored in the form

of tightly-compacted chromosomes where substantial portions of the DNA sequence
are wrapped around proteins called histones. This is important information
to include in future models, since portions of the sequence that are wrapped

around histones are less free to interact with DNA-binding proteins like tran-

scription factors.

In specific examples, where extra information about the distances between mo-

tif sites and the start of the coding region is available, this information should

be added into the model. McCue et al. (2001) demonstrate that incorporating a

model that takes into account the location of the motif site relative to the end of

each sequence can improve the sensitivity of the algorithm. As mentioned in Sec-

tion 2.5, the multiple motif models of Lawrence et al. (1993) take into account or-

dering information between motifs, but not spacing information. Extending our

scoring function framework to the multiple motif situation while incorporating

both ordering and spacing information between motifs (beyond the two-block

case already handled by BioOptimizer) may provide extra power to detect motifs

for multiple TFs that regulate the same target genes.

Another interesting problem is to establish a model-based approach for incor-

porating gene expression information, such as microarray results, into the motif


discovery problem. Bussemaker et al. (2001) and Keles et al. (2002) both propose

methods for integrating sequence analysis together with microarray information.

The MDscan program mentioned above gives one approach to this problem, since

the upstream regions that are examined for motifs are updated in an iterative

fashion, based on microarray information. A more recent method, Motif Regressor
(Conlon et al., 2003), directly uses the microarray expression values to help
screen out false positive findings of MDscan. However, model-based approaches
may still be desirable since these models may provide us with a principled way to tune
relevant parameters and guide us to achieve the optimal combination of the two

sources of information (i.e., genome sequences and microarray values).

A Bayesian hierarchical clustering model was introduced in Chapter 5 as a sta-

tistical approach to summarizing the common structure within a collection of

discovered motifs, with a Gibbs sampling implementation. This model has sev-

eral advantages over traditional clustering techniques, such as hierarchical tree

clustering or K-means clustering. The clustering decisions are systematic and
model-based, rather than based on ad hoc similarity measures. The number of clusters is

allowed to vary and does not have to be pre-specified. The hierarchical frame-

work allows us to account for variability in the observed units (motif matrices),

instead of assuming these units are fixed and known.

Another notable element of our clustering procedure is that our model very eas-

ily deals with the alignment issue that, within each raw motif matrix, it is not

obvious where the central motif is located. Our model allows us to condition

on the motif location in all other raw matrices within the dataset when we cal-

culate the most likely location of the motif within a particular matrix. In many

cases, other matrices may show very similar compositions to the matrix in question,
in which case the conditioning provides a substantial amount of information

pertaining to the motif location. Although we have presented our clustering pro-

cedure in the context of a specific application to motif matrices, these advantages

of our Bayesian clustering model are not specific to this particular type of data.

Bayesian hierarchical clustering models based on a Dirichlet process prior dis-

tribution should be considered an attractive approach, especially in cases where

the number of clusters is not known a priori. The model is easy to implement
using MCMC methods, which also allow for a full examination of the posterior
distribution instead of just focusing on a single point estimate.
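To illustrate why a Dirichlet process prior does not require the number of clusters to be fixed in advance, the Python sketch below draws partitions from the Chinese restaurant process, the sequential representation of that prior. This is a sketch of the prior only, not of the Gibbs sampler or the motif likelihood used in our clustering model; the function name, the concentration value alpha = 1.0 and the random seed are illustrative assumptions.

```python
import random


def sample_crp_partition(n_items, alpha, rng):
    """Draw cluster sizes for n_items from a Chinese restaurant process
    with concentration parameter alpha."""
    sizes = []
    for _ in range(n_items):
        # An item joins existing cluster k with weight sizes[k],
        # or opens a new cluster with weight alpha.
        weights = sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(sizes):
            sizes.append(1)
        else:
            sizes[k] += 1
    return sizes


rng = random.Random(0)
draws = [sample_crp_partition(116, 1.0, rng) for _ in range(5)]
for sizes in draws:
    print(len(sizes), "clusters:", sorted(sizes, reverse=True))
```

Each draw assigns all items to some cluster, but the number of clusters differs from draw to draw, which is the a priori behaviour that the posterior inference then updates using the data.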

Our motif clustering model was applied to a dataset of 116 TF binding motifs

in Chapter 6, and several approaches to analysing the clustering results were

discussed. Our posterior draws allowed us to summarize the variability of our

clustering results with a tree structure, as well as allowing us to estimate the best

partition of clusters. In addition to this best partition, we can calculate model-

based statistics to summarize the relative strength of our predicted clusters, as

well as observation-level probabilities for belonging to a particular cluster that

give us an indication of the variability in our point estimate. Two different clus-

tering priors, the Dirichlet process and the Uniform prior, were compared and

found to have quite different a priori clustering characteristics, but when applied

to our TF dataset did not show very different posterior results.

The clustering results that we observed suggest that the motifs within various

TF families can be organized into sub-groups based upon their tendency to clus-

ter together as a consequence of having very similar motifs. An area of future

research is to use the clustering information gained from a collection of motifs

to further improve subsequent motif discovery. A scoring function optimization
framework was presented in Chapter 3 based on a Bayesian motif discovery

model where very little is known a priori about the appearance of an unknown

motif. However, once a set of motifs has been discovered (and clustered), we

should incorporate this information into the motif discovery procedure. One proposal
would be to use the posterior predictive distribution from our motif clustering
model as the scoring function for motif discovery, which would increase

the ability of our motif-finding algorithms to detect a motif that is similar to mo-

tifs that have already been discovered elsewhere.

In Chapter 7, we combined our techniques for motif discovery and motif clustering
to predict co-regulated clusters of genes in the bacterium Bacillus subtilis.
We used the whole genome sequences of seven related bacterial species to discover
transcription factor binding motifs in the upstream regions of Bacillus subtilis
genes, and then used similarities between these discovered motifs to
group these genes into possibly co-regulated gene clusters. This procedure can be
regarded as a sequence-based gene clustering that complements gene clustering
procedures based on microarray gene expression experiments. Our framework
could also be useful for organisms for which no microarray chips are available, but
genome sequences from closely-related species are available.

Orthologous genes were identified between Bacillus subtilis and six other bacterial

species, and the upstream regulatory regions of these orthologous gene sets were

examined for elements that were possible transcription factor binding motifs con-

served by evolution, a technique often referred to as phylogenetic footprinting.

Our analysis focussed on two collections of these orthologous gene sets, the first

being a “Studyset” of gene sets for which some TF binding sites are known, and

the second being the “whole genome” of orthologous gene sets.


Our motif discovery strategy, as outlined in Chapters 3-4, was a combination
of the stochastic motif-finding program, BioProspector, and our deterministic
optimization algorithm, BioOptimizer. The discovered one and two-block motifs

from this procedure were then clustered using the Bayesian hierarchical cluster-

ing model presented in Chapters 5-6. Our strategy of separately clustering two-

block motifs as both independent blocks and joint blocks allowed us to examine

several interesting interactions between one and two-block motif clusters within

both our Studyset (Section 7.5) and whole genome (Section 7.7) datasets. Many

of these relationships are confirmed within the biological literature.

Beyond these detailed examinations, we also performed a systematic evaluation

of our clustering results based on several external measures available for our tar-

get organism, Bacillus subtilis. Each Studyset and whole genome predicted cluster

was examined for over-representation of a particular functional category, over-

representation of a particular known transcription factor, and two gene expression
statistics based on seven microarray experiments. The proportion of clusters
that were significant on multiple validation measures was much higher than
would be expected by chance in both the Studyset and whole genome datasets.

One aspect of this investigation that could be improved by further study is the in-

corporation of the concept of evolutionary distances into our motif discovery pro-

cedures. Each sequence within a particular orthologous gene set was weighted

equally with every other sequence by our motif-finding algorithms, despite the

fact that these sequences came from different species with unequal phylogenetic

distances between them. A more sophisticated motif discovery procedure which

incorporates this additional information may have increased power to detect

weaker motif signals.


Bibliography

Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman, D. J. (1990). Basic

local alignment search tool. Journal of Molecular Biology 215, 403–410.

Bailey, T. and Elkan, C. (1994). Fitting a mixture model by expectation maximiza-

tion to discover motifs in biopolymers. In Proceedings of the Second International

Conference on Intelligent Systems for Molecular Biology, 28–36, Menlo Park, Cali-

fornia. AAAI Press.

Benos, P., Lapedes, A., and Stormo, G. (2002a). Additivity in protein-DNA interactions:
how good an approximation is it? Nucleic Acids Research 30, 4442–4451.

Benos, P., Lapedes, A., and Stormo, G. (2002b). Probabilistic code for DNA recognition
by proteins of the EGR family. Journal of Molecular Biology 323, 701–727.

Benson, D., Karsch-Mizrachi, I., Lipman, D., Ostell, J., Rapp, B., and Wheeler, D.

(2002). GenBank. Nucleic Acids Research 30, 17–20.

Brazma, A., Jonassen, I., Vilo, J., and Ukkonen, E. (1998). Predicting gene regula-

tory elements in silico on a genomic scale. Genome Research 8, 1202–1215.

Britton, R., Eichenberger, P., Gonzalez-Pastor, J., Fawcett, P., Monson, R., Losick,

R., and Grossman, A. (2002). Genome-wide analysis of the stationary-phase

sigma factor (σH) regulon of Bacillus subtilis. Journal of Bacteriology 184, 4881–4890.


Bussemaker, H., Li, H., and Siggia, E. (2000). Building a dictionary for genomes:

identification of presumptive regulatory sites by statistical analysis. Proceedings

of the National Academy of Sciences (USA) 97, 10096–10100.

Bussemaker, H., Li, H., and Siggia, E. (2001). Regulatory element detection using

correlation with expression. Nature Genetics 27, 167–171.

Cardon, L. and Stormo, G. (1992). Expectation maximization algorithm for identifying
protein-binding sites with variable lengths from unaligned DNA fragments.
Journal of Molecular Biology 223, 159–170.

Castillo-Davis, C. and Hartl, D. (2003). GeneMerge – post-genomic analysis, data

mining, and hypothesis testing. Bioinformatics 19, 891–892.

Conlon, E., Eichenberger, P., and Liu, J. (2004). Determining and analyzing differentially
expressed genes from cDNA microarray experiments with complementary
designs. Journal of Multivariate Analysis.

Dempster, A., Laird, N., and Rubin, D. (1977). Maximum likelihood from incomplete
data via the EM algorithm. Journal of the Royal Statistical Society, B 39,
1–38.

Eichenberger, P., Fujita, M., Jensen, S., Conlon, E., Rudner, D., Wang, S., Ferguson,

C., Sato, T., Liu, J., and Losick, R. (2004). The entire program of gene expression for

a single differentiating cell type. PLoS Biology Accepted for publication.

Eichenberger, P., Jensen, S., Conlon, E., van Ooij, C., Silvaggi, J., Gonzalez-Pastor,

J., Fujita, M., Ben-Yehuda, S., Stragier, P., Liu, J., and Losick, R. (2003). The

σE regulon and the identification of additional sporulation genes in Bacillus

subtilis. Journal of Molecular Biology 327, 945–972.


Eisen, M., Spellman, P., Brown, P., and Botstein, D. (1998). Cluster analysis

and display of genome-wide expression patterns. Proceedings of the National

Academy of Sciences (USA) 95, 14863–14868.

Felsenstein, J. (1993). PHYLIP (phylogeny inference package) version 3.5c Dis-

tributed by the author. Department of Genetics, University of Washington,

Seattle.

Ferguson, T. (1974). Prior distributions on spaces of probability measures. Annals

of Statistics 2, 615–629.

Frith, M., Li, M., and Weng, Z. (2003). Cluster-Buster: Finding dense clusters of

motifs in DNA sequences. Nucleic Acids Research 31, 3666–3668.

Gaballa, A., Wang, T., Ye, R., and Helmann, J. (2002). Functional analysis of the

Bacillus subtilis Zur regulon. Journal of Bacteriology 184, 6508–6514.

Galas, D., Eggert, M., and Waterman, M. (1985). Rigorous pattern-recognition

methods for DNA sequences. Analysis of promoter sequences from Escherichia
coli. Journal of Molecular Biology 186, 117–128.

Gansner, E. and North, S. (1999). An open graph visualization system and its

applications to software engineering. Software – Practice and Experience 00, 1–

29.

Gelman, A., Carlin, J., Stern, H., and Rubin, D. (1995). Bayesian Data Analysis.

Chapman and Hall/CRC, Boca Raton, FL.

Gelman, A. and Rubin, D. B. (1992). Inference from iterative simulation using

multiple sequences. Statistical Science 7, 457–472.


Geman, S. and Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and

the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and

Machine Intelligence 6, 721–741.

Glaser, P., Frangeul, L., Buchrieser, C., Rusniok, C., Amend, A., Baquero, F.,

Berche, P., Bloecker, H., Brandt, P., Chakraborty, T., Charbit, A., Chetouani,

F., Couve, E., de Daruvar, A., Dehoux, P., Domann, E., Dominguez-Bernal,

G., Duchaud, E., Durant, L., Dussurget, O., Entian, K., Fsihi, H., Garcia-del Por-

tillo, F., Garrido, P., Gautier, L., Goebel, W., Gomez-Lopez, N., Hain, T., Hauf,

J., Jackson, D., Jones, L., Kaerst, U., Kreft, J., Kuhn, M., Kunst, F., Kurapkat,

G., Madueno, E., Maitournam, A., Vicente, J., Ng, E., Nedjari, H., Nordsiek,

G., Novella, S., de Pablos, B., Perez-Diaz, J., Purcell, R., Remmel, B., Rose, M.,

Schlueter, T., Simoes, N., Tierrez, A., Vazquez-Boland, J., Voss, H., Wehland, J.,

and Cossart, P. (2001). Comparative genomics of Listeria species. Science 294,

849–852.

Gordon, D., Nekludova, L., Gifford, D., Jaakkola, T., and Fraenkel, E. (2004).

Combining motif discovery algorithms with information from structural and

biochemical databases to understand transcriptional regulation. Submitted for

publication.

Green, P. and Richardson, S. (2001). Modelling heterogeneity with and without

the Dirichlet process. Scandinavian Journal of Statistics 28, 355–375.

Grundy, W., Bailey, T., and Elkan, C. (1996). ParaMEME: a parallel implementation

and a web interface for a DNA and protein motif discovery tool. Comput Appl

Biosci 12, 303–310.

Gupta, M. and Liu, J. (2003). Discovery of conserved sequence patterns using

a stochastic dictionary model. Journal of the American Statistical Association 98,

1–12.

Halberg, R. and Kroos, L. (1994). Sporulation regulatory protein SpoIIID from

Bacillus subtilis activates and represses transcription by both mother-cell-

specific forms of RNA polymerase. Journal of Molecular Biology 243, 425–436.

Hampson, S., Baldi, P., Kibler, D., and Sandmeyer, S. (2000). Analysis of yeast’s

ORF upstream regions by parallel processing, microarrays, and computational

methods. In Proc. Int. Conf. Intell. Syst. Mol. Biol., vol. 8, 190–201.

Hartigan, J. (1975). Clustering algorithms. Wiley, New York, NY.

Helmann, J. (1995). Compilation and analysis of Bacillus subtilis σA-dependent

promoter sequences: evidence for extended contact between RNA polymerase

and upstream promoter DNA. Nucleic Acids Research 23, 2351–2360.

Helmann, J. and Moran Jr., C. (2002). RNA polymerase and sigma factors. In

A. Sonenshein, J. Hoch, and R. Losick, eds., Bacillus subtilis and its closest rela-

tives. ASM Press, Washington, D.C.

Hertz, G. and Stormo, G. (1999). Identifying DNA and protein patterns with statistically

significant alignments of multiple sequences. Bioinformatics 15, 563–577.

IUPAC (1986). Nomenclature for incompletely specified bases in nucleic acid se-

quences. Recommendations 1984. Proceedings of the National Academy of Sciences

(USA) 83, 4–8.

Jensen, S., Liu, X., Zhou, Q., and Liu, J. (2004). Computational discovery of gene

regulatory binding motifs: a Bayesian perspective. Statistical Science 19, 188–

204.

Kass, R. and Raftery, A. (1995). Bayes factors. Journal of the American Statistical

Association 90, 773–795.

Keich, U. and Pevzner, P. (2002). Finding motifs in the twilight zone. Bioinformat-

ics 18, 1374–1381.

Keles, S., van der Laan, M., Dudoit, S., Xing, B., and Eisen, M. (2003). Supervised

detection of regulatory motifs in DNA sequences. Paper 131, U.C. Berkeley

Division of Biostatistics.

Keles, S., van der Laan, M., and Eisen, M. (2002). Identification of regulatory

elements using a feature selection method. Bioinformatics 18, 1167–1175.

Kim, J.-H. and Chambliss, G. (1997). Contacts between Bacillus subtilis catabolite

regulatory protein CcpA and amyO target site. Nucleic Acids Research 25, 3490–

3496.

Kirkpatrick, S., Gelatt, C., and Vecchi, M. (1983). Optimization by simulated an-

nealing. Science 220, 671–680.

Kullback, S. and Leibler, R. (1951). On information and sufficiency. Ann. Math.

Stat. 22, 79–86.

Kunst, F., Ogasawara, N., Moszer, I., Albertini, A. M., Alloni, G., Azevedo, V.,

Bertero, M. G., Bessieres, P., Bolotin, A., Borchert, S., Borriss, R., Boursier, L.,

Brans, A., Braun, M., Brignell, S. C., Bron, S., Brouillet, S., Bruschi, C. V., Cald-

well, B., Capuano, V., Carter, N. M., Choi, S.-K., Codani, J.-J., Connerton, I. F.,

Cummings, N. J., Daniel, R. A., Denizot, F., Devine, K. M., Dusterhoft, A.,

Ehrlich, S. D., Emmerson, P. T., Entian, K. D., Errington, J., Fabret, C., Ferrari,

E., Foulger, D., Fritz, C., Fujita, M., Fujita, Y., Fuma, S., Galizzi, A., Galleron,

N., Ghim, S.-Y., Glaser, P., Goffeau, A., Golightly, E. J., Grandi, G., Guiseppi,

G., Guy, B. J., Haga, K., Haiech, J., Harwood, C. R., Henaut, A., Hilbert, H.,

Holsappel, S., Hosono, S., Hullo, M.-F., Itaya, M., Jones, L., Joris, B., Kara-

mata, D., Kasahara, Y., Klaerr-Blanchard, M., Klein, C., Kobayashi, Y., Koetter,

P., Koningstein, G., Krogh, S., Kumano, M., Kurita, K., Lapidus, A., Lardinois,

S., Lauber, J., Lazarevic, V., Lee, S.-M., Levine, A., Liu, H., Masuda, S., Maul,

C., Médigue, C., Medina, N., Mellado, R. P., Mizuno, M., Moestl, D., Nakai,

S., Noback, M., Noone, D., O’Reilly, M., Ogawa, K., Ogiwara, A., Oudega, B.,

Park, S.-H., Parro, V., Pohl, T. M., Portetelle, D., Porwollik, S., Prescott, A. M.,

Presecan, E., Pujic, P., Purnelle, B., Rapoport, G., Rey, M., Reynolds, S., Rieger,

M., Rivolta, C., Rocha, E., Roche, B., Rose, M., Sadaie, Y., Sato, T., Scanlan,

E., Schleich, S., Schroeter, R., Scoffone, F., Sekiguchi, J., Sekowska, A., Seror,

S. J., Serror, P., Shin, B.-S., Soldo, B., Sorokin, A., Tacconi, E., Takagi, T., Taka-

hashi, H., Takemaru, K., Takeuchi, M., Tamakoshi, A., Tanaka, T., Terpstra, P.,

Tognoni, A., Tosato, V., Uchiyama, S., Vandelbol, M., Vannier, F., Vassarotti,

A., Viari, A., Wambutt, R., Wedler, E., Wedler, H., Weitzenegger, T., Winters,

P., Wipat, A., Yamamoto, H., Yamane, K., Yasumoto, K., Yata, K., Yoshida, K.,

Yoshikawa, H.-F., Zumstein, E., Yoshikawa, H., and Danchin, A. (1997). The

complete genome sequence of the gram-positive bacterium Bacillus subtilis.

Nature 390, 249–256.

Lawrence, C., Altschul, S., Boguski, M., Liu, J., Neuwald, A., and Wootton, J.

(1993). Detecting subtle sequence signals: a Gibbs sampling strategy for multi-

ple alignment. Science 262, 208–214.

Lawrence, C. and Reilly, A. (1990). An expectation maximization (EM) algorithm

for the identification and characterization of common sites in unaligned

biopolymer sequences. Proteins 7, 41–51.

Liu, J. (1994). The collapsed Gibbs sampler in Bayesian computations with applications

to a gene regulation problem. Journal of the American Statistical Association

89, 958–966.

Liu, J. (1996). Nonparametric hierarchical Bayes via sequential imputations. An-

nals of Statistics 24, 911–930.

Liu, J., Neuwald, A., and Lawrence, C. (1995). Bayesian models for multiple

local sequence alignment and Gibbs sampling strategies. Journal of the American

Statistical Association 90, 1156–1170.

Liu, J., Neuwald, A., and Lawrence, C. (1999). Markovian structures in biological

sequence alignments. Journal of the American Statistical Association 94, 1–15.

Liu, X., Brutlag, D., and Liu, J. (2001). BioProspector: discovering conserved DNA

motifs in upstream regulatory regions of co-expressed genes. Pacific Symposium

on Biocomputing 6, 127–138.

Liu, X., Brutlag, D., and Liu, J. (2002). An algorithm for finding protein-DNA interaction

sites with applications to chromatin immunoprecipitation microarray

experiments. Nature Biotechnology 20, 835–839.

Lodish, H., Baltimore, D., Berk, A., Zipursky, S., Matsudaira, P., and Darnell,

J. (1995). Regulation of transcription initiation. In Molecular Cell Biology, 405–

481. Scientific American Books, Inc., 4th edn.

Makita, Y., Nakao, M., Ogasawara, N., and Nakai, K. (2004). DBTBS: database of

transcriptional regulation in Bacillus subtilis and its contribution to comparative

genomics. Nucleic Acids Research 32, 75–77.

McCue, L., Thompson, W., Carmack, C., Ryan, M., Liu, J., Derbyshire, V., and

Lawrence, C. E. (2001). Phylogenetic footprinting of transcription factor binding sites

in proteobacterial genomes. Nucleic Acids Research 29, 774–782.

Medvedovic, M. and Sivaganesan, S. (2002). Bayesian infinite mixture models

based clustering of gene expression profiles. Bioinformatics 18, 1194–1206.

Metropolis, N., Rosenbluth, A., Rosenbluth, M., Teller, A., and Teller, E. (1953).

Equation of state calculations by fast computing machines. Journal of Chemical

Physics 21, 1087–1092.

Metropolis, N. and Ulam, S. (1949). The Monte Carlo method. Journal of the

American Statistical Association 44, 335–341.

Molle, V., Fujita, M., Jensen, S., Liu, J., and Losick, R. (2003). The Spo0A regulon

in Bacillus subtilis. Molecular Microbiology 50, 1683–1701.

Moszer, I., Glaser, P., and Danchin, A. (1995). SubtiList: a relational database for

the Bacillus subtilis genome. Microbiology 141, 261–268.

Nolling, J., Breton, G., Omelchenko, M., Makarova, K., Zeng, Q., Gibson, R., Lee,

H., Dubois, J., Qiu, D., Hitti, J., Wolf, Y., Tatusov, R., Sabathe, F., Doucette-

Stamm, L., Soucaille, P., Daly, M., Bennett, G., Koonin, E., and Smith, D.

(2001). Genome sequence and comparative analysis of the solvent-producing

bacterium Clostridium acetobutylicum. Journal of Bacteriology 183, 4823–4838.

Pfahl, M. (1981). Characteristics of tight binding repressors of the lac operon.

Journal of Molecular Biology 147, 1–10.

Qin, Z. S., McCue, L. A., Thompson, W., Mayerhofer, L., Lawrence, C. E., and Liu,

J. S. (2003). Identification of co-regulated genes through Bayesian clustering of

predicted regulatory binding sites. Nature Biotechnology 21, 435–439.

Read, T., Peterson, S., Tourasse, N., Baillie, L., Paulsen, I., Nelson, K., Tettelin, H.,

Fouts, D., Eisen, J., Gill, S., Holtzapple, E., Okstad, O., Helgason, E., Rilstone,

J., Wu, M., Kolonay, J., Beanan, M., Dodson, R., Brinkac, L., Gwinn, M., DeBoy,

R., Madpu, R., Daugherty, S., Durkin, A., Haft, D., Nelson, W., Peterson, J.,

Pop, M., Khouri, H., Radune, D., Benton, J., Mahamoud, Y., Jiang, L., Hance,

I., Weidman, J., Berry, K., Plaut, R., Wolf, A., Watkins, K., Nierman, W., Hazen,

A., Cline, R., Redmond, C., Thwaite, J., White, O., Salzberg, S., Thomason, B.,

Friedlander, A., Koehler, T., Hanna, P., Kolsto, A., and Fraser, C. (2003). The

genome sequence of Bacillus anthracis Ames and comparison to closely related

bacteria. Nature 423, 81–86.

Remm, M., Storm, C., and Sonnhammer, E. (2001). Automatic clustering of or-

thologs and in-paralogs from pairwise species comparisons. Journal of Molecu-

lar Biology 314, 1041–1052.

Roth, F., Hughes, J., Estep, P., and Church, G. (1998). Finding DNA regulatory motifs

within unaligned non-coding sequences clustered by whole-genome mRNA

quantitation. Nature Biotechnology 16, 939–945.

Saxild, H., Brunstedt, K., Nielsen, K., Jarmer, H., and Nygaard, P. (2001). Def-

inition of the Bacillus subtilis PurR operator using genetic and bioinformatic

tools and expansion of the PurR regulon with glyA, guaC, pbuG, xpt-pbuX,

yqhZ-folD, and pbuO. Journal of Bacteriology 183, 6175–6183.

Schena, M., Shalon, D., Davis, R., and Brown, P. (1995). Quantitative monitoring

of gene expression patterns with a complementary DNA microarray. Science 270,

467–470.

Schneider, T. D. and Stephens, R. M. (1990). Sequence logos: A new way to dis-

play consensus sequences. Nucleic Acids Research 18, 6097–6100.

Shimizu, T., Ohtani, K., Hirakawa, H., Ohshima, K., Yamashita, A., Shiba, T., Oga-

sawara, N., Hattori, M., Kuhara, S., and Hayashi, H. (2002). Complete genome

sequence of Clostridium perfringens, an anaerobic flesh-eater. Proceedings of the

National Academy of Sciences (USA) 99, 996–1001.

Shine, J. and Dalgarno, L. (1974). The 3′-terminal sequence of Escherichia coli

16S ribosomal RNA: complementarity to nonsense triplets and ribosome binding

sites. Proceedings of the National Academy of Sciences (USA) 71, 1342–1346.

Sinha, S. and Tompa, M. (2000). A statistical method for finding transcription

factor binding sites. In Proc. Int. Conf. Intell. Syst. Mol. Biol., vol. 8, 344–354.

Stirling, J. (1730). Methodus differentialis. William Bowyer, London.

Stormo, G. and Hartzell, G. (1989). Identifying protein-binding sites from unaligned

DNA fragments. Proceedings of the National Academy of Sciences (USA) 86,

1183–1187.

Strauch, M., Webb, V., Spiegelman, G., and Hoch, J. (1990). The Spo0A protein

of Bacillus subtilis is a repressor of the abrB gene. Proceedings of the National

Academy of Sciences (USA) 87, 1801–1805.

Sun, G., Sharkova, E., Chesnut, R., Birkey, S., Duggan, M., Sorokin, A., Pujic, P.,

Ehrlich, S., and Hulett, F. (1996). Regulators of aerobic and anaerobic respira-

tion in Bacillus subtilis. Journal of Bacteriology 178, 1374–1385.

Takami, H., Nakasone, K., Takaki, Y., Maeno, G., Sasaki, R., Masui, N., Fuji, F., Hi-

rama, C., Nakamura, Y., Ogasawara, N., Kuhara, S., and Horikoshi, K. (2000).

Complete genome sequence of the alkaliphilic bacterium Bacillus halodurans

and genomic sequence comparison with Bacillus subtilis. Nucleic Acids Research

28, 4317–4331.

Takami, H., Takaki, Y., and Uchiyama, I. (2002). Genome sequence of Oceanobacillus

iheyensis isolated from the Iheya Ridge and its unexpected adaptive capabilities

to extreme environments. Nucleic Acids Research 30, 3927–3935.

Tanner, M. and Wong, W. (1987). The calculation of posterior distributions by

data augmentation. Journal of the American Statistical Association 82, 528–550.

Thompson, J., Higgins, D., and Gibson, T. (1994). CLUSTAL W: improving

the sensitivity of progressive multiple sequence alignment through sequence

weighting, position-specific gap penalties and weight matrix choice. Nucleic

Acids Research 22, 4673–4680.

Tseng, G., Oh, M.-K., Rohlin, L., Liao, J., and Wong, W. (2001). Issues in cDNA microar-

ray analysis: quality filtering, channel normalization, models of variations and

assessment of gene effects. Nucleic Acids Research 29, 2549–2557.

van Helden, J., Andre, B., and Collado-Vides, J. (1998). Extracting regulatory

sites from the upstream region of yeast genes by computational analysis of

oligonucleotide frequencies. Journal of Molecular Biology 281, 827–842.

van Helden, J., Rios, A., and Collado-Vides, J. (2000). Discovering regulatory

elements in non-coding sequences by analysis of spaced dyads. Nucleic Acids

Research 28, 1808–1818.

Velculescu, V., Zhang, L., Vogelstein, B., and Kinzler, K. (1995). Serial analysis of

gene expression. Science 270, 484–487.

Wang, T. and Stormo, G. (2003). Combining phylogenetic data with co-regulated

genes to identify regulatory motifs. Bioinformatics 19, 2369–2380.

Weickert, M. and Chambliss, G. (1990). Site-directed mutagenesis of a catabolite

repression operator sequence in Bacillus subtilis. Proceedings of the National

Academy of Sciences (USA) 87, 6238–6242.

Werner, T. (1999). Models for prediction and recognition of eukaryotic promoters.

Mamm. Genome 10, 168–175.

Winterling, K., Chafin, D., Hayes, J. J., Sun, J., Levine, A., Yasbin, R., and

Woodgate, R. (1998). The Bacillus subtilis DinR binding site: Redefinition of

the consensus sequence. Journal of Bacteriology 180, 2201–2211.

Winterling, K., Levine, A., Yasbin, R., and Woodgate, R. (1997). Characterization

of DinR, the Bacillus subtilis SOS repressor. Journal of Bacteriology 179, 1698–

1703.

Xing, E., Wu, W., Jordan, M., and Karp, R. (2003). LOGOS: a modular Bayesian

model for de novo motif detection. In IEEE Computer Society Bioinformatics Con-

ference, CSB2003.
