MTAP: The Motif Tool Assessment Platform | SpringerLink

BioMed CentralBMC Bioinformatics

ss
Open AcceProceedingsMTAP: The Motif Tool Assessment PlatformDaniel Quest*1,2, Kathryn Dempsey1, Mohammad Shafiullah1, Dhundy Bastola1 and Hesham Ali1,2
Address: 1College of Information Science & Technology, University of Nebraska at Omaha, Omaha NE, USA and 2Department of Pathology and Microbiology, University of Nebraska Medical Center, Omaha NE. 68182-0694, USA

Email: Daniel Quest* - [email protected]; Kathryn Dempsey - [email protected]; Mohammad Shafiullah - [email protected]; Dhundy Bastola - [email protected]; Hesham Ali - [email protected]

* Corresponding author

AbstractBackground: In recent years, substantial effort has been applied to de novo regulatory motifdiscovery. At this time, more than 150 software tools exist to detect regulatory binding sites givena set of genomic sequences. As the number of software packages increases, it becomes moreimportant to identify the tools with the best performance characteristics for specific problemdomains. Identifying the correct tool is difficult because of the great variability in motif detectionsoftware. Consequently, many labs spend considerable effort testing methods to find one thatworks well in their problem of interest.

Results: In this work, we propose a method (MTAP) that substantially reduces the effort requiredto assess de novo regulatory motif discovery software. MTAP differs from previous attempts atregulatory motif assessment in that it automates motif discovery tool pipelines (something thattraditionally required many manual steps), automatically constructs orthologous upstreamsequences, and provides automated benchmarks for many popular tools. As a proof of concept, wehave run benchmarks over human, mouse, fly, yeast, E. coli and B. subtilis.

Conclusion: MTAP presents a new approach to the challenging problem of assessing regulatorymotif discovery methods. The most current version of MTAP can be downloaded from http://biobase.ist.unomaha.edu/

BackgroundThe regulation of gene expression in the cell is controlledby regulatory proteins and functional RNAs that interactspecifically with binding locations on the DNA. In thepast century, molecular biologists have developed many

innovative techniques to identify sites on the DNA thatare bound and regulated by functional RNA and proteins.As the number of sequenced organisms increases, tradi-tional techniques such as gel shift assays in combinationwith DNAse foot-printing assays can not keep pace with

from Fifth Annual MCBIOS Conference. Systems Biology: Bridging the OmicsOklahoma City, OK, USA. 23–24 February 2008

Published: 12 August 2008

BMC Bioinformatics 2008, 9(Suppl 9):S6 doi:10.1186/1471-2105-9-S9-S6

<supplement> <title> <p>Proceedings of the Fifth Annual MCBIOS Conference. Systems Biology: Bridging the Omics</p> </title> <editor>Jonathan D Wren (Senior Editor), Yuriy Gusev, Dawn Wilkins, Susan Bridges, Stephen Winters-Hilt and James Fuscoe</editor> <note>Proceedings</note> </supplement>

This article is available from: http://www.biomedcentral.com/1471-2105/9/S9/S6

© 2008 Quest et al; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Page 1 of 21(page number not for citation purposes)

http://www.biomedcentral.com/1471-2105/9/S9/S6

http://creativecommons.org/licenses/by/2.0

http://biobase.ist.unomaha.edu/

http://biobase.ist.unomaha.edu/

http://www.biomedcentral.com/

http://www.biomedcentral.com/info/about/charter/

BMC Bioinformatics 2008, 9(Suppl 9):S6 http://www.biomedcentral.com/1471-2105/9/S9/S6

the explosion of available sequences. While accurate, tra-ditional methods require a great deal of time and exper-tise. Traditional methods also require that the bindingenergy of the protein or RNA is great enough to maintaincontact though the course of the assay. Reductionistapproaches can be problematic when faced with non-spe-cific binding sites or with sites that require the formationof a complex with adjacent units in order to function. It isfor these reasons that complementary approaches on thecomputer have been developed in recent years to increasesuccess in discovering and annotating cis-regulatory mod-ules.

Computational discovery of functional subsequences hasbeen around for some time. In 1984, Galas, Eggert, andWaterman proposed a method for finding and character-izing binding positions for the σ 70 protein in E. coli pro-moter sequences [1]. In its most basic form, the problemis identifying sequence signatures, or motifs, that exist ina set of sequences that share the same property (e.g. pro-moters bound by protein A), but does not exist in a set ofsimilar sequences that does not have the same property(e.g. promoters not bound by protein A). Computationalapproaches have the advantage that they can reduce theguess work and cost associated with pure biochemicalapproaches. Consequently, many computational meth-ods such as Gibbs [2,3], MEME [4], Consensus [5,6], Bio-Prospector [7], AlignAce [8], Ann-Spec [9], MTAP: TheMotif Tool Assessment Platform.

Glam [10], and Weeder [11,12] have been developed inrecent years. Each of these methods employs differentalgorithmic insights and scoring functions. The scoringfunction and algorithm parameters impact the representa-tion and rank of the motifs discovered by the algorithm.In recent years, it has become clear that each of these com-putational methods has distinct advantages and that thebest performance is achieved by trying multiple methodsover the same data set. The bioinformatics community iscurrently seeking to simplify this situation by providingapproaches that integrate multiple methods into one toolsuch as BEST [13] and EMD [14]. Approaches that inte-grate additional sources of information have also emergedin recent years. For example, approaches such as PhyME[15], PhyloGibbs [16], and WeederH [17] integratesequences from regulatory regions of related organisms.Other approaches, such as REDUCE [18], integrate expres-sion values from gene expression arrays. In addition, theadvent of high density arrays and ChIP-chip technologyhas necessitated methods that integrate genome widebinding data [19].

While there are many advantages in integrating additionalsources of information, the steps required also add addi-tional complexity and cost. Each step along the integra-

tion pipeline opens a new question: what is the best wayto represent and integrate this new data? One has to won-der if additional data is always better in practice. It couldbe that for some regulatory modules, the additional dataalso includes additional noise making it more difficult torecover the binding sites. To make matters more con-founding, there currently exist more than 150 methodswith thousands of possible pre- and post-processing stepsand alternative runtime procedures (we have compiled alist of many popular methods here: http://biobase.ist.unomaha.edu). It is clear that benchmarking tech-nology is needed to map cis-regulatory motif discoverymethods to data where the method has the best perform-ance. In other words, given a method M and a set of co-regulated genes with regulatory sequences T = {t1, t2,...,tn}, find a mapping M → T with expected performanceover some threshold. A T that has no method over thethreshold is particularly interesting in that such data cantell us more about the current limitations of regulatorymotif discovery programs so that we can proposeimprovements.

In recent years, researchers have begun to solve this prob-lem by creating benchmarking datasets. In 2004, Tompaet al. published the first assessment of 13 regulatory motifdiscovery algorithms over Fly, Human, Mouse and Yeast[20]. This work was seminal in that it provided methodsfor comparing regulatory motif detection software andthat it benchmarked a large number of popular tools [13].In 2007, Sandve et al. improved benchmarking technol-ogy over the Tompa dataset using a machine learningapproach [21]. While these advances opened the impor-tant debate on how to benchmark algorithm perform-ance, it remains to be seen if a selection of so fewregulatory binding sites is large enough to form a repre-sentative set (The assessment included 8 sites from Saccha-romyces cerevisiae, 6 sites from Drosophila melanogaster, 12sites from Mus musculus, and 26 sites from Homo sapiens).In response, Klepper et al. proposed a larger test set andused it to evaluate several composite motif discovery tools[22]. This is an important first step in making benchmark-ing datasets more comprehensive. To date, the most com-prehensive test set was over the entire E. coli genome [23]using known transcription factors found in RegulonDB[24]. While bacterial transcription regulation is very dif-ferent than regulation in eukaryotes, the density of theRegulonDB annotation is greater than for any otherorganism and this coverage provides us with uniquebenchmarking opportunities. Unfortunately, this assess-ment only considered 5 tools and neglected to include thetool most successful in the Tompa Benchmark (i.e.Weeder). To date, there are no tools that can assess algo-rithms that incorporate additional sources of data such assequences from phylogentically related species.


http://biobase.ist.unomaha.edu

http://biobase.ist.unomaha.edu


Currently, cis-regulatory motif tools are assessed using thestatistics and methods from gene prediction. However, insilico gene annotation differs from cis-regulatory moduleannotation in that in gene prediction there exists anmRNA library for reference. Unlike regulatory bindingassays, mRNA library sequencing is high throughput andthus provides a large and diverse benchmarking datasetfor gene prediction tools. Consequently, gene predictionand cis-regulatory module prediction are very differentproblems. We currently have very few completely anno-tated regulatory modules and therefore motif predictiontools have very few cases to train. In addition, regulatorymodules are thought to evolve at a much faster rate thanthe genes they regulate [25]. If this theory is true, thediversity of regulatory regions is expected to be far greaterthan the coding regions they regulate. This means that theparameters, data, transcription module representationand algorithm are all important factors in evaluatingmotif detection tools.

In this work, we expand upon our parallel architecture forregulatory motif prediction [26] and propose a methodfor evaluation and discovery of data algorithm mappings.MTAP is a platform that allows one to vary the way scien-tists process data and algorithm parameters to fine tunethe representation of a regulatory module in method M.The goal is to discover the best possible mapping of M →D. Our platform also integrates phylogentically relatedregulatory sequences via computing downstreamorthologs from other species. It is clear that phylogenticfootprinting has great potential, however assessing meth-ods that incorporate sequence from closely related species

is not practical or accurate without automation (mainlybecause of the rapid availability of new organisms: as neworganism sequences become available, benchmarks needto change to take into account the new information). Forthese reasons, there is a place for adaptive and automaticcis-regulatory motif prediction and benchmarking.

Problem descriptionA cis-regulatory motif discovery pipeline, Mi, contains aseries of steps to separate transcription factor binding sitesfrom 'background noise'. Motif discovery pipelinesrequire collecting the positive example sequences, collect-ing negative example sequences, collecting relevant sup-porting data, running a separation filter to rank relevantsites based on an objective function and finally evaluatingor verifying putative sites within the context of each regu-latory module. Several factors outside of the pipeline itselfcontribute to the potential success of the discovery proc-ess, mainly: (1) the length of the sequences in the positiveand negative sets, (2) the number of sequences in the pos-itive and negative sets, (3) the distribution of transcrip-tion factor binding sites in the positive set, (4) the relativeentropy of the transcription factor binding motif, and (5)the fraction of null sequences (ones that do not contain abinding site) in the positive set [20]. It is likely there areadditional unknown variables that impact the accuracy ofan approach. For example, better background sequencemodels or sequences from closely related species (phylo-genetic footprinting) are known to affect performance.Our approach to finding the variables, parameters, andalgorithms that are best suited to annotating regulatory

The Motif Tool Assessment Platform (MTAP) enables a researcher to automate each of the steps in cis-regulatory motif dis-covery, evaluate tools, and propose changesFigure 1The Motif Tool Assessment Platform (MTAP) enables a researcher to automate each of the steps in cis-regulatory motif dis-covery, evaluate tools, and propose changes. MTAP is built as a platform where the data collection methods, motif discovery pipelines, and known binding sites are all modular components that can be edited and substituted to look at different aspects of the complex problem of cis-regulatory region annotation.



regions is an adaptive data collection and analysis plat-form (Figure 1).

Our hypothesis is that a platform that automates eachoperation in cis-regulatory motif discovery enables dis-covery of better methods for cis-regulatory motif predic-tion. Such a platform enables practitioners and users thenecessary flexibility to change characteristics of the dataand algorithm parameters, and to isolate and understandknown challenge cases. In addition, known methods withmany manual steps can be made more explicit. Presently,it is difficult to benchmark existing approaches becausethey have a great variety of algorithmic parameters thatallow the user to interactively optimize the discoveryprocess. While this approach is advantageous because itprovides flexibility in the discovery process, it alsopresents a challenge in algorithm benchmarking: how toformalize data collection, parameter fine tuning, andother intuitive steps taken by the most experienced usersof these methods. From a more practical perspective, onehas to wonder what evidence experienced practitionersuse to determine a set of algorithm refinement steps thatwill result in better performance over the new dataset (i.e.what qualities in the data indicate a usage procedure). Ourhypothesis is that the motif discovery process can begreatly improved if tool parameters are fine tuned basedon previous performance benchmarks. We present algo-rithm benchmarking over a set of known and related tran-scription factor binding sites (TFBS) as a method touncover the likely performance on the current dataset.This approach allows the refinement and modification ofmotif prediction pipelines. In particular, such a bench-marking approach allows for interactive modification ofknown TFBS in incomplete datasets, modification of theprocedures used in collecting positive, negative, and phy-logenatically related sequence samples, modification ofmethods for finding TFBS within the samples, and evenmodification of the methods used to score motif discov-ery pipelines.

Our central hypothesis is that by looking at all compo-nents of data and software along the cis-regulatory motifdiscovery process, we can refine our understanding of reg-ulation and discover pipelines with high accuracy. Theway we look at this complex problem is a result of how wecollect the data, how we build algorithms for discoveringregulatory motifs, the biological and manual processesand data structures used to create comprehensive annota-tions of cis-regulatory regions, and the methods wechoose to use in grading these motif discovery tools. Weview each of these components as parts that can andshould be modified as the specifics of the biological prob-lem at hand become clear. In this paper, we present a opensource prototype for cis-regulatory motif discovery algo-rithm evaluation.

ImplementationMTAP provides an automated method for ranking regula-tory motif detection algorithms. The underlying principleis to create a 'test', T, and an 'answer key', K = {k1, k2,...}.K is generated by parsing the raw database managementsystems (DBMS) of RegulonDB [24], DBTBS [27], PRO-DORIC [28], RegTransBase [29] and Transfac [30]. Theannotated genome, Gj, is then parsed for each regulatorybinding site annotated in K. Tk is a collection of sequencescorresponding to the surrounding regions of known bind-ing sites for transcription factor k in Gj. Tk is composed ofl instances of binding positions T1, T2,..., Tl in Gj as anno-tated by the database. To score a method, Mi, we constructan automated pipeline via a scripting language (Perl/Python) that runs the method. Each method has a differ-ent algorithmic approach and has different requirementsof pre- and post-processing. For each method, we con-struct background probability models, standardize inputdatasets, install program dependencies, and provide con-version and utility scripts so that each method can begraded fairly. We consider a fair test to be a test where eachprogram has access to the same information and mustmark the transcription factor binding positions in a stand-ard way. This standard marking is then assessed by com-paring the annotation of regulatory sites provided by thealgorithm with the known annotation in the database.

A schematic overview of our assessment method is pre-sented in Figure 2. Because databases contain many typesof binding positions (not just for transcription factors butalso for sigma factor binding sites, ncRNA interaction sitesand other binding proteins) these evaluations are an indi-cation of how well each algorithm can recover the bindingpositions for each element in the regulatory network andnot just transcription factor binding positions.

Generating upstream sequencesVarying upstream length allows us to explore the trade offbetween detecting long range interactions (large n) andhigh prediction accuracy (small n). MTAP contains twogenome substringing methods for generating upstreamsequences as shown in Figure 3. 'Completely-realistic' (cr)data generation is best suited to problems that convertgene lists to binding locations (such as a micro-array).'Semi-realistic' (sr) data generation is best suited to prob-lems that generate a set of sequences (for example whenwe know the regions of proposed binding sites). In mostcases 'sr' constructed data produces a more fair compari-son between pattern finding methods, but is not realisticwhen the tools are applied to a set of co-regulated loci.

Completely-realistic data generation is most often the nat-ural choice, but is limited because the upstream file maynot contain the motif for long range interactions. Also, the'cr' method may select a different downstream gene as part



of the data construction procedure. These two issues arerealistic in the cis-regulatory motif discovery process andare representative of current problems in cis-regulatorymotif discovery. This method is therefore representative ofcurrent methods used in constructing co-regulatedupstream sequences. Future work could be done to makeautomated upstream data generation more sensitive tothese types of issues.

The transformation function t(a, n) applies one of the sub-string operations (currently 'cr' or 'sr') to the reference

genome Gj to produce Tk. Tk is generated by selecting reg-ulator k and generating an upstream sequence uk, n for allinstances of k in Gj. Figure 4 provides a diagram of severalof the stages used to construct T = T1, T2,... for all instancesof k in Gj. Despite the fact that Tk is constructed by consid-ering only TF k, all of the other transcription factors in thedatabase are marked and scored if they fall withinsequence indices for any sequence in Tk.

Generating orthologous upstreamsAfter the upstream file, Uk, n, is constructed, MTAP collectsregulatory sequences from closely related genomes Gj+1,Gj+2,..., by using downstream orthologs. To do this, MTAPconstructs a list of all proteins in each genome Gj+1,Gj+2,.... Before the upstream is created, MTAP uses as orth-alog detection method to create an orthalog table, O (ourcurrent ortholog detection methods are best bi-directionalblast hits and RSD [31]). O contains a list of protein prod-ucts from Gj and Gj+1 and a confidence score Oc corre-sponding to the confidence in the orthology relationshipas determined by the ortholog detection method. For eachsequence in Uk, we select the nearest downstream gene, gi,

j ∈ Gj. MTAP then looks up any entries for gi, j ∈ O. If Oc >τ for some constant τ, MTAP appends a region n bpupstream of gj+1. If multiple entries exist for gi, j, MTAPappends the region upstream gj+1, i such that Oc is greatest.MTAP continues this procedure for all genomes Gj, Gj+1,....This procedure is then repeated to construct T1, T2,.... Anexample result is found in Figure 4D.

Several important points should be made about phylo-gentic foot-printing. First, the existence of a set of regula-tory binding sites exists in Gj does not imply the existenceof the same set of regulatory binding positions in Gj+1 inthe same positions. In our example in Figure 4, Tk does notcontain a binding position for K2 ∈ Gj+2. It is possible that

An overview of the MTAP running procedureFigure 2An overview of the MTAP running procedure. A. All known binding positions Gj in collected into upstream regions corre-sponding to each CDS in Gj. B. A transformation function t(a, n) creates a test for binding protein an bases upstream of each CDS (note that the transcription start site is often unknown or not correctly annotated). C. Background probability informa-tion for the entire genome is collected by comparing the upstream regions from the entire genome (or ∀ k) to the foreground regions selected by t(a, n). D. Pipeline p runs each step of the proposed method Mi. E. Mi creates a marking on the sequences in B that is evaluated against all marked transcription factor binding positions in B to score the performance of Mi in recovering binding sites for transcription factor a. This is then repeated for transcription factors b and c.

Demonstration of two different methods for obtaining the sequence n bases surrounding the known regulatory binding siteFigure 3Demonstration of two different methods for obtaining the sequence n bases surrounding the known regulatory binding site. In the figure 'cr' refers to completely realistic generation where we find the closest downstream CDS location in the genome and extract n bases upstream from that CDS. 'sr' refers to extracting n/2 bases upstream and downstream of the center of the binding site. We assume that programs reverse complement sequences appropriately as needed by the method as part of the discovery procedure but provide upstream sequences on the positive strand relative to the downstream CDS. The known motif 'blackfile' sequence is represented by a black line over the binding site k that refers to the region that is bound by the transcription factor. The red regions in the diagram illustrate the actual binding posi-tions known for motif k for those nucleotides that interact with the regulatory protein.



our criteria to classify orthologs is too stringent to find anactual ortholog in Gj+1. Notice that in our example Tl con-tains no upstream corresponding to g1 in Gj+2. In ourexample, this occurred because the homology thresholdcriteria excluded g2 from Gj+2. Again, these problems willexist in any current automated pipeline used for regula-tory motif detection. Some of these issues can be resolvedwithin the algorithms themselves: many algorithms incor-porate the hypothesis that zero or many instances of thecis-regulatory binding motif exists within the upstreamsequence. However, the prospect is very real that we couldintegrate sequences that are related but do not containbinding sites for transcription factor k. Currently, we havevery little understanding of the relationship between theevolution of regulatory sequences and coding sequences.It is quite possible that phylogenetically related sequencesintroduce additional 'noise' with very little signal in regu-latory sequences for genes that do not provide criticalfunctions. These complex relationships warrant an in-depth study of regulatory evolution as it relates to codingsequence evolution – which is the subject of future work.For now, we wish to use this approach to benchmark sin-gle genome methods as they compare to multiple genomebased methods. To our knowledge, MTAP is the firstmethod that allows additional sources of information

(sequence conservation) to be automatically integrated aspart of the evaluation process.

Constructing background sequencesOnce Tk has been constructed for all K, we present threepossible background sequence files to motif discoverypipelines: (1) all upstream sequences of length n in Gj, (2)all upstream sequences of length n that exist in some Tk,and (3) a fasta formated sequence of Gj. Programs thatincorporate phylogeny have different backgroundsequence requirements. Such programs require a back-ground phylogenetic tree constructed from extracting 16SrRNA from each genome in the study. Some programsrequire pre-processing steps to calculate an HMM or GCcontent of the test sequences. Other programs require abackground probability distribution of all upstreamsequences in Gj. We compute each of these requirementsfor each pipeline in our pre-processing stage and providethe files to each pipeline.

Compensating for unknown TFBSUnknown TFBS that exist in Tk complicate the assessmentprocess. Tools could predict true sites that are currentlyunknown or un-annotated. To exclude unknown sitesfrom Tk, we construct a Markov chain, MCm, (of depth m).

An overview of the pre-processing steps taken in MTAP: AFigure 4An overview of the pre-processing steps taken in MTAP: A. The source genome Gj with genes g1, g2, g3 with known binding sites k1, k2, k3, k4, k5 (found clockwise around the genome from the origin); K1 and K2 are the known regulatory proteins that bind to the transcription factors B. Two additional genomes, Gj+1 and Gj+2, that are phylogentically closely related to Gj. Gj+1 and Gj+2 also have binding sites C. The background phylogenetic tree constructed via extracting 16S rRNA D. Two tests T1 and T2 corresponding to known positions from K1 and K2.



For each sequence, Si, in Tk, si is sampled with MCm. Wethen generate an alternative sequence, S'i, through a ran-dom walk through the sates in MCm. For all TFBS in K thatoverlap si, we insert the true TFBS sequence from si into s'i.In this way, we use MCm to 'scramble' the upstreamsequences in the test and then re-insert the known motifsback into the sequences at the same positions found in thesource genome. It is also informative to have instances oftrue negative sequences produced via MCm, so we produceinstances of Tk with no inserted TFBS. Orthologousupstream sequences need not be scrambled if they are notscored. These synthetic sequences serve to make a ' morefair' test in those cases where very few of the known motifsare marked. However, such sequences may not correctlyincorporate the biological process that generates truesequences. To accommodate this, we also insert TFBS intoa random sequence from the set of all upstream sequencesin Gj.

Constructing motif discovery pipelinesIn constructing motif discovery pipelines, our intention isto include pipelines for as many tools as possible. How-ever, building parsers, installing scripts, and optimizing apre-processing, post-processing, and runtime pipeline foreach of over 150 programs is extremely labor intensive.Our main obstacle is in finding executables that can runon a variety of architectures in a Linux cluster. In manycases, we attempted to contact authors to arrange a port oftheir tool to a Linux cluster. Many authors are extremelyhelpful and we would like to thank them for the adviceand guidance of how to use and port their tool. However,we realized that even if we have a stand-alone version of amethod working on our architecture, MTAP users will stillneed to install many of the tools directly from the authors(if we do not have the legal authority to distribute thetool). Our strategy for including a tool is as follows:

• Include tools that are the most popular.

• Include tools that present different and novel scoringfunctions for differentiating background sequences fromtranscription factor binding sites.

• Include tools that integrate diverse types of informationfrom public sources.

• Do not include tools that do not have a downloadableexecutable or can not be compiled locally, as such toolscan not easily be run thousands of times for assessmentpurposes.

• Do not include tools that do not have support for differ-ent architectures and operating systems (tools we can notrun on our computers).

• As we can not redistribute tools with strict licensingagreements or 'abandonware', inclusion of such tools isleft to the user community.

Our platform is provided open-source for practitionerswho would like to develop their own pipelines to inte-grate their tool into MTAP. This approach has an advan-tage in that the developer of the tool is also the developerof the pipeline for evaluating the tool. This could providean edge as the tool developer will understand the limita-tions and usage procedures best. We provide our assess-ment pipelines within our framework for open accessreview and improvement.

In this work, we developed pipelines for AlignACE [8],Ann-Spec [9], ELPH [32], Gibbs [3], Glam [10], MEME[33], PhyloGibbs [16], PhyME [15], and Weeder [11,12].For each of these tools in the Tompa et. al benchmark, wedeveloped an automated system that was as close to thespirit of the procedure used by the algorithm practitionersas practical. For example, our AlignACE pipeline containsa pre-processing script for calculating GC content of theupstream file and a postprocessing script to mask lowcomplexity repeats using RepeatMasker. The pipeline thenparses the raw AlignACE output which results in a rankedlist of predicted transcription factor binding sites sortedvia the AlignACE MAP score. The pipeline accepts thehighest c scoring motifs by MAP score and then deter-mines confidence by calculating the group specificityscore as provided by CompareACE. A high group specifi-city score and MAP score indicates a high degree of confi-dence in the prediction provided by the AlignACEpipeline. For those tools that are not in the Tompa assess-ment, we carefully followed the usage guides and refinedour pipelines using MTAP to produce the best results pos-sible.

For some methods, the code is not available and themethod paper does not make it clear if certain data prep-aration procedures can be accommodated by the method.For example, some methods account for upstreamsequences that are reverse complemented while others donot. Some methods account for zero, one, or many motifinstances on one strand while others do not. Some meth-ods allow for variable motif widths, while others requirean explicit window (thus we need to write a procedure torun the algorithm over all reasonable multiple widths andthen rank and combine the results from multiple runs).We assume that motif discovery pipelines are sufficientlyrobust to account for these technical details and weattempted to make our pipelines robust in this way. How-ever, in constructing these pipelines in this way, we mayhave overlooked some aspects of the algorithmicapproach that would make our pipelines not representa-tive of the original author intent (i.e. our pipeline may not



be representative of the best performance possible byexpert manual application of the method). We addressedthis issue in two ways. First, we built pipelines for multi-ple implementations of similar approaches (e.g. Gibbsand ELPH). Second, we refined each pipeline based on theobjective function found in the literature and the bench-marks obtained via MTAP. To guard against over-fitting ofa pipeline to a certain dataset, we did not allow modifica-tion of the code or implementation of any procedures notrecommended explicitly by the authors. We performedbenchmarks over the Tompa assessment datasets, Regu-lonDB [24], and DBTBS [27] and selected transcriptionfactors at random to validate if the proposed changeimproved results. As the search space is large, there mayexist a set of pre-processing, post-processing, and runtimesteps that may improve the performance of our currentpipelines. If used in this way, MTAP provides the frame-work for method developers and method users to formal-ize and improve motif discovery pipelines.

Pipeline evaluationThree levels of specificity can be considered when evaluat-ing the accuracy of Mi: (1) Mi correctly predicts the regionbound by some transcription factor k, (2) Mi correctly pre-dicts the binding site of k, and (3) Mi correctly predicts theamino acids in TF that interact with specific nucleotideswithin the regulatory region. Mi may also correctly predictthe type and strength of the interactions. Predictions ofType 1 are analogous to a gel shift assay in that we canidentify a part of the regulatory region bound by a protein.Predictions of Type 2 are analogous to a DNA foot-print-ing assay. Predictions of Type 2 are more specific in thateach region of interaction between k and the DNA is iden-tified. For example, if a transcription factor is a dimer, twointeraction sites are identified by predictions of Type 2whereas only one interaction site is identified by a predic-tion of Type 1 (a site that contains both sites in a Type 2prediction). The third type of prediction is analogous todetermining the crystal structure interaction points of thetranscription factor – DNA complex. While the specificityand information provided by the third Type of predictionis far greater than annotations of Type 1 or Type 2, suchdata is difficult to obtain and few methods make predic-tions at this level. We therefore generate two annotationfiles: a 'redfile' and a 'blackfile' corresponding to the sitelevel and region level respectively. To generate anupstream file, we use the blackfile annotation. To assessthe performance of an algorithm we evaluate the predic-tions versus the redfile annotation.

The redfile annotation RED = I1, I2,... contains a set of

intervals Ik = (u1, v1), (u2, v2),... that correspond to the start

(uh) and stop (vh) positions in Gj corresponding to bind-

ing locations for transcription factor k. To compare the

redfile annotation to the motif tool predictions, Ui

(upstream sequence i in Tk) is used as a scaffold to place

annotation elements. To annotate the known sites at the

nucleotide level, we mark each base, j, in Ui if uh ≤ j ≤ vh ∀

K. (vh ≤ j ≤ uh ∀ K if Ui is on the opposite strand) The pre-

dicted binding locations, , predicted by Mi are

parsed and translated into a ranked list of predicted bind-

ing sites, each of the form . The ranked

list contains elements TFL1, TFL2,... sorted according to

the confidence that Mi has in the prediction accuracy.

MTAP accepts the top c elements from the rank list forevaluation and inserts them onto the upstream scaffold.MTAP then marks each position, j, as a predicted nucleo-tide if there exists some predicted binding site, B, thatoverlaps it. At the nucleotide level, we collect the overlapstatistics shown in Table 1.

The first four core statistics (nTP-nucleotide true positives,nFN-nucleotide false negatives, nFP-nucleotide false posi-tives, and nTN-nucleotide true negatives) are collected bysumming the number of each of the occurrences shown inTable 1 in Tk. The site level statistics (sTP-site true posi-

tives, sFN-site false negatives, and sFP-site false positives)are the final three core statistics provided by MTAP. A sitelevel statistic encompasses the idea that a group of adja-cent nucleotides marked as binding positions for tran-scription factor k by Mi is representative of a binding site

annotation. A site is annotated as a true positive if

overlaps Ix ∀ x by more than τ percent of Ix. For example,

consider two overlapping sites, a known site Ix of 12 con-

secutive nucleotides and a predicted site B of 8 consecu-

tive nucleotides. Given τ = .25 If Ix shares 3 nucleotides

with B it is annotated as a sTP. If Ix shares only 2 nucleo-

tides with B it is annotated as a sFP. Once a site, Ix, over-

laps a prediction, it can not be annotated as a sFN. All

Bk u vu, ,

B Bk u vu

k u vu

1 1 1 1 2 2, , , , ,...,

Bk u vui, ,

Table 1: Nucleotide Level Statistics. ui, j represents the upstream regulatory sequence j at position i, RED is the set of annotated database positions found in the 'redfile', B represents a binding site predicted by method M.

Statistic Definition

nTP If ui, j exists in both RED and B.nFN If ui, j exists in RED and not in B.nFP If ui, j exists in B and not in RED.nTN If ui, j exists in neither RED nor B.



remaining sites, I1, I2,..., Ix, that do not have an overlap-

ping prediction for tool Mi are annotated as sFN. The site

level statistics are in Table 2.

Tompa et. al set τ to 25%. The logic was that such an over-lap makes discovery and refinement of the TFBS possiblein the lab. In large scale genome annotations, we findsuch a threshold to be too strict. For example, many TFBSare not annotated specific enough in databases such asRegulonDB. This results in TFBS that exist in K that can belarge. Such annotations are actually representative of theregion of binding of the transcription factor (a blackfileannotation) and not the binding sites (a redfile annota-tion). Because many motif discovery programs have fixedmotif widths (e.g. 8), a threshold of 25\% would not besufficient to mark a sTP (e.g. a site of width 60 and a siteprediction of length 8). We could choose to rank site levelmotifs based on a percentage of the prediction widthinstead of the regulatory motif width, but this would givean unfair advantage to methods that predict larger sites.Our current approach is to set τ equal to the maximumannotated site width in the dataset divided by the mini-mum expected motif width predicted by our suite of pro-grams (usually 8) times 25%. Our logic is that a degree ofoverlap indicates that computational and biologicalrefinement of site predictions can still find the site. Thatsaid, manual curation of datasets to ensure binding siteannotations rather than region annotations is necessary.Standards across regulatory binding site databases todelineate each of the three levels of biological data wouldgreatly increase evaluation accuracy of motif discoverytools. Also, as not all tools provide a mapping for a set ofsites to a putative regulator, these statistics are currentlynot reflective of which regulator is annotated by a sitelevel prediction.

Following Tompa et. al we define the statistics in Figure 5to perform the assessment.

These statistics enable us to determine the quality of algo-rithm predictions and therefore infer which tools may bebest suited to discover unknown motifs under similar sit-uations. MTAP evaluates each of these statistics by com-paring the predictions found in the program output witha set of known binding positions of the same type. Foreach instance found in the known dataset, a motif predic-tion tool is run and then parsed. The prediction is com-

pared to the known binding site via the seven key statisticsin Table 1 and Table 2. These statistics will then be used toassess the overall performance of the algorithm. MTAPproduces an output file for each regulatory binding motifin K. Users can sum the raw statistics in these files as theysee fit. For this paper, we collect the seven raw perform-ance statistics for each motif in the assessment and thensum these values as if the collection of runs was actuallyone run.

In some cases, such sums do not graphically represent thecontribution of each element in the set to the total per-formance score. To address this, we also developed agraph that iterates over all runs in the test (T1, T2,..., Tk)The graph produced is a modified receiver operating char-acteristic (ROC) curve that combines statistics from mul-tiple runs [34]. We use the following algorithm to produceour ROC graphs:

for each motif in the dataset:

input xTP, xFP, nFN, nTN, sTP, sFP, sFN

P < -calculate xSP and xSN for this motif

for all motifs in the dataset:

totalSP = sum(xSP)

totalSN = sum(xSN)

sort(P.xSN)

for all ties in P.xSN; sort(P.xSP)

for i in P:

plot(xSN/totalSN, xSP/totalSP)

This produces a curve that travels straight up and then tothe right if all motifs in the dataset are predicted correctly.The curve will travel straight to the right and then up ifvery few of the motifs are predicted correctly. Finally, if thetool predicts sites correctly as often as it predicts sitesincorrectly, a line along the diagonal of the ROC graphwill be plotted. However, unlike machine learning algo-rithms where such a graph is often no better than a ran-dom classification of sites, using this method there issome value in graphs along the diagonal because we onlyallow 3 site predictions to be placed on the scaffold.

Known motif databasesIn MTAP, we have implemented interfaces to each of thefollowing databases: RegulonDB [24], DBTBS [27], PRO-DORIC [28], RegTransBase [29] and TRANSFAC [30] (via

Table 2: Site Level Statistics.

Statistic Definition

sTP Number of known sites overlapped by predicted sites.sFN Number of known sites not overlapped by predicted sites.sFP Number of predicted sites not overlapped by known sites.



the Tompa Benchmark Set). Each of these databases wasconstructed with different goals and none were builtexplicitly to evaluate motif prediction tools. Therefore,some database cleaning is required to make these datasetsmore appropriate for algorithm assessment. We providetwo procedures: (1) We require that the known bindingsite occur in at least nl locations in Gj (we set this to 3 forthe results below). MTAP will not create upstream files forTFBS that do not meet this threshold. (2) We provide ascript to analyze multiple sequence alignments and infor-mation content of a set of known motifs. We do notrequire that the TFBS have a consensus sequence as anno-tated by the database – our logic being that users can elim-inate sites without a strong consensus from their analysisby using our information content script and keeping onlythose sites that exceed a set threshold.

Duplicate instances of the same upstream file are elimi-nated to prevent bias before they are processed by the pro-grams (however the originals are stored for those userswishing to do further analysis). MTAP does not acceptTFBS that do not contain a start and end position in Gj.For TFBS that are inconsistent, users can eliminate thesites and re-score the TFBS. Our primary goal in collectingdata is to provide automated methods that can improve asthe datasets become more comprehensive. At this time,both PRODORIC [28] and RegTransBase [29] have veryfew TFBS with enough binding positions in the samegenome Gj for us to provide a comprehensive benchmark(the goal of these databases is more focused on trackingconversation of TFBS across species). Though labor inten-sive, high annotation density in datasets such as these pro-vides the greatest insight into evaluating computationalmethods that predict transcription factor binding sites.

ResultsIn this section we will provide illustrative examples ofhow our benchmarking technology can be used to evalu-ate several important parameters in cis-regulatory motifdiscovery. To construct a series of tests for evaluation, weextracted 2247 known transcription factor binding posi-tions from RegulonDB [24] corresponding to known posi-tions from Escherichia coli K12, and 680 known motifsfrom DBTBS [27] corresponding to positions from Bacillussubtilis. Then, we extracted 522 positions from Eukaryotedataset developed by Tompa et al.[20]. Because Regu-lonDB has the most comprehensive coverage, we usedRegulonDB to assess the impact of including two addi-tional genomes (E. coli strain W3110, and E. coli strainUTI89 in addition to E. coli K12). We used our approachto include regulatory regions from these strains in evalu-ating PhyME and PhyloGibbs. Our primary goal is toshow how MTAP can be used to evaluate each of the cen-tral questions in cis-regulatory motif discovery. To illus-trate the capabilities of MTAP we will use it to evaluate theimpact of four of the key factors for regulatory motif dis-covery introduced earlier in this paper, mainly: (1) thelength of the sequences in the positive and negative sets,(2) the number of sequences in the positive and negativesets, (3) the distribution of transcription factor bindingsites in the positive set, (4) the relative entropy of the tran-scription factor binding motif. Through these illustrativeexamples we intend to show the exploratory power ofMTAP towards discovering M → T mappings.

Benchmark automationOur first goal was to illustrate that our system of automa-tion can provide similar results to manual runs. To dothis, we downloaded the benchmark results from the

Statistics for Evaluating Motif Prediction Algorithm ImplementationsFigure 5Statistics for Evaluating Motif Prediction Algorithm Implementations.



Tompa assessment and compared these results to resultsobtained from our pipelines for AlignACE, Ann-Spec,Glam, MEME, and Weeder. Results for nSn, nSp, and sSnfor our platform versus the Tompa benchmarks are shownin Figure 6. Overall, sensitivity over our dataset is higher,and specificity suffers slightly. There are a few reasons forthis. First, occasionally experts in the Tompa assessmentpick a TFBS that is not the highest scoring motif. We thinkthey do this because of their experience with known TFBSin Transfac. Also, MTAP allows the top c (three in thiscase) predictions to be scored as suggested as an improve-ment by Tompa to increase sensitivity. The original assess-ment only allowed the top prediction to be scored. Inmany cases, high specificity is obtained by the tool notmaking a prediction.

Consequently, we feel pipelines that increase sensitivity ata low cost to specificity provide a good trade-off. Overallperformance is similar over this dataset. This provides evi-dence that our automation pipelines work well relative tomanual runs by experts. However, there are still manythings better understood by the experts and we continueto refine our pipelines as more information becomesavailable.

Automated assessmentWe next used MTAP to produce a benchmark over the sitesannotated in RegulonDB. We chose to run MTAP in 'cr'mode over upstream sequences of 400 bp (shown in Fig-ure 7). Overall specificity over this dataset is quite high.This data indicates that tools such as MEME and Weederachieve higher sensitivity (at both the site level andnucleotide level) without substantial losses to specificityon E. coli TFBS. Summed over all TFBS, nCC ranged from-0.03 for PhyloGibbs to 0.06 for MEME. ELPH and Ann-Spec showed the least correlation in this test with a nCCvalue of 0.01 each. Overall correlation is extremely weakfor any tool in the test.

Positive predictive value ranged from 0.13 (PhyME) to0.28 (Glam) at the nucleotide level and 0.1 (PhyME) to0.35 (Glam) at the site level. The low values for nPPV andsPPV for PhyME can be attributed to the large number ofsites that PhyME did not predict any binding positions.This behaviour is most likely explained by the largeamount of sequence conservation found between theupstream regions in these different strains of E. coli. It islikely that the multiple sequence alignment stepemployed by PhyME did not encounter enough sequencedivergence in this test set to distinguish between regula-tory binding positions and background sequence conser-vation. The PhyloGibbs algorithm did not appear toencounter the same difficulties. However, PhyloGibbs didnot appear to gain a substantial performance gain overWeeder even though it had regulatory sequences from

related strains and Weeder did not. AlignACE and Ann-Spec differ from the other programs in that they sacrificespecificity slightly for increased sensitivity. Over many ofthe regulatory regions both AlignACE and Ann-Spec pro-vided correct predictions somewhere in the list of pre-dicted sites when other tools did not.

Figure 7 provides additional evidence that implementa-tion and user parameters are not as important as the algo-rithm approach and discrimination function. In thisgraph, the two programs (ELPH and Gibbs) that useweight matrices for motif discrimination and gibbs sam-pling as the algorithmic optimization procedure havealmost identical performance profiles despite the fact thatthe parameters provided to each of these tools are quitedifferent in our pipelines.

In the original Tompa Assessment, Weeder had more dis-crimination power than other approaches. While stillquite good, Weeder does not appear to have the sameadvantages in this test. We feel that this gives further evi-dence that the organism and type of regulatory mecha-nism greatly impact the expected performance of a tool.

Number of upstream sequencesAs the upstream size increases relative to the size of thetranscription factor binding sites, the background signalsfound in the dataset also increase. This makes discrimina-tion of true transcription factor binding positions moredefault as the number of regulatory regions and length ofeach region increases. We wanted to explore the relation-ship over known transcription factor binding sites. To dis-cover the relationship between |Tl, i| and |Σi Tl, i| we set |Tl,

i| = 400 bp and ran pipelines for AlignACE, AnnSpec, Elph,Glam, Gibbs, MEME, PhyME, Phylogibbs, and Weeder(the notation |x| means the sequence length of x). Figure8 shows nCC versus |Σi Tl, i|. As the number of sequencesincreases, we expect the absolute value of nCC for a toolto increase once the number of co-regulated sequencescontaining the same signal surpasses some threshold.Once the signal is detected and we continue to increasethe number of sequences in the upstream dataset, theabsolute value of nCC should decrease as the 'noise' intro-duced for each added sequence far exceeds the signals. Ifthis is the case for the regulatory regions in E. coli K12,then the data indicates that 3 co-regulated sequences pro-vides enough signal to be detected by many of the toolstested in this assessment. While nCC is quite low; the per-formance over this test set does not indicate that regula-tory binding sites are more easily detected if we have moreinstances of them (as would be suggested by statisticallearning theory; e.g. if we have more recorded instances ofa phrase uttered by more people an HMM can detect thephrase more easily). It could be that global regulators thathave more binding positions over the genome also have




8 performance statistics for 9 motif prediction pipelines generated by MTAP over RegulonDB 400 bp upstream regulatory sequencesFigure 68 performance statistics for 9 motif prediction pipelines generated by MTAP over RegulonDB 400 bp upstream regulatory sequences.


more variability in their binding sites. This makes sense ifone considers that each instance of a global regulatorybinding site must have a different binding energy to con-trol each of the many genes regulated at different rates.The regulatory binding positions with the most occur-rences indicate a higher nCC averaged over the motifdetection tools. This could be explained by a superior abil-

ity of the algorithms to recognize transcription factorbinding sites once the number of binding instances islarge relative to the length of Gj.

Figures 9 and 10 show the site level sensitivity and nucleo-tide specificity for AlignACE, AnnSpec, Elph, Glam,Gibbs, MEME, PhyME, Phylogibbs, and Weeder with |tk, i|

8 performance statistics for 9 motif prediction pipelines generated by MTAP over RegulonDB 400 bp upstream regulatory sequencesFigure 78 performance statistics for 9 motif prediction pipelines generated by MTAP over RegulonDB 400 bp upstream regulatory sequences.




nCC versus total number of base pairs in Tk over RegulonDB 400 bp sequencesFigure 8nCC versus total number of base pairs in Tk over RegulonDB 400 bp sequences.

Site sensitivity versus the number of sequences in the upstream file for each Tk in RegulonDB with more than 2 unique binding positions in GjFigure 9Site sensitivity versus the number of sequences in the upstream file for each Tk in RegulonDB with more than 2 unique binding positions in Gj.


= 400 bp. Overall, specificity of these tools maintains aconsistent level or increases as the number of sequences inTk increase. The inverse is true for sensitivity. As thenumber of sequences increase, increased instances of reg-ulatory signal does not lead to increased tool sensitivity.This data indicates that as the number of regulatory sig-nals in the foreground increases linearly, the background'noise' increases quadratically. High specificity is mostlikely the result of increased reluctance on the part of toolsto make predictions as the number of sequences increases.As the number of sequences increases, the number of co-occuring motif instances also increases. This makes itmore likely that multiple occurrences of motif for arelated transcription factor may occur in the sameupstream set. This motif cross-talk may play a significantroll in defeating current detection methods.

Length of upstream sequencesTo further explore the impact of the size of Tk, we usedMTAP to extract 500 bp and 200 bp upstream regionsfrom motifs found in RegulonDB and ran pipelines foreach of the tools. Most all of the tools did not show anysignificant correlation between size of Tk and predictionperformance (data not shown). We believe that this indi-cates a problem with these classification algorithms overthe datasets and not to variation in the size of Tk. It could

be that these algorithms only classify certain subclasses ofregulatory binding profiles accurately. Here we present theresults from Weeder to demonstrate the impact of varyinglength of the upstream file. For Weeder, there is a slightimpact on performance if the size of Tk is varied. To dem-onstrate this point, we calculated sensitivity and specifi-city over the 10 largest and smallest upstream files at 200bp and 500 bp, respectively (Tables 3 and 4). This datashows that Weeder nucleotide specificity is not greatlyimpaired by the size of the dataset in these tests. However,we do see a marked decrease in sensitivity both at thenucleotide and site level given larger datasets. The largestdataset has an average sSp of 0.41 while the smallest data-set has a average sSp 0.58 – a substantial difference. WhilenSn increases as a trend from smaller to larger tests, thepredicted window size is on average much smaller thanthe motif size resulting in many missed predicted nucleo-tides. Weeder predicts individual transcription factorbinding locations can be detected fairly well up to 33066bp over this dataset. Increasing the dataset size further pre-cipitates a steady drop in sensitivity until predictions areno longer useful.

To further understand the impact of upstream length onmotif detection performance. For each tool we ran MTAPand generated ROC graphs for lengths 20 bp, 50 bp, 100

Nucleotide specificity versus the number of sequences in the upstream file for each Tk in RegulonDB with more than 2 unique binding positions in GjFigure 10Nucleotide specificity versus the number of sequences in the upstream file for each Tk in RegulonDB with more than 2 unique binding positions in Gj.



bp, 200 bp, 300 bp, 400 bp, 500 bp, and 800 bp upstreamof the gene for DBTBS and RegulonDB. To understand theroll of data generation methods, we generated both com-pletely-realistic ('cr') and semi-realistic ('sr') data. Here weprovide the results for ANN-Spec in Figure 11 which isillustrative of these results. The most important character-istic of note is between the performance curves of DBTBS(Figure 11a and 11b) and RegulonDB (Figure 11c and11d). Figure 11c and 11d are more smooth than Figures11a and 11b. This is because the number of sites in theregulonDB dataset is much greater than the number ofsites annotated in DBTBS.

Commonly, researchers would like to know what motifdiscovery program is best suited to a particular organism.These findings suggest that this question can not beaddressed currently because of the different amounts ofcoverage found in each dataset. If the coverage of TFBSover the genome were greater in DBTBS, the curves in Fig-ure 11 would show an accurate comparison of the sensi-tivity-specificity tradeoff in running the Ann-Spec pipelineon each organism. Figure 11a and 11c refer to semi-realis-tic data generation (the known binding site is in the mid-dle of the upstream sequence). As the window size

increases, there is a precipitous drop in Ann-Spec's abilityto correctly recover the site. High nSn and nSp areexpected at 20 bp 'sr' as most any prediction will overlapthe true TFBS. As the window size increases, we expect theperformance to remain the same for tools with high recov-ery rate, but performance should decrease for tools thathave poor accuracy. Figure 11c shows a performance dropin nSn and nSp as the length of the upstream sequenceincreases.

Ann-Spec does appear to recover many sites regardless ofthe upstream length as noted by the close cluster of per-formance graphs on the right hand side of Figure 11c.Although similar, 'cr' generated data appears to havehigher recovery rates for short windows. We believe thatthese recovery rates are most likely related to the relation-ship between the location of the signal for the TFBS andthe location of the signal for the σ70 binding site – butthis requires more exploration.

Site distributionAlgorithm practitioners commonly work from theassumption that TFBS have more information than thesurrounding sequence. If this is so, the total number of

Table 3: Sensitivity and specificity for the 10 largest and smallest datasets from a Weeder run over all of RegulonDB at 500 bp.

10 Smallest Motif DatasetsSize nSn nSp sSp

1503 0.35 0.81 0.51503 0.12 0.92 0.331503 0.26 0.82 0.671503 0.24 0.79 0.51503 0.34 0.81 0.671503 0.23 0.61 0.672004 0.32 0.62 0.732004 0.14 0.72 0.52004 0.31 0.78 0.52004 0.19 0.74 0.5

Avg: 0.25 0.76 0.5610 Largest Motif Datasets

Size nSn nSp sSp

7515 0.1 0.79 0.318016 0.4 0.84 0.7510521 0.34 0.73 0.5414028 0.25 0.91 0.4416533 0.18 0.89 0.3417535 0.15 0.87 0.4819038 0.11 0.92 0.2422044 0.1 0.87 0.2933066 0.13 0.83 0.4363126 0.09 0.93 0.28

Avg: 0.18 0.86 0.41

Table 4: Sensitivity and specificity for the 10 largest and smallest datasets from a Weeder run over all of RegulonDB at 200 bp.

10 Smallest Motif DatasetsSize nSn nSp sSp

603 0.27 0.65 0.5603 0.4 0.76 1603 0.06 0.91 0.33603 0.06 0.79 0.14603 0.19 0.85 0.5603 0.18 0.81 0.33804 0.47 0.66 0.88804 0.27 0.84 0.57804 0.08 0.88 1804 0.25 0.86 0.5

Avg: 0.22 0.80 0.5810 Largest Motif Datasets

Size nSn nSp sSp

3015 0.3 0.67 0.653216 0.47 0.84 0.814221 0.43 0.87 0.645628 0.31 0.82 0.616633 0.24 0.81 0.467035 0.28 0.78 0.827638 0.25 0.77 0.548844 0.42 0.74 0.7213266 0.15 0.89 0.3725326 0.16 0.91 0.42

Avg: 0.30 0.81 0.60



TFBS in Tl (or the site density) should impact the perform-ance of a tool. One would expect that a low density of siteswould result in higher recovery rates (and vice versa for ahigh density of sites). To test this, we computed the totalnumber of sites in Tl ∀ l as annotated by RED. The sitedensity for Tl is the number of sites from RED that exist inTl over the number of sequences in Tl. We graphed sSn,nSp and nCC versus density for each of the tools. For allof the tools, site density did not appear to have any effecton sSn, nSp and nCC over 400 bp upstream sequencesfrom RegulonDB. Figure 12 shows no apparent decreasein nCC as the site density increases. It could be that theredoes not exist enough complex regulatory regions in E. colito notice an impact on performance. It is likely that wewould see a different result for organisms with more com-plex regulatory mechanisms, so we can not rule out sitedensity as a factor in accurate regulatory motif prediction.These results do indicate that the 'background' signal is farmore complex than originally thought and every programhas difficulty distinguishing the foreground TFBS frominterfering background signals.

Site entropyIf the background signals are simple and the binding sitesare complex because they must be conserved by evolu-tion, the relative information content of the TFBS shouldbe greater than the information content found in thebackground signal. If this is the case, there should be arelationship between the information content of the bind-ing site and prediction accuracy. To test this, we calculatedinformation content for each site in RegulonDB usingBioPython and plotted it against nSn, nSp, sSn, (notshown) and nCC (Figure 13). Information content of thesite alone does not appear to be a determinative factor inhow well these programs can recover the site. Perplexed bythis result, we plotted nSn, nSp, sSn, and nCC versusinformation content divided by the number of upstreamsequences in Tk (total number of bp in Tk). The result forsSn is shown in Figure 14.

Figure 14 shows that for some tools, the ratio of informa-tion content of the TFBS to the number of sequences inthe upstream file can play a roll in sSn. Stronger informa-

ROC curves for Ann-Spec (20 (blue), 50 (green), 100 (red), 200 (cyan), 300 (magenta), 400 (yellow), 500 (black), and 800 (lower red) basepairs upstream of the CDSFigure 11ROC curves for Ann-Spec (20 (blue), 50 (green), 100 (red), 200 (cyan), 300 (magenta), 400 (yellow), 500 (black), and 800 (lower red) basepairs upstream of the CDS.



tion content and less background information impliesbetter performance for Weeder, MEME and AlignACE. Forthe other tools, it is not clear if this relationship is present.

DiscussionThe most practical outcome of this work is an ability torank motif prediction tools based on a known TFBS data-set. Tools with favourable performance characteristics canthen be used to discover additional binding sites in closelyrelated genomes or used in conjunction with experimen-tal validation to improve the quality and comprehensive-ness of existing TFBS databases. In RegulonDB, forexample, of the methods tested Weeder, MEME and Alig-nACE present advantages over the other tools. AlignACEpresents a more diverse list with more false positiveswhereas Weeder and MEME present true motif instancesmore often than the other tools. Motif prediction tools arecomposed of both a motif scoring function and a discrim-ination algorithm. The scoring function accepts a motifrepresentation (e.g. a probability weight matrix) and then

calculates the motif prediction candidates based on a dis-crimination function (e.g. maximum likelihood). Dis-crimination algorithms present a computational strategyto approximate the multiple sequence alignment of pre-dicted binding positions relative to all multiple sequencealignments found in the background signal. Both Weederand AlignACE have original scoring functions that couldexplain their utility on RegulonDB. MEME on the otherhand uses expectation maximization as its discriminationalgorithm. It could be that MEME benefits from this strat-egy over programs utilizing gibbs sampling. On the otherhand, it is also likely that the predictions provided bythese programs happen to be better over RegulonDB byrandom chance.

ConclusionIn this paper we have presented a general method, MTAP,for evaluating cis-regulatory motif discovery tools. MTAPis novel and completely different from other approachesin that it allows both algorithm practitioners and users the

Density of the binding sites in Tk versus nCCFigure 12Density of the binding sites in Tk versus nCC.



flexibility to dynamically change attributes of data collec-tion, algorithm parameters, and assessment. Our resultsindicate a clear need toward improvements in each ofthese areas. In our results, we explored four of the mostcommonly attributed factors to prediction accuracy:upstream file size, length of upstream sequences, TFBSdensity, and TFBS information content. The resultsobtained by MTAP in this assessment do not point towardany of these individual factors as playing a critical roll infinding TFBS. The results do indicate that the ratio ofinformation content over upstream file size may have aninfluence on performance for some tools.

The primary innovation in MTAP is not that we produceadditional tools or additional benchmarks, but it is thatwe produce a platform that can be used to improve toolsand the benchmarking process. The results presented inthis paper indicate that the methods used to prepareupstream data, the algorithm, the parameters, and themethod used in evaluation all play important rolls in howwe look at the cis-regulatory motif discovery problem.

In the past, many authors have dismissed bacterial regula-tory motif detection as a far simpler problem than eukary-ote regulatory motif detection (e.g. Xing et. al [35]). Whileit is true that the annotated bacteria regulatory modulesdo not have the same level of complexity and combinato-rial control, our results indicate that even for this 'simple'problem, regulatory motif detection methods have sub-stantial room for improvement.

Unlike other approaches, MTAP allows for the integrationof regulatory regions from other species through an auto-mated procedure. It remains to be seen which integrationprocedure and what combination of closely related anddistantly related species improves performance for toolsthat incorporate regulatory regions from phylogenticallyrelated species. This is the subject of future work. At themoment, it does not appear that two closely relatedstrains are enough to improve performance over conven-tional single sequence approaches. It would be interestingto extend our current implementation of MTAP and assesstools that integrate data from expression arrays and ChIP-chip arrays. Such approaches should lead to an increase in

nCC versus Information Content (IC) for Tl 400 bp from RegulonDBFigure 13nCC versus Information Content (IC) for Tl 400 bp from RegulonDB.



performance, but the parameters and procedures for thisare not currently clear.

It is not currently understood what features of the datamake the problem of finding TFBS so difficult. The keyadvantage of MTAP is that it allows us to explore these fea-tures and propose new models that are more accurate androbust. It is important to understand the performancecharacteristics of the models that have been proposed inthe past before we integrate additional information. Inthis way, we can understand more completely if the rela-tionships found by more sophisticated techniques are realor if they could have occurred by random chance. Furtherexploration into motif representation, motif scoring, andthe relationship between binding sites is still necessary ifwe are to accurately predict regulatory binding sites on thecomputer.

Competing interestsThe authors declare that they have no competing interests.

Authors' contributionsDQ, HA, and DB proposed the design of the system. KDimplemented the prediction pipelines and parsers foreach tool. DQ and KD integrated the databases. DQ andMS implemented the system and migrated the system to aclustered environment. DQ, KD and MS tested the system.DQ conducted the computational experiments and wrotethe manuscript. All authors approved the final manu-script.

AcknowledgementsWe would like to thank each of the authors of motif prediction tools for making their tools available and for helping us in running and automating tool pipelines including their method. We would like to thank Laura A. Quest for help with the manuscript.

This research project was made possible by the NSF grant number EPS-0091900 and the NIH grant number P20 RR16469 from the INBRE Pro-gram of the National Center for Research Resources.

This article has been published as part of BMC Bioinformatics Volume 9 Sup-plement 9, 2008: Proceedings of the Fifth Annual MCBIOS Conference. Sys-

sSn versus information content divided by number of upstream sequencesFigure 14sSn versus information content divided by number of upstream sequences.



Publish with BioMed Central and every scientist can read your work free of charge

"BioMed Central will be the most significant development for disseminating the results of biomedical research in our lifetime."

Sir Paul Nurse, Cancer Research UK

Your research papers will be:

available free of charge to the entire biomedical community

peer reviewed and published immediately upon acceptance

cited in PubMed and archived on PubMed Central

yours — you keep the copyright

Submit your manuscript here:http://www.biomedcentral.com/info/publishing_adv.asp

BioMedcentral

tems Biology: Bridging the Omics. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/9?issue=S9

References1. Galas DJ, Eggert M, Waterman MS: Rigorous pattern-recognition

methods for DNA sequences. Analysis of promotersequences from Escherichia coli. J Mol Biol 1985,186(1):117-128.

2. Lawrence CE, Altschul SF, Boguski MS, Liu JS, Neuwald AF, WoottonJC: Detecting subtle sequence signals: a Gibbs sampling strat-egy for multiple alignment. Science 1993, 262(5131):208-214.

3. Thompson W, Rouchka E, Lawrence C: Gibbs Recursive Sampler:finding transcription factor binding sites. Nucl Acids Res 2003,31(13):3580-3585.

4. Bailey TL, Elkan C: Fitting a mixture model by expectationmaximization to discover motifs in biopolyme rs. Proc Int ConfIntell Syst Mol Biol 1994, 2:28-36.

5. Hertz G, Stormo G: Identifying DNA and protein patterns withstatistically significant alignments of multiple sequences. Bio-informatics 1999, 15(7):563-577.

6. Hertz GZ, Hartzell GW, Stormo GD: Identification of consensuspatterns in unaligned DNA sequences known to be function-ally related. Comput Appl Biosci 1990, 6(2):81-92.

7. Liu X, Brutlag DL, Liu JS: BioProspector: discovering conservedDNA motifs in upstream regulatory regions of co-expressedgenes. Pac Symp Biocomput 2001:127-138.

8. Hughes JD, Estep PW, Tavazoie S, Church GM: Computationalidentification of cis-regulatory elements associated withgroups of functionally related genes in Saccharomyces cere-visiae. J Mol Biol 2000, 296(5):1205-1214.

9. Workman CT, Stormo GD: ANN-Spec: a method for discover-ing transcription factor binding sites with improved specifi-city. Pac Symp Biocomput 2000:467-478.

10. Frith MC, Hansen U, Spouge JL, Weng Z: Finding functionalsequence elements by multiple local alignme nt. Nucleic AcidsRes 2004, 32(1):189-200.

11. Pavesi G, Mauri G, Pesole G: An algorithm for finding signals ofunknown length in DNA sequences. Bioinformatics 2001,17(Suppl 1):.

12. Pavesi G, Mereghetti P, Mauri G, Pesole G: Weeder Web: discov-ery of transcription factor binding sites in a set of sequencesfrom co-regulated genes. Nucleic Acids Res 2004.

13. Che D, Jensen S, Cai L, Liu J: BEST: Binding-site EstimationSuite of Tools. Bioinformatics 2005, 21(12):2909-2911.

14. Hu J, Yang Y, Kihara D: EMD: an ensemble algorithm for discov-ering regulatory motifs in DNA sequences. BMC Bioinformatics2006, 7:342.

15. Sinha S, Blanchette M, Tompa M: PhyME: a probabilistic algo-rithm for finding motifs in sets of orthologous sequences.BMC Bioinformatics 2004, 5:.

16. Siddharthan R, Siggia ED, van Nimwegen E: PhyloGibbs: A GibbsSampling Motif Finder That Incorporates Phylogeny. PLoSComput Biol 2005, 1(7):.

17. Pavesi G, Zambelli F, Pesole G: WeederH: an algorithm for find-ing conserved regulatory motifs and regions in homologoussequences. BMC Bioinformatics 2007, 8:46.

18. Roven C, Bussemaker HJ: REDUCE: An online tool for inferringcis-regulatory elements and transcriptional module activi-ties from microarray data. Nucleic Acids Res 2003,31(13):3487-3490.

19. De Bie T, Monsieurs P, Engelen K, De Moor B, Cristianini N, MarchalK: Discovering transcriptional modules from motif, chip-chipand microarray data. Pacific Symposium on Biocomputing Pacific Sym-posium on Biocomputing 2005:483-494.

20. Tompa M, Li N, Bailey T, Church G, De Moor B, Eskin E, Favorov A,Frith M, Fu Y, Kent J, et al.: Assessing computational tools for thediscovery of transcription factor binding sites. Nature Biotech-nology 2005, 23(1):137-144.

21. Sandve G, Abul O, Walseng V, Drablos F: Improved benchmarksfor computational motif discovery. BMC Bioinformatics 2007,8:193.

22. Klepper K, Sandve G, Abul O, Johansen J, Drablos F: Assessment ofcomposite motif discovery methods. BMC Bioinformatics 2008,9(1):.

23. Hu J, Li B, Kihara D: Limitations and potentials of current motifdiscovery algorithms. Nucleic Acids Research 2005,33(15):4899-4913.

24. Salgado H, Gama-Castro S, Martinez-Antonio A, Diaz-Peredo E,Sanchez-Solano F, Peralta-Gil M, Garcia-Alonso D, Jimenez-Jacinto V,Santos-Zavaleta A, Bonavides-Martinez C, et al.: RegulonDB (ver-sion 4.0): transcriptional regulation, ope ron organizationand growth conditions in Escherichia coli K-12. Nucleic AcidsRes 2004.

25. Price M, Dehal P, Arkin A: Orthologous Transcription Factors inBacteria Have Different Functions and Regulate DifferentGenes. PLoS Computational Biology 2007, 3(9):e175.

26. Quest D, Dempsey K, Shafiullah M, Bastola D, Ali H: A ParallelArchitecture for Regulatory Motif Algorithm Assessment.Hicomb: 2008 2008.

27. Ishii T, Yoshida K-I, Terai G, Fujita Y, Nakai K: DBTBS: a databaseof Bacillus subtilis promoters and transcription factors. NuclAcids Res 2001, 29(1):278-280.

28. Munch R, Hiller K, Barg H, Heldt D, Linz S, Wingender E, Jahn D:PRODORIC: prokaryotic database of gene regulation.Nucleic Acids Res 2003, 31(1):266-269.

29. Kazakov AE, Cipriano MJ, Novichkov PS, Minovitsky S, VinogradovDV, Arkin A, Mironov AA, Gelfand MS, Dubchak I: RegTransBase– a database of regulatory sequences and inte ractions in awide range of prokaryotic genomes. Nucleic Acids Res 2006.

30. Wingender E, Dietze P, Karas H, Knuppel R: TRANSFAC: a data-base on transcription factors and their DNA binding sites.Nucleic Acids Res 1996, 24(1):238-241.

31. Wall DP, Fraser HB, Hirsh AE: Detecting putative orthologs. Bio-informatics 2003, 19(13):1710-1711.

32. ELPH Manual from the Center for Bioinformatics and Com-putational Biology. 2000.

33. Bailey TL, Williams N, Misleh C, Li WW: MEME: discovering andanalyzing DNA and protein sequence motifs. Nucleic Acids Res2006, 34(Web Server issue):W369-W373.

34. Fawcett T: An introduction to ROC analysis. ROC Analysis in Pat-tern Recognition 2006, 27(8):861-874.

35. Xing EP, Wu W, Jordan MI, Karp RM: Logos: a modular bayesianmodel for de novo motif detection. J Bioinform Comput Biol 2004,2(1):127-154.


http://www.biomedcentral.com/1471-2105/9?issue=S9

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=3908689



































































http://www.biomedcentral.com/info/publishing_adv.asp


Documents

MTAP: The Motif Tool Assessment Platform | SpringerLink