How accurate and statistically robust are catalytic site predictions based on closeness centrality?


  • BioMed Central, BMC Bioinformatics

    Open Access Research article

    How accurate and statistically robust are catalytic site predictions based on closeness centrality?

    Eric Chea1 and Dennis R Livesay*2

    Address: 1Department of Biological Sciences, California State Polytechnic University, Pomona, CA 91768, USA and 2Department of Computer Science and Bioinformatics Research Center, University of North Carolina at Charlotte, Charlotte, NC 28223, USA

    Email: Eric Chea -; Dennis R Livesay* -

    * Corresponding author

    Abstract

    Background: We examine the accuracy of enzyme catalytic residue predictions from a network representation of protein structure. In this model, amino acid α-carbons specify vertices within a graph and edges connect vertices that are proximal in structure. Closeness centrality, which has shown promise in previous investigations, is used to identify important positions within the network. Closeness centrality, a global measure of network centrality, is calculated as the reciprocal of the average distance between vertex i and all other vertices.

    Results: We benchmark the approach against 283 structurally unique proteins within the Catalytic Site Atlas. Our results, which are in line with previous investigations of smaller datasets, indicate that closeness centrality predictions are statistically significant. However, unlike previous approaches, we specifically focus on residues with the very best scores. Over the top five closeness centrality scores, we observe an average true to false positive rate ratio of 6.8 to 1. As demonstrated previously, adding a solvent accessibility filter significantly improves predictive power; the average ratio is increased to 15.3 to 1. We also demonstrate (for the first time) that filtering the predictions by residue identity improves the results even more than accessibility filtering. Here, we simply eliminate from consideration residues whose physicochemical properties are unlikely to be compatible with catalytic requirements. Residue identity filtering improves the average true to false positive rate ratio to 26.3 to 1. Combining the two filters has little effect on the results. Calculated p-values for the three prediction schemes range from 2.7E-9 to less than 8.8E-134. Finally, the sensitivity of the predictions to structure choice and slight perturbations is examined.

    Conclusion: Our results confirm that closeness centrality is a viable prediction scheme whose predictions are statistically significant. Simple filtering schemes substantially improve the method's predictive power. Moreover, no clear effect on performance is observed when comparing ligated and unligated structures. Similarly, the CC prediction results are robust to slight structural perturbations from molecular dynamics simulation.

    Published: 11 May 2007

    BMC Bioinformatics 2007, 8:153 doi:10.1186/1471-2105-8-153

    Received: 11 December 2006
    Accepted: 11 May 2007

    This article is available from:

    © 2007 Chea and Livesay; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

    Background

    The accurate and robust prediction of protein functional sites from sequence and/or structure remains an open problem in bioinformatics [1]. Despite the limitations of current methodologies, several sequence and structure-based approaches have recently become popular [2]. Most of these approaches rely on an underlying multiple sequence alignment and attempt to uncover some type of feature conservation therein [3] (i.e. residues that are conserved across the alignment [4-6]). Arguably, evolutionary tracing has become the most widely used method for computational prediction of protein functional sites [7]. The evolutionary trace (ET) approach begins with an alignment and corresponding phylogeny. The method searches for all alignment positions that recapitulate the overall phylogeny. While ET is fundamentally a sequence-based scheme, the standard application of the approach uses structural clusters of trace residues to identify functional regions [8-10]. Several other related methods that rely on an underlying alignment plus representative structure have proven useful as well [11-14]. Conversely, we have introduced a phylogenetic motif-based method that is similar in spirit to ET, although it is specifically designed to rely solely on sequence information [15-17].

    The literature also contains a host of functional site prediction strategies that are explicitly designed to not rely on a phylogeny [18]. These approaches are useful when too few sequences are available to generate a representative description of familial diversity. While their theoretical foundations vary considerably, most rely solely on structure or a structure + alignment combination. For example, Gutteridge et al. recently developed a neural network approach to predict catalytic sites [19]. Catalytic sites are defined by residues directly involved in the enzyme-mediated reaction mechanism, which generally constitute a subset of all functional residues. The neural network input of Gutteridge et al. includes both structural and alignment descriptors, and is able to correctly predict the active site in >69% of the cases examined. The ability to rigorously benchmark the approach is based on comprehensive databasing and exhaustive manual curation of catalytic residues from the literature [20] by the same group. This tour de force has led to the Catalytic Site Atlas (CSA) [21], which contains approximately 600 different proteins with experimentally validated catalytic residues.

    Other common catalytic site prediction methods are based on Poisson-Boltzmann continuum electrostatic theory [22]. Elcock has observed that functional residues tend to have increased electrostatic strain energy [23], meaning that stabilization occurs on mutation. While the approach utilizes sophisticated Poisson-Boltzmann continuum theory, the underlying rationale is based on straightforward evolutionary arguments. The naïve description of protein evolution is that nature solely optimizes structural stability at each residue. However, catalytic and other important residues have functional constraints imposed upon them, meaning that while mutation might be stabilizing, it can occur at the expense of functional proficiency. The detangling of stability and functional evolutionary pressures is examined more thoroughly by Cheng et al. using all-atom protein design [24]. Analogous to the electrostatic strain energy approach, the THEMATICS approach uses Poisson-Boltzmann-based pKa calculations to look for residue titration curves that do not follow Henderson-Hasselbalch behavior [25]. The method looks for titration curves of partially charged residues that are flat over a wide pH range. Similarly, we have demonstrated that a large pKa shift from the null model (aqueous) value can be indicative of catalytic residues [26,27]. However, the prediction accuracy of this approach is lessened because many structurally important residues (e.g. residues involved in a salt bridge) also have significant pKa shifts.

    Network models have also been used with success in predicting protein functional and/or catalytic residues. Instead of representing protein structures as a Cartesian collection of atoms, network models recast protein structures as topological graphs [28-31]. The most common of these methods are based on protein structure contact maps, where each vertex of the graph represents an α-carbon and edges connect vertices within some distance cutoff (generally 6-9 Å). Once the graph is complete, a variety of topological metrics can be used to predict functional residues from it, including centrality [32,33], valency [32] and sub-graph conservation [34]. Despite growing consensus concerning the utility of these methods, a robust assessment of their prediction accuracy remains to be completed. Amitai et al. [32], Thibert et al. [33] and del Sol et al. [35] examine the ability of residue centrality to predict catalytic and/or functional sites within datasets of 178, 128 and 46 proteins, respectively. The results from these studies are encouraging. Moreover, they show that combining centrality with other metrics improves predictive power. For example, Amitai et al. demonstrate that combining centrality with solvent accessibility substantially improves accuracy, whereas both Amitai et al. and Thibert et al. demonstrate that including residue conservation improves results.

    In this report, we investigate the accuracy and statistical significance of closeness centrality (CC) functional residue predictions; CC has previously been shown to be the best of several different network centrality scores (e.g. valency and betweenness) [32,33]. Primarily, our investigation is based on SCOP [36] superfamily-filtered protein chains (which represent 283 unique SCOP superfamilies) from the CSA. Based on observed accuracies, CC is demonstrated to be a viable prediction scheme. Our results are in line with previous investigations, but are more significant due to dataset size and composition, since we control for structural redundancy. A second distinction of this work is that instead of focusing on the entire range of true to false positive rates, as done by previous investigations, we concentrate on the very best CC scores. By focusing only on the top five scoring residues, we are able to evaluate the ability of the model to provide a reasonable number of experimentally testable predictions. In all cases, our predictions correspond to false positive rates below 1.6%. The performance of the method is improved substantially by considering only residues that are not completely inaccessible to solvent. We further demonstrate that filtering the predictions based solely on amino acid identity improves predictive power even more than filtering by solvent accessibility.
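The top-five evaluation described above can be illustrated with a minimal Python sketch. It assumes per-residue scores are already available for one protein, and it computes the true and false positive rates obtained when the k highest-scoring residues are predicted to be catalytic. The function name `top_k_rates` and the `keep` predicate are hypothetical; `keep` merely stands in for a filter such as solvent accessibility or residue identity, whose exact criteria are not given in this excerpt. How the ratio is averaged over the dataset is likewise not specified here, so the sketch handles a single protein only.

```python
def top_k_rates(scores, catalytic, k=5, keep=None):
    """Return (TPR, FPR) when the k highest-scoring residues are predicted catalytic.

    scores    : dict mapping residue id -> score (higher means more central)
    catalytic : set of residue ids annotated as catalytic
    keep      : optional predicate (hypothetical filter hook); residues for
                which it returns False are never predicted
    """
    residues = set(scores)
    positives = catalytic & residues
    negatives = residues - catalytic

    # Rank the residues that survive the optional filter, best score first.
    candidates = [r for r in scores if keep is None or keep(r)]
    ranked = sorted(candidates, key=lambda r: scores[r], reverse=True)
    predicted = set(ranked[:k])

    tp = len(predicted & positives)
    fp = len(predicted & negatives)
    tpr = tp / len(positives) if positives else 0.0
    fpr = fp / len(negatives) if negatives else 0.0
    return tpr, fpr


# Toy data: ten residues whose score equals their index; residues 5 and 9 are catalytic.
toy_scores = {i: float(i) for i in range(10)}
toy_catalytic = {5, 9}
unfiltered = top_k_rates(toy_scores, toy_catalytic, k=2)
filtered = top_k_rates(toy_scores, toy_catalytic, k=2, keep=lambda r: r not in {6, 7, 8})
```

In the toy data the unfiltered top two contain one catalytic and one noncatalytic residue, whereas the filtered ranking promotes the second catalytic residue, which mirrors the qualitative effect of filtering reported in the text.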

    Theoretical background

    Throughout this report, the vertices within each graph correspond to α-carbons. Edges connect two α-carbons within 8.5 Å of each other. While slightly less complicated than methods based on all-atom pair distances, the simpler model results in a noticeable computational speedup that is significant when analyzing a dataset the size of ours. A cursory comparison of the two networks indicates that the resultant predictions are qualitatively similar (results not shown). The common threshold of 8.5 Å is used because it best approximates the average sidechain size. Closeness centrality (CC), a global centrality metric, is used to determine how critical each vertex (residue) is in maintaining the small-world behavior of the graph. CC is calculated by:

    $$CC_i = \frac{N_p - 1}{\sum_{j \neq i} L_{ij}} \qquad (1)$$

    where $N_p$ is the total number of vertices in the graph and $L_{ij}$ is the shortest path (geodesic distance) between vertices i and j. The shortest path is simply the minimum of all possible paths between residues i and j. As is normally done in protein structure networks, edges are not weighted, making the shortest path simply an integer count of the number of edges separating i and j. It should be noted that $N_p$ (a constant within each protein) has no effect on our observed results since we are only using CC to rank the residues, meaning the inverse of the shortest path sum solely establishes which residues are ultimately predicted. Nevertheless, we employ CC here to be consistent with previous investigations.
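As an illustration of the definition above, the following Python sketch builds an unweighted contact graph from α-carbon coordinates and scores each vertex by closeness centrality using breadth-first search. It is a minimal sketch under stated assumptions, not the authors' implementation: the normalization (Np - 1) divided by the sum of shortest path lengths is assumed from the verbal definition (reciprocal of the average distance to all other vertices), the function names are hypothetical, and a vertex that cannot reach every other vertex is arbitrarily given a score of zero.

```python
import math
from collections import deque

CUTOFF = 8.5  # angstroms; the C-alpha contact threshold quoted in the text


def contact_graph(coords, cutoff=CUTOFF):
    """Adjacency list linking residues whose C-alpha atoms lie within the cutoff."""
    n = len(coords)
    adj = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                adj[i].append(j)
                adj[j].append(i)
    return adj


def closeness_centrality(adj):
    """Score each vertex as (N_p - 1) / sum_j L_ij, with L_ij counted in edges."""
    n = len(adj)
    scores = []
    for source in range(n):
        # Breadth-first search gives shortest path lengths in an unweighted graph.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total = sum(dist.values())
        # Assumption: vertices in a disconnected graph (or a lone vertex) score 0.
        reachable_all = len(dist) == n and total > 0
        scores.append((n - 1) / total if reachable_all else 0.0)
    return scores


# Toy example: three collinear pseudo C-alpha positions spaced 5 angstroms apart,
# so only neighbouring positions are connected.
toy_coords = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
toy_scores = closeness_centrality(contact_graph(toy_coords))
```

In the toy chain the middle position is one edge from both ends and therefore receives the highest score, while the two ends tie. Because only the ranking is used for prediction, dropping the constant numerator would not change which residues are selected.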

    Results and discussion

    Probability density functions

    Mapping CC to structure clearly indicates that residues with high centralities are most likely to occur within the protein core. As is the case in the three examples shown in Fig. 1, catalytic residues frequently do not correspond to the most central residues. Nevertheless, Fig. 2a indicates that there is clear discrimination between the CC probability density functions (PDFs) of catalytic and noncatalytic residues. The data plotted in Fig. 2 are taken from 283 structurally unique protein chains, meaning no two proteins from a single SCOP superfamily are included. This translates to 96,280 noncatalytic residues and 844 catalytic residues. The PDFs describing datasets parsed by SCOP family (423 proteins) and 80% pairwise sequence identity (568 proteins) are virtually identical to those shown. The average CC values for the catalytic and noncatalytic residues are 0.19 and 0.16, respectively. While Fig. 2a suggests that the most extreme CC scores are not likely to be catalytic, catalytic residues are, on average, more central than noncatalytic residues. A two-sample t-test confirms that the discrimination between the means is statistically significant (t = 2.0; p = 1.6E-73; sample size = 7,372). Nevertheless, there is appreciable overlap (59.5%) between the two PDFs.
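To make the notion of overlap between two PDFs concrete, the following Python sketch estimates it from two samples of scores by binning both samples on a common grid and summing the bin-wise minima of the normalized histograms. This is only one common definition of overlap and is an assumption here: the excerpt does not state how the reported overlap percentages were computed, and the function name, bin count and score range are hypothetical choices.

```python
def histogram_overlap(sample_a, sample_b, bins=20, lo=0.0, hi=1.0):
    """Overlap of two normalized histograms: the sum of bin-wise minima (0 to 1)."""
    width = (hi - lo) / bins

    def normalized_hist(sample):
        counts = [0] * bins
        for x in sample:
            if lo <= x <= hi:
                # Values equal to the upper bound fall into the last bin.
                idx = min(int((x - lo) / width), bins - 1)
                counts[idx] += 1
        total = sum(counts)
        if total == 0:
            raise ValueError("sample has no values inside [lo, hi]")
        return [c / total for c in counts]

    pa = normalized_hist(sample_a)
    pb = normalized_hist(sample_b)
    return sum(min(a, b) for a, b in zip(pa, pb))


# Toy samples: half of each sample shares a bin with the other sample.
toy_a = [0.1, 0.1, 0.3, 0.3]
toy_b = [0.3, 0.3, 0.5, 0.5]
toy_overlap = histogram_overlap(toy_a, toy_b, bins=5)
```

An overlap of 1 means the two binned distributions are identical and 0 means they share no bins, so the value depends on the chosen bin width.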

    Going further, Fig. 2b compares the PDFs of residues from three accessibility levels to the catalytic residue PDF. The three accessibility levels roughly correspond to the third most buried, middle third and third most exposed residues within the parsed dataset. At each accessibility level, the catalytic residue PDF has a statistically significant increase in its mean value (see Table 1). As discussed above, this result is slightly counterintuitive because the most buried (and thus, most central) residues frequently are not catalytic. Rather, this result demonstrates that catalytic residues are, on average, more central than the top third most buried residues. Again, this result confirms the earlier observations of Amitai et al. Yet, caution should be exercised when drawing far-reaching conclusions based on this analysis due to the considerable overlap between the distributions. This is especially true in the case of the buried residues, whose PDF has 85% overlap with the catalytic PDF.

    Assessing prediction accuracy of top closeness centrality scores

    As stated above, several investigations have examined the prediction accuracy of global centrality metrics; however, none of the previous investigations are on the scale of this report. Nor have any rigorously controlled for structural redundancy as we do here. Of the previous reports, the largest dataset investigated is 178 proteins [32], which (unlike ours) contained redundant structural folds. Previous investigations use Receiver Operating Characteristic (ROC) plots to examine the balance between true and false positive rates over the entire relationship continuum. A false positive rate greater than 9% is commonly considered; Thiber...

