How accurate and statistically robust are catalytic site predictions based on closeness centrality?



  • BioMed Central, BMC Bioinformatics

    Open Access Research article

    Eric Chea1 and Dennis R Livesay*2

    Address: 1Department of Biological Sciences, California State Polytechnic University, Pomona, CA 91768, USA and 2Department of Computer Science and Bioinformatics Research Center, University of North Carolina at Charlotte, Charlotte, NC 28223, USA

    Email: Eric Chea - erc2009@med.cornell.edu; Dennis R Livesay* - drlivesa@uncc.edu

    * Corresponding author

    Abstract

    Background: We examine the accuracy of enzyme catalytic residue predictions from a network representation of protein structure. In this model, amino acid α-carbons specify vertices within a graph and edges connect vertices that are proximal in structure. Closeness centrality, which has shown promise in previous investigations, is used to identify important positions within the network. Closeness centrality, a global measure of network centrality, is calculated as the reciprocal of the average distance between vertex i and all other vertices.

    Results: We benchmark the approach against 283 structurally unique proteins within the Catalytic Site Atlas. Our results, which are in line with previous investigations of smaller datasets, indicate that closeness centrality predictions are statistically significant. However, unlike previous approaches, we specifically focus on residues with the very best scores. Over the top five closeness centrality scores, we observe an average true to false positive rate ratio of 6.8 to 1. As demonstrated previously, adding a solvent accessibility filter significantly improves predictive power; the average ratio is increased to 15.3 to 1. We also demonstrate (for the first time) that filtering the predictions by residue identity improves the results even more than accessibility filtering. Here, we simply eliminate from consideration residues with physicochemical properties unlikely to be compatible with catalytic requirements. Residue identity filtering improves the average true to false positive rate ratio to 26.3 to 1. Combining the two filters has little effect on the results. Calculated p-values for the three prediction schemes range from 2.7E-9 to less than 8.8E-134. Finally, the sensitivity of the predictions to structure choice and slight perturbations is examined.

    Conclusion: Our results resolutely confirm that closeness centrality is a viable prediction scheme whose predictions are statistically significant. Simple filtering schemes substantially improve the method's predictive power. Moreover, no clear effect on performance is observed when comparing ligated and unligated structures. Similarly, the CC prediction results are robust to slight structural perturbations from molecular dynamics simulation.
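The network model and scoring described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the 8.5 Å contact cutoff, the use of unweighted hop counts as the distance, and the way the true to false positive rate ratio is computed are all assumptions made for illustration. The sketch builds a contact graph from α-carbon coordinates, scores each vertex by the reciprocal of its average shortest-path distance to all other vertices, and returns the top-scoring residues.

```python
from collections import deque
from math import dist


def contact_graph(coords, cutoff=8.5):
    """Adjacency lists: one vertex per alpha-carbon, with an edge between
    every pair closer than `cutoff` angstroms (cutoff value is an assumption)."""
    n = len(coords)
    adj = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if dist(coords[i], coords[j]) <= cutoff:
                adj[i].append(j)
                adj[j].append(i)
    return adj


def closeness_centrality(adj):
    """CC(i) = (n - 1) / sum of shortest-path hop counts from i to all others,
    i.e. the reciprocal of the average distance. Vertices that cannot reach
    the whole graph get 0.0 (a simplification for disconnected graphs)."""
    n = len(adj)
    scores = []
    for source in range(n):
        hops = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from `source`
            u = queue.popleft()
            for v in adj[u]:
                if v not in hops:
                    hops[v] = hops[u] + 1
                    queue.append(v)
        total = sum(hops.values())
        scores.append((n - 1) / total if len(hops) == n and total > 0 else 0.0)
    return scores


def top_k(scores, k=5):
    """Indices of the k highest-scoring residues."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def tpr_fpr_ratio(predicted, catalytic, n_residues):
    """Ratio of true positive rate to false positive rate for one protein
    (one plausible reading of the reported ratio; an assumption)."""
    predicted, catalytic = set(predicted), set(catalytic)
    tp = len(predicted & catalytic)
    fp = len(predicted - catalytic)
    fn = len(catalytic - predicted)
    tn = n_residues - tp - fp - fn
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return tpr / fpr if fpr else float("inf")


# Toy example: five hypothetical residues spaced 4 angstroms apart on a line.
toy_coords = [(4.0 * i, 0.0, 0.0) for i in range(5)]
toy_scores = closeness_centrality(contact_graph(toy_coords, cutoff=5.0))
print(top_k(toy_scores, k=2))
```

With real data the coordinates would come from the α-carbon records of a structure file, and the catalytic residue set from an annotation source such as the Catalytic Site Atlas; neither loader is shown here.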

    Published: 11 May 2007

    BMC Bioinformatics 2007, 8:153 doi:10.1186/1471-2105-8-153

    Received: 11 December 2006; Accepted: 11 May 2007

    This article is available from: http://www.biomedcentral.com/1471-2105/8/153

    © 2007 Chea and Livesay; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


    Background

    The accurate and robust prediction of protein functional sites from sequence and/or structure remains an open problem in bioinformatics [1]. Despite the limitations of current methodologies, several sequence and structure-based approaches have recently become popular [2]. Most of these approaches rely on an underlying multiple sequence alignment and attempt to uncover some type of feature conservation therein [3] (i.e. residues that are conserved across the alignment [4-6]). Arguably, evolutionary tracing has become the most widely used method for computational prediction of protein functional sites [7]. The evolutionary trace (ET) approach begins with an alignment and corresponding phylogeny. The method searches for all alignment positions that recapitulate the overall phylogeny. While ET is fundamentally a sequence-based scheme, the standard application of the approach uses structural clusters of trace residues to identify functional regions [8-10]. Several other related methods that rely on an underlying alignment plus representative structure have proven useful as well [11-14]. Conversely, we have introduced a phylogenetic motif-based method that is similar in spirit to ET, although it is specifically designed to rely solely on sequence information [15-17].

    The literature also contains a host of functional site prediction strategies that are explicitly designed to not rely on a phylogeny [18]. These approaches are useful when too few sequences are available to generate a representative description of familial diversity. While their theoretical foundations vary considerably, most rely solely on structure or a structure + alignment combination. For example, Gutteridge et al. recently developed a neural network approach to predict catalytic sites [19]. Catalytic sites are defined by residues directly involved in the enzyme-mediated reaction mechanism, which generally constitute a subset of all functional residues. The neural network input of Gutteridge et al. includes both structural and alignment descriptors, and is able to correctly predict the active site in >69% of the cases examined. The ability to rigorously benchmark the approach is based on comprehensive databasing and exhaustive manual curation of catalytic residues from the literature [20] by the same group. This tour de force has led to the Catalytic Site Atlas (CSA) [21], which contains approximately 600 different proteins with experimentally validated catalytic residues.

    Other common catalytic site prediction methods are based on Poisson-Boltzmann continuum electrostatic theory [22]. Elcock has observed that functional residues tend to have increased electrostatic strain energy [23], meaning that stabilization occurs on mutation. While the approach utilizes sophisticated Poisson-Boltzmann continuum theory, the underlying rationale is based on straightforward evolutionary arguments. The naïve description of protein evolution is that nature solely optimizes structural stability at each residue. However, catalytic and other important residues have functional constraints imposed upon them, meaning that while mutation might be stabilizing, it can occur at the expense of functional proficiency. The detangling of stability and functional evolutionary pressures is examined more thoroughly by Cheng et al. using all-atom protein design [24]. Analogous to the electrostatic strain energy approach, the THEMATICS approach uses Poisson-Boltzmann-based pKa calculations to look for residue titration curves that do not follow Henderson-Hasselbalch behavior [25]. The method looks for titration curves of partially charged residues that are flat over a wide pH range. Similarly, we have demonstrated that a large pKa shift from the null model (aqueous) value can be indicative of catalytic residues [26,27]. However, the prediction accuracy of this approach is lessened because many structurally important residues (e.g. residues involved in a salt bridge) also have significant pKa shifts.

    Network models have also been used with success in predicting protein functional and/or catalytic residues. Instead of representing protein structures as a Cartesian collection of atoms, network models recast protein structures as topological graphs [28-31]. The most common of these methods are based on protein structure contact maps, where each vertex of the graph represents an α-carbon and edges connect vertices within some distance cutoff (generally 6–9 Å). Once the graph is complete, a variety of topological metrics can be used to predict functional residues from it, including centrality [32,33], valency [32] and sub-graph conservation [34]. Despite growing consensus concerning the utility of these methods, a robust assessment of their prediction accuracy remains to be completed. Amitai et al. [32], Thibert et al. [33] and del Sol et al. [35] examine the ability of residue centrality to predict catalytic and/or functional sites within datasets of 178, 128 and 46 proteins, respectively. The results from these studies are encouraging. Moreover, they show that combining centrality with other metrics improves predictive power. For example, Amitai et al. demonstrate that combining centrality with solvent accessibility substantially improves accuracy, whereas both Amitai et al. and Thibert et al. demonstrate that including residue conservation improves results.

    In this report, we investigate the accuracy and statistical significance of closeness centrality (CC) functional residue predictions; CC has previously been shown to be the best of several different network centrality scores (e.g. valency and betweenness) [32,33]. Primarily, our investigation is based on SCOP [36] superfamily-filtered protein chains (which represent 283 unique SCOP superfamilies) from the CSA. Based on observed accuracies, CC is demonstrated to be a viable prediction scheme. Our results are in line with previous investigations, but are more significant due to dataset size and composition since we control for structural redundancy. A second distinction of this work is that instead of focusing on the entire range of true to false positive rates, as done by previous investigations, we concentrate on the very best CC scores. By focusing only on the top five scoring