Searching for functional sites in protein structures
Susan Jones* and Janet M Thornton
An ability to assign protein function from protein structure is
important for structural genomics consortia. The complex
relationship between protein fold and function highlights the
necessity of looking beyond the global fold of a protein to specific
functional sites. Many computational methods have been
developed that address this issue. These include evolutionary
trace methods, methods that involve the calculation and
assessment of maximal superpositions, methods based on
graph theory, and methods that apply machine learning
techniques. Such function prediction techniques have been
applied to the identification of enzyme catalytic triads and
DNA-binding motifs.
Addresses
European Bioinformatics Institute, Wellcome Trust Genome Campus,
Hinxton, Cambridge, CB10 1SD, UK
*Correspondence: e-mail: [email protected]
Current Opinion in Chemical Biology 2004, 8:3–7
This review comes from a themed issue on
Proteomics and genomics
Edited by Mark Snyder and John Yates III
1367-5931/$ – see front matter
© 2003 Elsevier Ltd. All rights reserved.
DOI 10.1016/j.cbpa.2003.11.001
Abbreviations
ET Evolutionary Trace
HTH helix-turn-helix
PDB Protein Data Bank
RMSD root-mean-squared deviation
Introduction
The ability to assign function from protein structure is
important for structural genomics consortia, in which
structures are solved for protein sequences that have low
identity to any currently available in the databases. Some
of these consortia aim to fill in the current gaps in globular
protein structure space by targeting specific sequences
[1,2]. Function can, in some cases, be inferred from
structure where there is global fold similarity [3,4]. How-
ever, the more complex problem is to make predictions
about function for protein structures with only distant
relatives or for those with no global fold similarity to any
currently known protein.
The relationship between fold and function is complex.
There is evidence that proteins that exhibit a single fold
can perform many diverse biological functions [5,6] and
conversely that a single function can be achieved by more
than one protein fold [7] (Figure 1). This complex
relationship between fold and function highlights the
necessity of looking beyond the global fold of a protein to
specific sites within it.
Recently, there have been many new methods designed
to make predictions on protein function from local struc-
ture similarities. At one level, such methods can be
divided into those that are based on the derivation of
libraries of motifs from known protein structures against
which new structures can be searched, and those based on
the identification of features such as conservation, residue
propensities, recurrence, etc. At another level, such meth-
ods can be divided into three groups on the basis of the
functional predictions made. The first group includes
those designed to be general, detecting local similarities
indicative of many types of protein function. The other
two groups are more specific, one includes methods to
detect enzyme active sites, and the other methods to
detect DNA-binding sites. This review discusses these
new methods based on this second level of classification.
Predicting functional sites using local similarities
Four methods [8•,9•,10,11••] have recently been developed
to identify areas of local similarity in 3D protein
structures. While all the methods use enzyme active sites
for assessment of predictions, all are independent of a
specific protein function type and have the potential to
identify any new functions shared by proteins.
Similarity between protein structures, whether global or
local, is commonly measured by calculating the
root-mean-squared deviation (RMSD) between the coordinates
of protein residues [12]. The function prediction
method of Stark and co-workers [8•] calculates the
statistical significance of the RMSD between patterns of
residues that are local in 3D space. They use a geometric
model to estimate the significance a priori, without the
need to fit the RMSDs of local similarities to background
data. The significance measure differentiates functionally
significant patterns shared between proteins from patterns
that occur by chance. The method has been applied in a
search for the trypsin catalytic triad in the
Protein Data Bank (PDB), in which a previously
unknown Ser-His-Glu triad in the yeast proteasome
α-subunit was detected.
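As a rough illustration of the RMSD measure itself, the sketch below computes the RMSD between two equal-sized sets of 3D coordinates. It assumes the two sets have already been superposed; it does not perform the optimal superposition or the significance estimate described in [8•]. The function name `rmsd` and the example coordinates are hypothetical.

```python
import math

def rmsd(coords_a, coords_b):
    """RMSD between two equal-length lists of (x, y, z) points.

    Assumption: the two point sets are already superposed and listed in
    corresponding order. No optimal superposition is computed here.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate sets must be non-empty and of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))

# Hypothetical three-point pattern and a copy displaced by 1 unit along x.
pattern = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 4.0, 0.0)]
shifted = [(x + 1.0, y, z) for x, y, z in pattern]

print(rmsd(pattern, shifted))
```

A low RMSD alone does not show that a match is meaningful, which is why a significance estimate of the kind described above is needed.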
Another way to compare protein structures is to use graph
theory to define the protein structure and make comparisons
between pairs of graphs or sub-graphs [13–15].
Wangikar and co-workers [9•] used this theory to represent
a protein structure as a labeled and weighted graph
G(V,E) where the vertices (V) are the functional atoms of
the amino acid side chains, and the edges (E) (where they
exist) are between vertices that are within an interacting
distance. A structural pattern is considered as a sub-
graph. In this way, recurring structural patterns are
detected from structures in the PDB using a backtrack-
ing branch and bound algorithm [16]. The method
detects recurring patterns of functional significance by
applying validated RMSD thresholds and gives an esti-
mate of statistical significance. The method is used to
detect recurring patterns that correlate to known func-
tional sites in 17 protein families, including serine pro-
teases, EF-hands, cupredoxins, ferritins and restriction
endonucleases.
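The graph representation can be sketched as follows, assuming a simple distance cutoff defines an edge. The clique enumerator is a basic form of the Bron and Kerbosch algorithm cited as [16], without pivoting. This illustrates the data structure only and is not the published pipeline of [9•]; the atom labels, coordinates and cutoff are hypothetical.

```python
import math
from itertools import combinations

def build_graph(atoms, cutoff):
    """Vertices are atom indices; an edge joins atoms within `cutoff`.

    atoms: list of (label, (x, y, z)) tuples.
    Returns an adjacency dict mapping index -> set of neighbouring indices.
    """
    adj = {i: set() for i in range(len(atoms))}
    for i, j in combinations(range(len(atoms)), 2):
        if math.dist(atoms[i][1], atoms[j][1]) <= cutoff:
            adj[i].add(j)
            adj[j].add(i)
    return adj

def maximal_cliques(adj):
    """Enumerate maximal cliques with basic Bron-Kerbosch backtracking."""
    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(sorted(current))
            return
        for v in list(candidates):
            expand(current | {v}, candidates & adj[v], excluded & adj[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(adj), set())
    return sorted(cliques)

# Hypothetical functional atoms: three close together, one far away.
atoms = [
    ("SER OG", (0.0, 0.0, 0.0)),
    ("HIS NE2", (3.0, 0.0, 0.0)),
    ("ASP OD1", (0.0, 3.0, 0.0)),
    ("LYS NZ", (20.0, 0.0, 0.0)),
]
graph = build_graph(atoms, cutoff=5.0)
print(maximal_cliques(graph))
```

Here the three mutually close atoms form one clique and the distant atom forms a clique on its own.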
The method developed by Jambon and co-workers [10]
also uses graph theory to make local structure comparisons
but differs from that of Wangikar and co-workers [9•] in
the way in which protein structure is represented. Jambon
and co-workers [10] represent a protein structure as a set
of stereochemical groups that are defined independently
from the concept of the amino acid residue. Each structure
is defined at four levels: atoms, groups of atoms, triangles
formed by chemical groups, and graph vertices. Comparisons
are made between protein structures by comparing graphs of
chemical group triangles using a heuristic algorithm. Local
similarities are detected by searching for common
sub-graphs. The method has been successfully applied to
the detection of catalytic triads in serine proteases and
sugar-binding sites in legume lectins.
The Evolutionary Trace (ET) method [17] ranks the
evolutionary importance of amino acids in a family of
proteins by correlating their variations with evolutionary
divergences. It has been shown in several studies that
amino acids that rank top in such analyses cluster at
functional sites (e.g. [18,19]), and several methods have
utilised the theory in the search for functional sites in
protein structures [20–22]. In a new implementation, Yao
and co-workers [11••] have developed an automated
computational ET method and shown that the overlap
of top-ranking amino acids with known functional sites is
statistically significant. The method is implemented on
datasets of ligand binding sites, enzyme active sites and
structures from structural genomics initiatives. The re-
sults show that this ET method is sensitive and accurate
in its identification of functional sites in proteins and this
implementation makes it potentially scalable for struc-
tural proteomics.
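The ranking idea can be sketched in simplified form. The code below assumes an aligned set of sequences and a series of partitions of the family, ordered from coarse to fine, standing in for successive cuts of a phylogenetic tree. A column's rank is the first partition level at which it is invariant within every group. This is a simplified reading of ET [17], not the automated method of [11••]; the function name, sequences and partitions are hypothetical.

```python
def trace_ranks(sequences, partitions):
    """Rank alignment columns by the partition level at which they become
    invariant within every group.

    sequences: dict mapping sequence name -> aligned sequence (equal lengths).
    partitions: list of groupings (each a list of lists of names), ordered
        from coarse to fine. Assumed here to be supplied by the caller; a
        real analysis would derive them from a phylogenetic tree.
    Returns one rank per column (1 = most important), or None if the column
    never becomes invariant within groups.
    """
    length = len(next(iter(sequences.values())))
    ranks = []
    for col in range(length):
        rank = None
        for level, groups in enumerate(partitions, start=1):
            if all(len({sequences[name][col] for name in group}) == 1
                   for group in groups):
                rank = level
                break
        ranks.append(rank)
    return ranks

# Hypothetical four-sequence alignment with two subfamilies, a and b.
sequences = {"a1": "GKSA", "a2": "GKTA", "b1": "GRSV", "b2": "GRTL"}
partitions = [
    [["a1", "a2", "b1", "b2"]],          # whole family
    [["a1", "a2"], ["b1", "b2"]],        # two subfamilies
    [["a1"], ["a2"], ["b1"], ["b2"]],    # individual sequences
]
print(trace_ranks(sequences, partitions))
```

In this toy alignment the first column is invariant across the family (rank 1) and the second is invariant within each subfamily (rank 2); top-ranked columns would then be mapped onto the structure to look for clusters.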
Predicting enzyme active sites
The identification of active sites in enzymes is one of the
specific areas on which methods for functional annotation
from structure have concentrated. Enzyme active sites
comprise several catalytic residues with specific spatial
arrangements. One much cited example is the Ser-His-
Asp catalytic triad first observed in chymotrypsin [23].
The identification of catalytic triads was first addressed
using graph theory and led to the development of the
ASSAM algorithm [24,25••]. This algorithm has recently
been updated, enabling searches of 3D templates that
include additional specifications such as accessibility,
disulfide bridges and secondary structure [25••]. The
TESS algorithm was also developed for the identification
of catalytic triads, and was used to derive consensus
structural templates in the PROCAT database [26,27].
This method has been recently updated in the form of
the JESS algorithm [28•], a method for constraint-based
template searching of 3D protein structures. JESS is a
flexible algorithm, unconstrained by template syntax and
semantics. The algorithm allows the search of proteins
for small groups of atoms, and includes an empirical
approach to the normalisation of scores, which gives a
means of judging the significance of matches. This
algorithm has been implemented in the search of enzyme
active sites in the PDB using templates in the PROCAT
database [28•].
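Template searching can be illustrated with a deliberately naive sketch: every ordered selection of atoms is tested against the residue types and pairwise distances of a template. This brute-force search is a stand-in for illustration only; it is neither the geometric hashing of TESS [26] nor the constraint-based search of JESS [28•]. The function name, template, structure and tolerance are all hypothetical.

```python
import math
from itertools import permutations

def match_template(template, atoms, tolerance):
    """Find atom selections whose types and pairwise distances fit a template.

    template, atoms: lists of (residue_type, (x, y, z)) tuples.
    Returns a list of index tuples into `atoms`, one per match.
    Brute force: cost grows quickly with structure size.
    """
    n = len(template)
    hits = []
    for combo in permutations(range(len(atoms)), n):
        # Constraint 1: residue types must agree position by position.
        if any(atoms[i][0] != template[k][0] for k, i in enumerate(combo)):
            continue
        # Constraint 2: every pairwise distance must agree within tolerance.
        fits = True
        for a in range(n):
            for b in range(a + 1, n):
                d_template = math.dist(template[a][1], template[b][1])
                d_found = math.dist(atoms[combo[a]][1], atoms[combo[b]][1])
                if abs(d_template - d_found) > tolerance:
                    fits = False
        if fits:
            hits.append(combo)
    return hits

# Hypothetical Ser-His-Asp template and a small structure containing a
# translated copy of it, plus unrelated atoms.
template = [("SER", (0.0, 0.0, 0.0)), ("HIS", (3.0, 0.0, 0.0)),
            ("ASP", (3.0, 4.0, 0.0))]
structure = [("GLY", (9.0, 9.0, 9.0)), ("SER", (10.0, 10.0, 10.0)),
             ("HIS", (13.0, 10.0, 10.0)), ("ASP", (13.0, 14.0, 10.0)),
             ("SER", (30.0, 30.0, 30.0))]
print(match_template(template, structure, tolerance=0.5))
```

Because only internal distances are compared, the match is independent of where the site sits in the structure; a practical search would also need a score normalisation such as the one JESS provides.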
The detection of enzyme active sites from 3D protein
structure has also been addressed in a new method that
uses a neural network and clustering [29]. In this work, a
neural network is used to score amino acids by the like-
lihood that they are catalytic. A detailed analysis of
catalytic residues has recently been conducted [30], from
which it was found that catalytic residues have several
common characteristics, including solvent accessibility,
secondary structure type, residue type and conservation.
The experimentally validated dataset of catalytic residues
used in this analysis and their characteristics provide the
basis for training the neural network. The method has
been applied to five recently solved enzyme structures for
which the method correctly identifies the putative active
site in each and identifies some potentially novel func-
tional groups [29].
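The two stages, scoring then clustering, can be sketched as below. A single logistic unit stands in for the trained neural network, and single-linkage grouping by distance stands in for the spatial clustering. The weights, bias, features, coordinates and cutoffs are hypothetical and not fitted to any data, so this shows only the shape of the approach in [29].

```python
import math

def logistic_score(features, weights, bias):
    """Score in (0, 1) from a single logistic unit (a stand-in for a network)."""
    z = bias + sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def cluster(points, cutoff):
    """Single-linkage clustering: points within `cutoff` share a cluster.

    Returns a list of clusters, each a sorted list of point indices.
    """
    unvisited = set(range(len(points)))
    clusters = []
    while unvisited:
        seed = min(unvisited)
        unvisited.remove(seed)
        stack, members = [seed], [seed]
        while stack:
            i = stack.pop()
            near = [j for j in unvisited
                    if math.dist(points[i], points[j]) <= cutoff]
            for j in near:
                unvisited.remove(j)
                stack.append(j)
                members.append(j)
        clusters.append(sorted(members))
    return clusters

# Hypothetical residues: (name, (conservation, accessibility), coordinates).
residues = [
    ("R1", (0.90, 0.20), (0.0, 0.0, 0.0)),
    ("R2", (0.95, 0.10), (4.0, 0.0, 0.0)),
    ("R3", (0.20, 0.80), (2.0, 2.0, 0.0)),
    ("R4", (0.85, 0.30), (30.0, 0.0, 0.0)),
]
WEIGHTS, BIAS = (4.0, -1.0), -2.0  # illustrative values, not trained

likely = [r for r in residues if logistic_score(r[1], WEIGHTS, BIAS) > 0.5]
groups = cluster([r[2] for r in likely], cutoff=6.0)
largest = [likely[i][0] for i in max(groups, key=len)]
print(largest)
```

Clustering discards isolated high-scoring residues such as R4 and keeps the spatially coherent group as the putative site.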
Figure 1. The complex relationship between protein fold and function.
(a) It has been shown that fold and function rarely have a one-to-one
relationship. What is commonly seen in structures from the PDB is
(b) one fold having several different functions, or (c) one function
being conducted by proteins with several different folds.
[Schematic, panels (a) to (c), relating Fold to Function; image not reproduced.]
Predicting DNA-binding sites
Proteins that bind DNA are predicted to make up 6–8% of
all eukaryotic genomes [31], and there are currently 3D
structures for 694 proteins bound to DNA molecules in the
Nucleic Acid Database (NDB) [32]. Hence, the prediction
of nucleic acid binding sites in proteins is another key
area in function prediction, and several methods have
recently been devised. Two of these methods are based
on detecting a small structural motif that is commonly used
by proteins to bind DNA [33,34]. A third method uses
characteristics of positively charged electrostatic patches
on the protein surface to make predictions [35].
The helix-turn-helix (HTH) motif is one of the most
common motifs used by proteins to bind DNA, being
found in approximately one-third of all DNA binding
proteins [36]. This motif was used as a prototype for two
different methods aimed at identifying the location of
DNA binding sites on proteins. In the first method [33] a
structural template library of seven HTH motifs was
created from non-homologous DNA-binding proteins in
the PDB [37]. The templates were used to scan complete
protein structures using an algorithm that calculated the
RMSD for the optimal superposition of each template on
each structure. Distributions of RMSD values for known
HTH-containing proteins and non-HTH proteins were
analysed and a threshold value calculated below which a
structure was predicted to contain a DNA-binding HTH
motif. These motif templates were shown to be generic,
matching motifs across different fold families. The
second HTH prediction method implements a machine
learning technique based on a series of key structural
features of the 3D motif [34]. These features include a
high average solvent-accessibility of residues within the
recognition motif and a conserved hydrophobic interac-
tion between the recognition helix and the second helix
preceding it. Hence, the method uses structural features
of the protein beyond that of the motif. The method is
used to identify true cases of DNA-binding HTH motifs
within proteins in the PDB to a high degree of accuracy.
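The template-scanning step of the first method can be sketched as a sliding window with a threshold. The published method [33] uses the RMSD after optimal superposition; the sketch below swaps in a distance RMSD, which compares internal distances and so needs no superposition. The function names, coordinates and threshold are hypothetical.

```python
import math
from itertools import combinations

def distance_rmsd(a, b):
    """RMSD over all pairwise internal distances of two point sets.

    A superposition-free stand-in for RMSD after optimal superposition.
    """
    pairs = list(combinations(range(len(a)), 2))
    total = sum((math.dist(a[i], a[j]) - math.dist(b[i], b[j])) ** 2
                for i, j in pairs)
    return math.sqrt(total / len(pairs))

def scan(template, chain, threshold):
    """Slide the template along the chain; report windows under threshold.

    template, chain: lists of (x, y, z) points, e.g. C-alpha positions.
    Returns a list of (start_index, score) tuples.
    """
    n = len(template)
    hits = []
    for start in range(len(chain) - n + 1):
        score = distance_rmsd(template, chain[start:start + n])
        if score < threshold:
            hits.append((start, score))
    return hits

# Hypothetical four-point template and a chain holding a translated copy
# of it at positions 2 to 5.
template = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 0.0)]
chain = ([(-20.0, 0.0, 0.0), (-10.0, 5.0, 0.0)]
         + [(x + 10.0, y + 10.0, z + 10.0) for x, y, z in template]
         + [(50.0, 50.0, 50.0)])
hits = scan(template, chain, threshold=1.0)
print([start for start, _ in hits])
```

In practice the threshold would be chosen, as in [33], from the distributions of scores for known HTH-containing and non-HTH proteins.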
The observation that DNA binding sites tend to be the
most positive electrostatic patches on a protein’s surface
is the basis for the third method of DNA binding-site
prediction [35]. Stawiski et al. [35] have presented an
automated method that uses a combination of features
derived for positive electrostatic patches on the protein
surface. The method uses a neural network to discrimi-
nate between DNA-binding and non-DNA-binding
positive electrostatic patches, using 12 sequence and
structural features. These features include hydrogen
bonding potential, amino acid composition, surface con-
cavity and sequence conservation. The method has been
applied to a large dataset of proteins from the PDB and
predicts DNA binding proteins with a high degree of
accuracy, and is capable of predicting those with novel
DNA binding motifs.
Annotating protein functional sites on the Web
Many functional site prediction methods have been
implemented as servers or databases on the Web, making
them easily accessible to the structural genomics con-
sortia to whom they are of most relevance.
WebFEATURE (http://feature.stanford.edu/webfeature/)
is a tool to identify and visualise functional sites on
protein structures [38]. This is the web resource of
FEATURE [39], a supervised learning algorithm for
automated discovery of physical and chemical descriptions of
protein microenvironments. The website implements the
scanning algorithm and scoring function enabling func-
tional predictions to be made.
LGA (Local-Global Alignment)
(http://predictioncenter.llnl.gov/local/lga) is a method
for finding 3D similarities in
protein structures [40]. The algorithm generates different
local superpositions between pairs of structures to detect
fragments where the structures are similar. The algorithm
allows clustering of similar fragments and the use of such
clusters to identify sequence patterns that would repre-
sent local structural motifs.
PINTS (Patterns in Non-homologous Tertiary Structures)
(http://pints.embl.de) [41] is the web implementation
of a method discussed above [8•]. This resource enables
the user to conduct database
searches for common local structural patterns in proteins,
and provides a measure of statistical significance to any
similarity detected.
Two further resources are those that address the predic-
tion of DNA-binding site motifs. The first can be found at
http://www.ebi.ac.uk/thornton-srv/databases/DNA-motifs
[33] and provides a means of scanning protein structures
for the presence of the HTH motif. The second is
‘predictdnahth’ (http://predictdnahth.rutgers.edu/) [34], which
takes a complete protein structure and makes accurate
predictions for the presence of DNA-binding HTH
motifs.
Conclusions
The number of methods to predict the location of functional
sites in newly solved protein structures is steadily
growing, mirroring the growth in the number of protein
structures solved by the structural genomics consortia.
Several methods are independent of a specific protein
function, but these and other specific methods have been
implemented for the identification of enzyme active sites.
The identification of nucleic acid binding sites in proteins
is the second area in which functional annotation methods
have seen recent developments.
Functional prediction methods such as those discussed in
this review are rarely used in isolation. The integration of
theoretical and experimental methods is a more common
goal. A recent paper by Sanishvili et al. [42•] describes
an integrated approach for the assignment of function
to a target structure solved by the Midwest Centre for
Structural Genomics (www.mcsg.anl.gov). In this work,
a strategy is described that integrates structural data,
bioinformatics techniques and experimental screening
for the assignment of enzyme activity for the structure
of Escherichia coli BioH.
The integration of experimental and theoretical techni-
ques for functional annotation would be further en-
hanced if there were a central server that provided
access to more than one theoretical method. This is
the aim of ProFunc (Laskowski RA, personal commu-
nication). This application is a web server that integrates
several theoretical prediction techniques that are applied
in a pipeline to protein structures submitted by users.
This and other similar sites that provide effective
integration of function prediction methods will be of growing
importance as the activities of the structural genomics
consortia gather pace.
Acknowledgements
SJ was supported by a US Department of Energy Grant CDE-FG02-96ER62166.
References and recommended reading
Papers of particular interest, published within the annual period of review, have been highlighted as:
• of special interest
•• of outstanding interest
1. Burley SK: An overview of structural genomics. Nat Struct Biol 2000, 7:932-934.
2. Westbrook J, Feng ZK, Chen L, Yang HW, Berman HM: The Protein Data Bank and structural genomics. Nucl Acid Res 2003, 31:489-491.
3. Dietmann S, Holm L: Identification of homology in protein structure classification. Nat Struct Biol 2001, 8:953-957.
4. Orengo CA, Jones D, Thornton JM: Protein superfamilies and domain superfolds. Nature 1994, 372:631-634.
5. Todd AE, Orengo CA, Thornton JM: Evolution of protein function, from a structural perspective. Curr Opin Chem Biol 1999, 3:548-556.
6. Thornton JM, Todd AE, Milburn D, Borkakoti N, Orengo CA: From structure to function: approaches and limitations. Nat Struct Biol 2000, 7:991-994.
7. Martin AC, Orengo CA, Hutchinson EG, Jones S, Karmirantzou M, Laskowski RA, Mitchell JB, Taroni C, Thornton JM: Protein folds and functions. Structure 1998, 6:875-884.
8.• Stark A, Sunyaev S, Russell RB: A model for statistical significance of local similarities in structure. J Mol Biol 2003, 326:1307-1316.
A new method to calculate the statistical significance of the RMSD between common local patterns in protein structures.
9.• Wangikar PP, Tendulkar AV, Ramya S, Mail DN, Sarawagi S: Functional sites in protein families uncovered via an objective and automated graph theoretic approach. J Mol Biol 2003, 326:955-978.
A new prediction approach using graph theory that identifies recurring side-chain patterns in protein structures and includes an empirical calculation of statistical significance.
10. Jambon M, Imberty A, Deleage G, Geourjon C: A new bioinformatic approach to detect common 3D sites in protein structures. Proteins 2003, 52:137-145.
11.•• Yao H, Kristensen DM, Mihalek I, Sowa ME, Shaw C, Kimmel M, Kavraki L, Lichtarge O: An accurate, sensitive, and scalable method to identify functional sites in protein structures. J Mol Biol 2003, 326:255-261.
An automated ET method that ranks the evolutionary importance of amino acids in protein sequences. This is the first method to quantify the significance of the overlap observed between the best ranked residues and functional sites.
12. Cohen F, Sternberg M: On the prediction of protein structure: the significance of root-mean-square deviation. J Mol Biol 1980, 138:321-333.
13. Mitchell EM, Artymiuk PJ, Rice DW, Willett P: Use of techniques derived from graph-theory to compare secondary structure motifs in proteins. J Mol Biol 1990, 212:151-166.
14. Artymiuk P, Poirette A, Grindley H, Rice D, Willett P: A graph-theoretic approach to the identification of three-dimensional patterns of amino acid side-chains in protein structures. J Mol Biol 1994, 243:327-344.
15. Artymiuk PJ, Bath PA, Grindley HM, Pepperrell CA, Poirrette AR, Rice DW, Thorner DA, Wild DJ, Willett P, Allen FH, Taylor R: Similarity searching in databases of 3-dimensional molecules and macromolecules. J Chem Info Computer Sci 1992, 32:617-630.
16. Bron C, Kerbosch J: Algorithm 457: finding all cliques of an undirected graph. Commun ACM 1973, 16:575-577.
17. Lichtarge O, Bourne H, Cohen F: Evolutionary Trace method defines binding surfaces common to protein families. J Mol Biol 1996, 257:342-358.
18. Lichtarge O, Yamamoto K, Cohen F: Identification of functional surfaces of the zinc binding domains of intracellular receptors. J Mol Biol 1997, 274:325-337.
19. Pritchard L, Dufton MJ: Evolutionary trace analysis of the Kunitz/BPTI family of proteins: functional divergence may have been based on conformational adjustment. J Mol Biol 1999, 285:1589-1607.
20. Armon A, Graur D, Ben-Tal N: ConSurf: an algorithmic tool for the identification of functional regions in proteins by surface mapping of phylogenetic information. J Mol Biol 2001, 307:447-463.
21. Hannenhalli SS, Russell RB: Analysis and prediction of functional subtypes from protein sequence alignments. J Mol Biol 2000, 303:61-76.
22. Madabushi S, Yao H, Marsh M, Kristensen D, Philippi A, Sowa ME, Lichtarge O: Structural clusters of Evolutionary Trace residues are statistically significant and common in proteins. J Mol Biol 2002, 316:139-154.
23. Blow D, Birktoft J, Hartley B: Role of buried acid group in the mechanism of action of chymotrypsin. Nature 1969, 221:337-340.
24. Artymiuk PJ, Poirrette AR, Grindley HM, Rice DW, Willett P: A graph-theoretic approach to the identification of 3-dimensional patterns of amino-acid side-chains in protein structures. J Mol Biol 1994, 243:327-344.
25.•• Spriggs RV, Artymiuk PJ, Willett P: Searching for patterns of amino acids in 3D protein structures. J Chem Inform Comp Sci 2003, 43:412-421.
Presents the background and an important update on the 10 year development of the ASSAM programme, which searches for patterns of amino acid side chains in protein structures using a pseudo-atom representation, and the Ullman sub-graph isomorphism algorithm.
26. Wallace AC, Borkakoti N, Thornton JM: TESS: a geometric hashing algorithm for deriving 3D coordinate templates for searching structural databases. Application to enzyme active sites. Protein Sci 1997, 6:2308-2323.
27. Wallace AC, Laskowski RA, Thornton JM: Derivation of 3D coordinate templates for searching structural databases: application to Ser-His-Asp catalytic triads in the serine proteinases and lipases. Protein Sci 1996, 5:1001-1013.
28.• Barker JA, Thornton JM: An algorithm for constraint-based structural template matching: application to 3D templates with statistical analysis. Bioinformatics 2003, 19:1644-1649.
Presents JESS, an algorithm for searching protein structures for small groups of atoms that includes an empirical measure of significance. This algorithm is different in that it is designed as a flexible core around which to build constraint-based template search methods.
29. Gutteridge A, Bartlett GJ, Thornton JM: Using a neural network and spatial clustering to predict the location of active sites in enzymes. J Mol Biol 2003, 330:719-734.
30. Bartlett GJ, Porter CT, Borkakoti N, Thornton JM: Analysis of catalytic residues in enzyme active sites. J Mol Biol 2002, 324:105-121.
31. Luscombe NM, Greenbaum D, Gerstein M: What is bioinformatics? A proposed definition and overview of the field. Methods Informat Med 2001, 40:346-358.
32. Berman HM, Olson WK, Beveridge DL, Westbrook J, Gelbin A, Demeny T, Hsieh SH, Srinivasan AR, Schneider B: The nucleic-acid database - a comprehensive relational database of 3-dimensional structures of nucleic-acids. Biophys J 1992, 63:751-759.
33. Jones S, Barker JA, Nobeli I, Thornton JM: Using structural motif templates to identify proteins with DNA binding function. Nucl Acid Res 2003, 31:2811-2823.
34. McLaughlin WA, Berman HM: Statistical models for discerning protein structures containing the DNA-binding helix-turn-helix motif. J Mol Biol 2003, 330:43-55.
35. Stawiski EW, Gregoret LM, Mandel-Gutfreund Y: Annotating nucleic acid-binding function based on protein structure. J Mol Biol 2003, 326:1065-1079.
36. Luscombe NM, Austin SE, Berman HM, Thornton JM: An overview of the structure of protein-DNA complexes. Genome Biol 2000, 1:1-37.
37. Berman HM, Battistuz T, Bhat TN, Bluhm WF, Bourne PE, Burkhardt K, Iype L, Jain S, Fagan P, Marvin J et al.: The Protein Data Bank. Acta Cryst D 2002, 58:899-907.
38. Liang MP, Banatao DR, Klein TE, Brutlag DL, Altman RB: WebFEATURE: an interactive web tool for identifying and visualizing functional sites on macromolecular structures. Nucl Acid Res 2003, 31:3324-3327.
39. Bagley SC, Altman RB: Characterizing the microenvironment surrounding protein sites. Prot Sci 1995, 4:622-635.
40. Zemla A: LGA: a method for finding 3D similarities in protein structures. Nucl Acid Res 2003, 31:3370-3374.
41. Stark A, Russell RB: Annotation in three dimensions. PINTS: patterns in non-homologous tertiary structures. Nucl Acids Res 2003, 31:3341-3344.
42.• Sanishvili R, Yakunin AF, Laskowski RA, Skarina T, Evdokimova E, Doherty-Kirby A, Lajoie GA, Thornton JM, Arrowsmith CH, Savchenko A et al.: Integrating structure, bioinformatics, and enzymology to discover function - BioH, a new carboxylesterase from Escherichia coli. J Biol Chem 2003, 278:26039-26045.
One of the first papers to report on the assignment of function using a combination of theoretical and experimental techniques for a protein structure determined by a structural genomics consortium.