Searching for functional sites in protein structures

Susan Jones* and Janet M Thornton

An ability to assign protein function from protein structure is

important for structural genomics consortia. The complex

relationship between protein fold and function highlights the

necessity of looking beyond the global fold of a protein to specific

functional sites. Many computational methods have been

developed that address this issue. These include evolutionary

trace methods, methods that involve the calculation and

assessment of maximal superpositions, methods based on

graph theory, and methods that apply machine learning

techniques. Such function prediction techniques have been

applied to the identification of enzyme catalytic triads and

DNA-binding motifs.

Addresses

European Bioinformatics Institute, Wellcome Trust Genome Campus,

Hinxton, Cambridge, CB10 1SD, UK

*Correspondence: e-mail: [email protected]

Current Opinion in Chemical Biology 2004, 8:3–7

This review comes from a themed issue on

Proteomics and genomics

Edited by Mark Snyder and John Yates III

1367-5931/$ – see front matter

© 2003 Elsevier Ltd. All rights reserved.

DOI 10.1016/j.cbpa.2003.11.001

Abbreviations

ET Evolutionary Trace

HTH helix-turn-helix

PDB Protein Data Bank

RMSD root-mean-squared deviation

Introduction

The ability to assign function from protein structure is

important for structural genomics consortia, which solve the

structures of proteins whose sequences have low

identity to any currently available in the databases. Some

of these consortia aim to fill in the current gaps in globular

protein structure space by targeting specific sequences

[1,2]. Function can, in some cases, be inferred from

structure where there is global fold similarity [3,4]. How-

ever, the more complex problem is to make predictions

about function for protein structures with only distant

relatives or for those with no global fold similarity to any

currently known protein.

The relationship between fold and function is complex.

There is evidence that proteins that exhibit a single fold

can perform many diverse biological functions [5,6] and

conversely that a single function can be achieved by more

than one protein fold [7] (Figure 1). This complex rela-

tionship between fold and function highlights the neces-

sity of looking beyond the global fold of a protein to

specific sites within it.

Recently, there have been many new methods designed

to make predictions on protein function from local struc-

ture similarities. At one level, such methods can be

divided into those that are based on the derivation of

libraries of motifs from known protein structures against

which new structures can be searched, and those based on

the identification of features such as conservation, residue

propensities, recurrence, etc. At another level, such meth-

ods can be divided into three groups on the basis of the

functional predictions made. The first group includes

those designed to be general, detecting local similarities

indicative of many types of protein function. The other

two groups are more specific, one includes methods to

detect enzyme active sites, and the other methods to

detect DNA-binding sites. This review discusses these

new methods based on this second level of classification.

Predicting functional sites using local similarities

Four methods [8•,9•,10,11••] have recently been developed

to identify areas of local similarity in 3D protein

structures. While all the methods use enzyme active sites

for assessment of predictions, all are independent of a

specific protein function type and have the potential to

identify any new functions shared by proteins.

Similarity between protein structures, whether it be

global or local, is commonly measured by calculating

root-mean-squared deviation (RMSD) from the coordi-

nates of protein residues [12]. The function prediction

method by Stark and co-workers [8•] uses a statistical

method to calculate the significance of the RMSD calculated

between patterns of residues that are local in 3D space. They use

a geometric model to estimate the significance a priori,

without the necessity to fit the RMSDs of local similarities

to background data. The significance measure

allows the differentiation of true functionally significant

patterns shared between proteins from patterns that occur

just by chance. The method has been successfully imple-

mented in a search for the trypsin catalytic triad in the

Protein Data Bank (PDB), in which a previously

unknown Ser-His-Glu triad in the yeast proteasome α-subunit

was detected.
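
The RMSD measure that underlies these comparisons can be sketched in a few lines. This is a minimal illustration only: it assumes the two sets of atoms have already been paired and optimally superposed, it does not reproduce the significance model of [8•], and the coordinates are hypothetical.

```python
import math

def rmsd(coords_a, coords_b):
    """RMSD between two equal-length lists of (x, y, z) points paired by index.

    Assumes the two point sets are already superposed; no fitting is done here.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))

# Hypothetical coordinates (in angstroms) for three matched atoms in two sites.
site_a = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 4.0, 0.0)]
site_b = [(0.0, 0.0, 1.0), (3.0, 0.0, 1.0), (0.0, 4.0, 1.0)]
print(rmsd(site_a, site_b))  # every atom is displaced by 1 angstrom, so this prints 1.0
```

A low RMSD alone does not indicate functional similarity, which is why a significance estimate of the kind described above is needed for small patterns.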

Another way to compare protein structures is to use graph

theory to define the protein structure and make comparisons

between pairs of graphs or sub-graphs [13–15].


Wangikar and co-workers [9•] used this theory to represent

a protein structure as a labeled and weighted graph

G(V,E) where the vertices (V) are the functional atoms of

the amino acid side chains, and the edges (E) (where they

exist) are between vertices that are within an interacting

distance. A structural pattern is considered as a sub-

graph. In this way, recurring structural patterns are

detected from structures in the PDB using a backtrack-

ing branch and bound algorithm [16]. The method

detects recurring patterns of functional significance by

applying validated RMSD thresholds and gives an esti-

mate of statistical significance. The method is used to

detect recurring patterns that correlate to known func-

tional sites in 17 protein families, including serine pro-

teases, EF-hands, cupredoxins, ferritins and restriction

endonucleases.
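
The graph representation G(V,E) described above can be illustrated with a short sketch. The atom labels, coordinates and distance cutoff below are hypothetical, and the use of the inter-atomic distance as the edge weight is an assumption; the pattern-detection step of [9•] is not reproduced.

```python
import math
from itertools import combinations

def build_structure_graph(atoms, cutoff):
    """Build a labeled, weighted graph from functional atoms.

    atoms:  list of (label, (x, y, z)) tuples, one per functional atom.
    cutoff: assumed interaction distance; an edge exists only at or below it.
    Returns (vertices, edges), where edges maps an index pair to its distance.
    """
    vertices = [label for label, _ in atoms]
    edges = {}
    for (i, a), (j, b) in combinations(enumerate(atoms), 2):
        distance = math.dist(a[1], b[1])
        if distance <= cutoff:
            edges[(i, j)] = distance
    return vertices, edges

# Hypothetical functional atoms; the coordinates are invented for illustration.
atoms = [
    ("SER:OG", (0.0, 0.0, 0.0)),
    ("HIS:NE2", (3.0, 0.0, 0.0)),
    ("ASP:OD1", (3.0, 4.0, 0.0)),
]
vertices, edges = build_structure_graph(atoms, cutoff=4.5)
print(vertices)
print(edges)  # the Ser-Asp pair is 5.0 apart, so it gets no edge
```

A structural pattern then corresponds to a sub-graph of this graph, and recurring patterns are sub-graphs found in many structures.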

The method developed by Jambon and co-workers [10]

also uses graph theory to make local structure comparisons

but differs from that of Wangikar [9•] in the way in

which protein structure is represented. Jambon [10]

represents a protein structure as a set of stereochemical

groups that are defined independently from the concept

of the amino acid residue. Each structure is defined at four

levels: atoms, groups of atoms, triangles formed by chem-

ical groups, and graph vertices. Comparisons are made

between protein structures by comparing graphs of chem-

ical group triangles using a heuristic algorithm. Local

similarities are detected by searching for common sub-

graphs. The method has been successfully implemented

in the detection of catalytic triads in serine proteases and

sugar binding sites in legume lectins.

The Evolutionary Trace (ET) method [17] ranks the

evolutionary importance of amino acids in a family of

proteins by correlating their variations with evolutionary

divergences. It has been shown in several studies that

amino acids that rank top in such analyses cluster at

functional sites (e.g. [18,19]), and several methods have

utilised the theory in the search for functional sites in

protein structures [20–22]. In a new implementation, Yao

and co-workers [11••] have developed an automated

computational ET method and show that the overlap

of top ranking amino acids with known functional sites is

statistically significant. The method is implemented on

datasets of ligand binding sites, enzyme active sites and

structures from structural genomics initiatives. The re-

sults show that this ET method is sensitive and accurate

in its identification of functional sites in proteins and this

implementation makes it potentially scalable for struc-

tural proteomics.
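
One way to judge whether such an overlap exceeds chance is a hypergeometric tail probability. The sketch below is an illustration under that assumption; this review does not state which statistical test is used in [11••], and the residue counts in the example are hypothetical.

```python
from math import comb

def overlap_p_value(n_residues, n_site, n_top, n_overlap):
    """Probability of at least n_overlap functional residues among n_top
    residues drawn at random from a protein of n_residues residues, of which
    n_site belong to the known functional site (hypergeometric upper tail).
    """
    total = comb(n_residues, n_top)
    upper = min(n_site, n_top)
    tail = sum(
        comb(n_site, k) * comb(n_residues - n_site, n_top - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

# Hypothetical example: 200 residues, 12 in the known site, and 8 of the
# 20 top-ranked residues fall inside that site.
p = overlap_p_value(n_residues=200, n_site=12, n_top=20, n_overlap=8)
print(p)
```

A small value indicates that the top-ranked residues coincide with the functional site more often than a random selection of the same size would.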

Predicting enzyme active sites

The identification of active sites in enzymes is one of the

specific areas on which methods for functional annotation from

structure have concentrated. Enzyme active sites

comprise several catalytic residues with specific spatial

arrangements. One much cited example is the Ser-His-

Asp catalytic triad first observed in chymotrypsin [23].

The identification of catalytic triads was first addressed

using graph theory and led to the development of the

ASSAM algorithm [24,25••]. This algorithm has recently

been updated, enabling searches of 3D templates that

include additional specifications such as accessibility,

disulfide bridges and secondary structure [25••]. The

TESS algorithm was also developed for the identification

of catalytic triads, and was used to derive consensus

structural templates in the PROCAT database [26,27].

This method has been recently updated in the form of

the JESS algorithm [28•], a method for constraint-based

template searching of 3D protein structures. JESS is a

flexible algorithm, unconstrained by template syntax and

semantics. The algorithm allows the search of proteins

for small groups of atoms, and includes an empirical

approach to the normalisation of scores, which gives a

means of judging the significance of matches. This

algorithm has been implemented in the search of enzyme

active sites in the PDB using templates in the PROCAT

database [28•].
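
The idea of searching a structure with a 3D template can be sketched as a brute-force distance-constraint search. This is not the ASSAM, TESS or JESS algorithm, which use sub-graph isomorphism, geometric hashing and constraint-based matching respectively; the single-point residue representation, template distances and tolerance below are all hypothetical.

```python
import math
from itertools import product

def find_template_matches(residues, template_types, template_distances, tolerance):
    """Enumerate residue combinations that satisfy a distance template.

    residues:           list of (residue_type, residue_id, (x, y, z)).
    template_types:     residue type required at each template position.
    template_distances: maps a pair of template positions to a target distance.
    tolerance:          allowed absolute deviation from each target distance.
    """
    # Candidate residues for each template position, selected by type.
    candidates = [[r for r in residues if r[0] == t] for t in template_types]
    matches = []
    for combo in product(*candidates):
        ids = [r[1] for r in combo]
        if len(set(ids)) != len(ids):
            continue  # the same residue cannot fill two template positions
        if all(
            abs(math.dist(combo[i][2], combo[j][2]) - target) <= tolerance
            for (i, j), target in template_distances.items()
        ):
            matches.append(tuple(ids))
    return matches

# Hypothetical structure: one representative point per residue.
residues = [
    ("SER", 195, (0.0, 0.0, 0.0)),
    ("HIS", 57, (3.0, 0.0, 0.0)),
    ("ASP", 102, (3.0, 4.0, 0.0)),
    ("SER", 214, (20.0, 0.0, 0.0)),
]
triad_types = ["SER", "HIS", "ASP"]
triad_distances = {(0, 1): 3.0, (1, 2): 4.0, (0, 2): 5.0}  # invented values
hits = find_template_matches(residues, triad_types, triad_distances, tolerance=0.5)
print(hits)  # [(195, 57, 102)]
```

Exhaustive enumeration scales poorly with structure size, which is one reason the published methods rely on more efficient search strategies and on score normalisation to judge matches.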

The detection of enzyme active sites from 3D protein

structure has also been addressed in a new method that

uses a neural network and clustering [29]. In this work, a

neural network is used to score amino acids by the like-

lihood that they are catalytic. A detailed analysis of

catalytic residues has recently been conducted [30], from

which it was found that catalytic residues have several

common characteristics, including solvent accessibility,

secondary structure type, residue type and conservation.

The experimentally validated dataset of catalytic residues

used in this analysis and their characteristics provide the

basis for training the neural network. The method has

been applied to five recently solved enzyme structures for

which the method correctly identifies the putative active

site in each and identifies some potentially novel func-

tional groups [29].

Figure 1. The complex relationship between protein fold and function. (a) It has been shown that fold and function rarely have a one-to-one relationship. What is commonly seen in structures from the PDB is (b) one fold having several different functions, or (c) one function being conducted by proteins with several different folds. [Figure image not reproduced; its panels (a), (b) and (c) connect entries labelled 1–5 and A–D under the headings 'Fold' and 'Function'.]


Predicting DNA-binding sites

Proteins that bind DNA are predicted to make up 6–8% of

all eukaryotic genomes [31], and there are currently 3D

structures for 694 proteins bound to DNA molecules in the

Nucleic Acid Database (NDB) [32]. Hence, the prediction

of nucleic acid binding sites in proteins is another key

area in function prediction, and several methods have

recently been devised. Two of these methods are based

on detecting a small structural motif that is commonly used

by proteins to bind DNA [33,34]. A third method uses

characteristics of positively charged electrostatic patches

on the protein surface to make predictions [35].

The helix-turn-helix (HTH) motif is one of the most

common motifs used by proteins to bind DNA, being

found in approximately one-third of all DNA binding

proteins [36]. This motif was used as a prototype for two

different methods aimed at identifying the location of

DNA binding sites on proteins. In the first method [33] a

structural template library of seven HTH motifs was

created from non-homologous DNA-binding proteins in

the PDB [37]. The templates were used to scan complete

protein structures using an algorithm that calculated the

RMSD for the optimal superposition of each template on

each structure. Distributions of RMSD values for known

HTH-containing proteins and non-HTH proteins were

analysed and a threshold value calculated below which a

structure was predicted to contain a DNA-binding HTH

motif. These motif templates were shown to be generic,

matching motifs across different fold families. The

second HTH prediction method implements a machine

learning technique based on a series of key structural

features of the 3D motif [34]. These features include a

high average solvent-accessibility of residues within the

recognition motif and a conserved hydrophobic interac-

tion between the recognition helix and the second helix

preceding it. Hence, the method uses structural features

of the protein beyond those of the motif. The method is

used to identify true cases of DNA-binding HTH motifs

within proteins in the PDB to a high degree of accuracy.
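
The threshold step of the template-based method can be illustrated as follows. This sketch assumes that the threshold is chosen to maximise classification accuracy over the two RMSD distributions; the review does not state how the published threshold in [33] was derived, and the RMSD values are invented.

```python
def best_rmsd_threshold(hth_rmsds, non_hth_rmsds):
    """Pick the RMSD threshold that best separates two sets of scores.

    A structure is predicted to contain the motif when its best template
    RMSD is at or below the threshold. Returns (threshold, accuracy); on
    ties the lowest threshold is kept.
    """
    candidates = sorted(set(hth_rmsds) | set(non_hth_rmsds))
    total = len(hth_rmsds) + len(non_hth_rmsds)
    best = None
    for threshold in candidates:
        correct = sum(r <= threshold for r in hth_rmsds)
        correct += sum(r > threshold for r in non_hth_rmsds)
        accuracy = correct / total
        if best is None or accuracy > best[1]:
            best = (threshold, accuracy)
    return best

# Hypothetical best-match RMSDs (in angstroms) for the two classes.
hth = [0.8, 1.0, 1.2, 1.5]
non_hth = [1.4, 2.0, 2.5, 3.0]
print(best_rmsd_threshold(hth, non_hth))  # (1.2, 0.875)
```

Where the two distributions overlap, as in this example, any threshold trades false positives against false negatives, so the choice depends on which error matters more for annotation.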

The observation that DNA binding sites tend to be the

most positive electrostatic patches on a protein’s surface

is the basis for the third method of DNA binding-site

prediction [35]. Stawiski et al. [35] have presented an

automated method that uses a combination of features

derived for positive electrostatic patches on the protein

surface. The method uses a neural network to discrimi-

nate between DNA-binding and non-DNA-binding

positive electrostatic patches, using 12 sequence and

structural features. These features include hydrogen

bonding potential, amino acid composition, surface con-

cavity and sequence conservation. The method has been

applied to a large dataset of proteins from the PDB and

predicts DNA binding proteins with a high degree of

accuracy, and is capable of predicting those with novel

DNA binding motifs.

Annotating protein functional sites on the Web

Many functional site prediction methods have been

implemented as servers or databases on the Web, making

them easily accessible to the structural genomics con-

sortia to whom they are of most relevance.

WebFEATURE (http://feature.stanford.edu/webfeature/)

is a tool to identify and visualise functional sites on

protein structures [38]. This is the web resource of

FEATURE [39], a supervised learning algorithm for automated

discovery of physical and chemical descriptions of

protein microenvironments. The website implements the

scanning algorithm and scoring function enabling func-

tional predictions to be made.

LGA (Local-Global Alignment) (http://predictioncenter.llnl.gov/local/lga)

is a method for finding 3D similarities in

protein structures [40]. The algorithm generates different

local superpositions between pairs of structures to detect

fragments where the structures are similar. The algorithm

allows clustering of similar fragments and the use of such

clusters to identify sequence patterns that would repre-

sent local structural motifs.

PINTS (Patterns in Non-homologous Tertiary Structures)

(http://pints.embl.de) [41] is the web implementation

of a method discussed in an earlier section [8•].

This resource enables the user to conduct database

searches for common local structural patterns in proteins,

and provides a measure of statistical significance to any

similarity detected.

Two further resources are those that address the predic-

tion of DNA-binding site motifs. The first can be found at

http://www.ebi.ac.uk/thornton-srv/databases/DNA-motifs

[33] and provides a means of scanning protein structures

for the presence of the HTH motif. The second is

‘predictdnahth’ (http://predictdnahth.rutgers.edu/) [34], which

takes a complete protein structure and makes accurate

predictions for the presence of DNA-binding HTH

motifs.

Conclusions

The number of methods to predict the location of functional

sites in newly solved protein structures is steadily

growing, mirroring the growth in the number of protein

structures solved by the structural genomics consortia.

Several methods are independent of a specific protein

function, but these and other specific methods have been

implemented for the identification of enzyme active sites.

The identification of nucleic acid binding sites in proteins

is the second area in which functional annotation methods

have seen recent developments.

Functional prediction methods such as those discussed in

this review are rarely used in isolation. The integration of


theoretical and experimental methods is a more common

goal. A recent paper by Sanishvili et al. [42•] describes

an integrated approach for the assignment of function

to a target structure solved by the Midwest Centre for

Structural Genomics (www.mcsg.anl.gov). In this work,

a strategy is described that integrates structural data,

bioinformatics techniques and experimental screening

for the assignment of enzyme activity for the structure

of Escherichia coli BioH.

The integration of experimental and theoretical techni-

ques for functional annotation would be further en-

hanced if there were a central server that provided

access to more than one theoretical method. This is

the aim of ProFunc (Laskowski RA, personal commu-

nication). This application is a web server that integrates

several theoretical prediction techniques that are applied

in a pipeline to protein structures submitted by users.

This and other similar sites that provide effective integration

of function prediction methods will be of growing

importance as the activities of the structural genomics

consortia gather pace.

Acknowledgements

SJ was supported by a US Department of Energy Grant CDE-FG02-96ER62166.

References and recommended reading

Papers of particular interest, published within the annual period of review, have been highlighted as:

• of special interest

•• of outstanding interest

1. Burley SK: An overview of structural genomics. Nat Struct Biol 2000, 7:932-934.

2. Westbrook J, Feng ZK, Chen L, Yang HW, Berman HM: The Protein Data Bank and structural genomics. Nucl Acid Res 2003, 31:489-491.

3. Dietmann S, Holm L: Identification of homology in protein structure classification. Nat Struct Biol 2001, 8:953-957.

4. Orengo CA, Jones D, Thornton JM: Protein superfamilies and domain superfolds. Nature 1994, 372:631-634.

5. Todd AE, Orengo CA, Thornton JM: Evolution of protein function, from a structural perspective. Curr Opin Chem Biol 1999, 3:548-556.

6. Thornton JM, Todd AE, Milburn D, Borkakoti N, Orengo CA: From structure to function: approaches and limitations. Nat Struct Biol 2000, 7:991-994.

7. Martin AC, Orengo CA, Hutchinson EG, Jones S, Karmirantzou M, Laskowski RA, Mitchell JB, Taroni C, Thornton JM: Protein folds and functions. Structure 1998, 6:875-884.

8. • Stark A, Sunyaev S, Russell RB: A model for statistical significance of local similarities in structure. J Mol Biol 2003, 326:1307-1316.

A new method to calculate the statistical significance of the RMSD between common local patterns in protein structures.

9. • Wangikar PP, Tendulkar AV, Ramya S, Mail DN, Sarawagi S: Functional sites in protein families uncovered via an objective and automated graph theoretic approach. J Mol Biol 2003, 326:955-978.

A new prediction approach using graph theory that identifies recurring side-chain patterns in protein structures and includes an empirical calculation of statistical significance.

10. Jambon M, Imberty A, Deleage G, Geourjon C: A new bioinformatic approach to detect common 3D sites in protein structures. Proteins 2003, 52:137-145.

11. •• Yao H, Kristensen DM, Mihalek I, Sowa ME, Shaw C, Kimmel M, Kavraki L, Lichtarge O: An accurate, sensitive, and scalable method to identify functional sites in protein structures. J Mol Biol 2003, 326:255-261.

An automated ET method that ranks the evolutionary importance of amino acids in protein sequences. This is the first method to quantify the significance of the overlap observed between the best ranked residues and functional sites.

12. Cohen F, Sternberg M: On the prediction of protein structure: the significance of root-mean-square deviation. J Mol Biol 1980, 138:321-333.

13. Mitchell EM, Artymiuk PJ, Rice DW, Willett P: Use of techniques derived from graph-theory to compare secondary structure motifs in proteins. J Mol Biol 1990, 212:151-166.

14. Artymiuk P, Poirette A, Grindley H, Rice D, Willett P: A graph-theoretic approach to the identification of three-dimensional patterns of amino acid side-chains in protein structures. J Mol Biol 1994, 243:327-344.

15. Artymiuk PJ, Bath PA, Grindley HM, Pepperrell CA, Poirrette AR, Rice DW, Thorner DA, Wild DJ, Willett P, Allen FH, Taylor R: Similarity searching in databases of 3-dimensional molecules and macromolecules. J Chem Info Computer Sci 1992, 32:617-630.

16. Bron C, Kerbosch J: Algorithm 457: finding all cliques of an undirected graph. Commun ACM 1973, 16:575-577.

17. Lichtarge O, Bourne H, Cohen F: Evolutionary Trace method defines binding surfaces common to protein families. J Mol Biol 1996, 257:342-358.

18. Lichtarge O, Yamamoto K, Cohen F: Identification of functional surfaces of the zinc binding domains of intracellular receptors. J Mol Biol 1997, 274:325-337.

19. Pritchard L, Dufton MJ: Evolutionary trace analysis of the Kunitz/BPTI family of proteins: functional divergence may have been based on conformational adjustment. J Mol Biol 1999, 285:1589-1607.

20. Armon A, Graur D, Ben-Tal N: ConSurf: an algorithmic tool for the identification of functional regions in proteins by surface mapping of phylogenetic information. J Mol Biol 2001, 307:447-463.

21. Hannenhalli SS, Russell RB: Analysis and prediction of functional subtypes from protein sequence alignments. J Mol Biol 2000, 303:61-76.

22. Madabushi S, Yao H, Marsh M, Kristensen D, Philippi A, Sowa ME, Lichtarge O: Structural clusters of evolutionary trace residues are statistically significant and common in proteins. J Mol Biol 2002, 316:139-154.

23. Blow D, Birktoft J, Hartley B: Role of buried acid group in the mechanism of action of chymotrypsin. Nature 1969, 221:337-340.

24. Artymiuk PJ, Poirrette AR, Grindley HM, Rice DW, Willett P: A graph-theoretic approach to the identification of 3-dimensional patterns of amino-acid side-chains in protein structures. J Mol Biol 1994, 243:327-344.

25. •• Spriggs RV, Artymiuk PJ, Willett P: Searching for patterns of amino acids in 3D protein structures. J Chem Inform Comp Sci 2003, 43:412-421.

Presents the background and an important update on the 10 year development of the ASSAM programme, which searches for patterns of amino acid side chains in protein structures using a pseudo-atom representation, and the Ullman sub-graph isomorphism algorithm.

26. Wallace AC, Borkakoti N, Thornton JM: TESS: a geometric hashing algorithm for deriving 3D coordinate templates for searching structural databases. Application to enzyme active sites. Protein Sci 1997, 6:2308-2323.

27. Wallace AC, Laskowski RA, Thornton JM: Derivation of 3D coordinate templates for searching structural databases: application to Ser-His-Asp catalytic triads in the serine proteinases and lipases. Protein Sci 1996, 5:1001-1013.

28. • Barker JA, Thornton JM: An algorithm for constraint-based structural template matching: application to 3D templates with statistical analysis. Bioinformatics 2003, 19:1644-1649.

Presents JESS, an algorithm for searching protein structures for small groups of atoms that includes an empirical measure of significance. This algorithm is different in that it is designed as a flexible core around which to build constraint-based template search methods.

29. Gutteridge A, Bartlett GJ, Thornton JM: Using a neural network and spatial clustering to predict the location of active sites in enzymes. J Mol Biol 2003, 330:719-734.

30. Bartlett GJ, Porter CT, Borkakoti N, Thornton JM: Analysis of catalytic residues in enzyme active sites. J Mol Biol 2002, 324:105-121.

31. Luscombe NM, Greenbaum D, Gerstein M: What is bioinformatics? A proposed definition and overview of the field. Methods Informat Med 2001, 40:346-358.

32. Berman HM, Olson WK, Beveridge DL, Westbrook J, Gelbin A, Demeny T, Hsieh SH, Srinivasan AR, Schneider B: The nucleic-acid database - a comprehensive relational database of 3-dimensional structures of nucleic-acids. Biophys J 1992, 63:751-759.

33. Jones S, Barker JA, Nobeli I, Thornton JM: Using structural motif templates to identify proteins with DNA binding function. Nucl Acid Res 2003, 31:2811-2823.

34. McLaughlin WA, Berman HM: Statistical models for discerning protein structures containing the DNA-binding helix-turn-helix motif. J Mol Biol 2003, 330:43-55.

35. Stawiski EW, Gregoret LM, Mandel-Gutfreund Y: Annotating nucleic acid-binding function based on protein structure. J Mol Biol 2003, 326:1065-1079.

36. Luscombe NM, Austin SE, Berman HM, Thornton JM: An overview of the structure of protein-DNA complexes. Genome Biol 2000, 1:1-37.

37. Berman HM, Battistuz T, Bhat TN, Bluhm WF, Bourne PE, Burkhardt K, Iype L, Jain S, Fagan P, Marvin J et al.: The Protein Data Bank. Acta Cryst D 2002, 58:899-907.

38. Liang MP, Banatao DR, Klein TE, Brutlag DL, Altman RB: WebFEATURE: an interactive web tool for identifying and visualizing functional sites on macromolecular structures. Nucl Acid Res 2003, 31:3324-3327.

39. Bagley SC, Altman RB: Characterizing the microenvironment surrounding protein sites. Prot Sci 1995, 4:622-635.

40. Zemla A: LGA: a method for finding 3D similarities in protein structures. Nucl Acid Res 2003, 31:3370-3374.

41. Stark A, Russell RB: Annotation in three dimensions. PINTS: patterns in non-homologous tertiary structures. Nucl Acid Res 2003, 31:3341-3344.

42. • Sanishvili R, Yakunin AF, Laskowski RA, Skarina T, Evdokimova E, Doherty-Kirby A, Lajoie GA, Thornton JM, Arrowsmith CH, Savchenko A et al.: Integrating structure, bioinformatics, and enzymology to discover function - BioH, a new carboxylesterase from Escherichia coli. J Biol Chem 2003, 278:26039-26045.

One of the first papers to report on the assignment of function using a combination of theoretical and experimental techniques for a protein structure determined by a structural genomics consortium.
