Functional Sites in Protein Families Uncovered via an Objective and Automated Graph Theoretic Approach

Pramod P. Wangikar1*, Ashish V. Tendulkar2, S. Ramya1, Deepali N. Mali1 and Sunita Sarawagi2

1Department of Chemical Engineering, Indian Institute of Technology, Bombay, Powai, Mumbai 400 076, India

2Kanwal Rekhi School of Information Technology, Indian Institute of Technology Bombay, Powai, Mumbai 400 076, India

We report a method for detection of recurring side-chain patterns (DRESPAT) using an unbiased and automated graph theoretic approach. We first list all structural patterns as sub-graphs where the protein is represented as a graph. The patterns from proteins are compared pair-wise to detect patterns common to a protein pair based on content and geometry criteria. The recurring pattern is then detected using an automated search algorithm from the all-against-all pair-wise comparison data of proteins. Intra-protein pattern comparison data are used to enable detection of patterns recurring within a protein. A method has been proposed for empirical calculation of the statistical significance of a recurring pattern. The method was tested on 17 protein sets of varying size, composed of non-redundant representatives from SCOP superfamilies. Recurring patterns in serine proteases, cysteine proteases, lipases, cupredoxin, ferredoxin, ferritin, cytochrome c, aspartoyl proteases, peroxidases, phospholipase A2, endonuclease, SH3 domain, EF-hand and lectins show additional residues conserved in the vicinity of the known functional sites. On the basis of the recurring patterns in ferritin, EF-hand and lectins, we could separate proteins or domains that are structurally similar yet different in metal ion-binding characteristics. In addition, novel recurring patterns were observed in glutathione-S-transferase, phospholipase A2 and ferredoxin with potential structural/functional roles. The results are discussed in relation to the known functional sites in each family. Between 2000 and 50,000 patterns were enumerated from each protein, with between ten and 500 patterns detected as common to an evolutionarily related protein pair. Our results show that unbiased extraction of the functional site pattern is not feasible from an evolutionarily related protein pair but is feasible from protein sets comprising five or more proteins. The DRESPAT method does not require a user-defined pattern, size or location of the pattern and therefore has the potential to uncover new functional sites in protein families.

Keywords: catalytic tetrad; active site; backtracking algorithm; branch and bound technique; protein structure

Introduction

Structural biology is making rapid strides in terms of the technology, precision and speed with which protein three-dimensional structures are becoming available via X-ray crystallography,1 NMR2–4 and molecular modeling.5–11 The public domain Protein Databank (PDB) contains ca 18,000 entries for protein structures,12,13 while the proprietary databases are likely to make noticeable contributions in the near future.14 The possibility of predicting function,15 active ligand-binding site,16,17 phosphorylation site18 and designing ligands/inhibitors19 based on protein structure has generated significant excitement10,20 and has led to the initiation of several ambitious structural genomics projects in different parts of the world.1,21,22

Two distinct types of approaches are in use for the classification and analysis of protein 3D structures. The first approach deals with classification of proteins on the basis of overall structural similarity23–25 aimed at identifying structural neighbors, which in some cases may lead to common ancestral or functional relatedness.26,27 Examples of this approach are CATH28 (which clusters proteins at four major levels: class, architecture, topology and homologous superfamily), SCOP29,30 (structural classification of proteins), and FSSP (fold classification based on structure–structure alignment of proteins).24,31–33

The second approach is based on detection of a known structural pattern of a small number of amino acid residues (typically three to six) in the protein structure. This is aimed at predicting function by identifying a well-defined functional site in a new protein structure. Several methods have been reported for this approach. Gregory et al.34 have described a way of generating query templates using structural information from known metal-binding sites in proteins. This template is then used to search for additional potential metal-binding sites in the PDB. Similarly, Wallace et al.35 have described a method for automatically deriving 3D templates for known active sites. In a graph theoretic approach, the computer program ASSAM searches for a user-defined pattern of a small number of amino acid residues in a protein structure.36 The input pattern is considered as a set of pseudo-atoms for the start (S), midpoint (M) and end (E) of the side-chain of each residue and stored as distances between the pseudo-atoms (i.e. the distances SS, SM, MM, EE, etc.). This pattern is then compared against all possible patterns of similar amino acid content in the protein of interest. In addition to a method for searching a user-defined structural pattern, Russell37 has described an approach for extraction of structural patterns common to a protein pair.

Sequence signatures, derived from visual inspection or automated analysis,38–41 are a well characterized feature of proteins. For example, the PROSITE,42,43 PRINTS44,45 and Blocks46 databases provide an extensive survey of family/sub-family-wise sequence signatures based on multiple sequence alignments. The highly conserved residues in the sequence signatures are thought to be important for function. Alternatively, function can be assigned to uncharacterized open reading frames (ORFs) based on detection of well-known sequence signatures. Some of these sequence signatures have been converted to structural signatures by superimposing the corresponding sub-structures in an attempt to define a 3D functional pattern.47

The reported methods such as ASSAM36 are useful for searching for a known pattern in one or more protein structures. However, their applicability is limited to the known functional patterns. Russell's method37 of extraction of patterns common to a protein pair is limited, as several hundred common patterns emerge even in an evolutionarily related protein pair, which necessitates significant manual intervention to identify the biologically relevant patterns. We show that Russell's method37 suffers from a high false negative rate at low RMSD cutoffs and a high false positive rate at high RMSD cutoffs. The PROSITE/PRINTS sequence signatures typically do not provide all the functional residues in a single motif, as the functional site residues may be located far away from each other on the primary sequence.

To date, characteristic functional site patterns have been reported for a large number of protein families. Uncovering such sites typically involves site-directed mutagenesis, binding of irreversible inhibitors, spectroscopic studies and examination of X-ray crystal structures. Historically, the functional sites have been detected experimentally in one member of the family and subsequently inferred for others on the basis of sequence/structure alignments. For example, the Ser/His/Asp catalytic triad was first discovered in chymotrypsin48 and subsequently found in other "trypsin-like" and "subtilisin-like" hydrolases35,49,50 and lipases.51,52 Aspartoyl proteases are characterized by two Asp residues in a complex network of hydrogen bonds with other residues.53 Among the other families considered here, the blue copper proteins are electron transfer proteins with a metal co-ordination site composed of three strong equatorial ligands (Cys, His, His) and one weak axial ligand (Met, Phe, Leu or Gln).54 This site has a distorted tetrahedral geometry. The classical zinc finger, one of the biggest families of transcription factors, is characterized by a Cys2His2 metal co-ordination site.55 EF-like proteins share a helix-loop-helix motif rich in Asp and Glu residues and bind a calcium ion. Large numbers of such functional site patterns have been reported and we believe that many more are yet to be uncovered. With the availability of a large number of non-redundant PDB structures for each family, we decided to carry out an objective assessment of the recurring patterns in protein families.

The method presented here detects structural patterns of small numbers of amino acid residues recurring in k proteins given an input of n proteins. The method generates all possible structural patterns in all the protein structures considered, and then detects the most frequently recurring pattern on the basis of content and geometric similarity. The patterns that are generated are independent of the primary sequence, which eliminates the requirement for sequence alignments. The method does not require any user-defined pattern and hence it is useful in detecting new patterns. The 17 protein sets in this initial study have been chosen on the basis of a common functional role, which can potentially contain a common functional site. In addition to confirming the known functional site patterns in each protein set, we detected novel patterns in three protein sets and detected additional residues in the vicinity of known functional sites in 14 sets. The potential applications of the method and implications of the results are discussed.

Algorithm

The input is the 3D structures of n proteins in PDB format. The adjustable parameters include the distance deviation cut-offs (for Cα, Cβ and functional atoms), the RMSD cut-off, the preferred size of the pattern, if any (if the preferred size is not specified, then all the patterns made up of three to six amino acid residues are considered), and k, the minimum recurrence frequency for a pattern to be reported. The algorithm consists of three broad steps: (i) processing of PDB files for detection of patterns; (ii) pairwise comparison of patterns; and (iii) detection of recurring patterns in a family of proteins. The details of the various steps of the algorithm are provided below.

Graph theoretic representation of protein 3D structure and detection of patterns

A protein 3D structure is first represented as a labeled and weighted graph G(V, E) where the vertices, V, are the functional atoms from the side-chains of the amino acid residues (Figure 1). An edge exists between two vertices if they are within interacting distance (12 Å). The vertices have labels (amino acid type) while the edges have weights (distances). Only one functional atom is considered per amino acid residue37 in this graph theoretic representation. Also, residues with only hydrogen and carbon atoms in their side-chains (Ala, Phe, Gly, Ile, Leu, Pro and Val) are not considered in the analysis. These amino acid residues were found to have a low probability of being present in functional sites.31 Likewise, cysteine residues involved in disulfide bonds are ignored in the analysis. The graph is stored in the form of an adjacency matrix, which is generated by first calculating the distance matrix and assigning true values for those pairs of nodes whose distance is less than 12 Å. The distance matrix for the functional atoms is generated readily from the co-ordinate information available in the PDB file.
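The graph construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `build_graph`, the input layout (one pre-selected functional atom per residue) and the example coordinates are hypothetical, and the exclusion of disulfide-bonded cysteine residues is not implemented.

```python
import math

# Residues with only carbon and hydrogen in the side-chain are skipped, as in the text.
EXCLUDED = {"ALA", "PHE", "GLY", "ILE", "LEU", "PRO", "VAL"}
CUTOFF = 12.0  # interaction distance in angstroms, from the text


def build_graph(residues, cutoff=CUTOFF):
    """Build the distance and adjacency matrices for one protein.

    residues: list of (vertex_label, residue_type, (x, y, z)) tuples holding
    one functional atom per residue (hypothetical input layout).
    Returns (labels, dist, adj).
    """
    kept = [r for r in residues if r[1] not in EXCLUDED]
    n = len(kept)
    dist = [[0.0] * n for _ in range(n)]
    adj = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(kept[i][2], kept[j][2])
            dist[i][j] = dist[j][i] = d
            # An edge exists when the two functional atoms are closer than the cutoff.
            adj[i][j] = adj[j][i] = d < cutoff
    return [r[0] for r in kept], dist, adj


# Made-up coordinates, for illustration only.
example = [
    ("A120Ser", "SER", (0.0, 0.0, 0.0)),
    ("A188His", "HIS", (3.0, 4.0, 0.0)),
    ("A175Asp", "ASP", (20.0, 0.0, 0.0)),
    ("A10Gly", "GLY", (1.0, 1.0, 1.0)),  # dropped: excluded residue type
]
labels, dist, adj = build_graph(example)
```

In this toy input only the Ser and His vertices are joined by an edge; the Asp atom lies beyond the 12 Å cutoff from both.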

Detection of patterns

Figure 1. Schematic for detection of patterns of three to six amino acid residues (exemplified for 1cus): Graph theoretic representation of the protein as a graph G(V, E) where the vertices, V, are the functional atoms from the side-chains of the amino acid residues. An edge exists between two vertices if they are within interacting distance (12 Å). Step 1 involves detection of maximal complete subgraphs from the graph G(V, E) by applying a backtracking algorithm employing a branch and bound technique.56 A pattern as defined here is a complete subgraph containing three to six nodes. Thus, a pattern typically is subsumed within a maximal complete subgraph. Therefore, step 2 involves enumeration of all possible patterns from the maximal complete subgraphs. One functional atom is used per amino acid residue as described.37 The graph label is the PDB code for the protein (e.g. 1cus), while the vertex label is the chain identifier, amino acid residue position number and type (e.g. A102Ser).

A structural pattern is a complete subgraph containing three to six nodes. Step 1 involves detection of maximal complete subgraphs (maximal cliques) from the graph G(V, E) by applying a backtracking branch and bound algorithm as described.56 The backtracking algorithm requires input of the graph in the form of an adjacency matrix indicating the presence/absence of edges between various nodes. The algorithm first generates all maximal cliques. Then all possible patterns of three to six nodes are enumerated from each maximal clique. An index is assigned to a pattern to denote the protein PDB code and the labels of all the nodes that constitute the pattern. The patterns are listed and stored in a separate file for each protein.
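The two steps above (maximal cliques first, then all sub-patterns of three to six nodes) can be illustrated as follows. This sketch uses the standard Bron–Kerbosch procedure with pivoting as a stand-in; it is not necessarily the backtracking branch and bound algorithm of reference 56, and the function names and toy graph are hypothetical.

```python
from itertools import combinations


def maximal_cliques(adj):
    """Bron-Kerbosch with pivoting. adj maps each node to the set of its neighbours."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(frozenset(r))
            return
        # Pivot on the node with most neighbours in p to cut down branching.
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


def enumerate_patterns(adj, min_size=3, max_size=6):
    """All complete subgraphs of min_size..max_size nodes, as sorted tuples."""
    patterns = set()
    for clique in maximal_cliques(adj):
        members = sorted(clique)
        for size in range(min_size, min(max_size, len(members)) + 1):
            # Every subset of a clique is itself a complete subgraph.
            patterns.update(combinations(members, size))
    return patterns


# Toy graph: a, b, c, d are mutually connected; e touches only a.
toy = {
    "a": {"b", "c", "d", "e"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"a", "b", "c"},
    "e": {"a"},
}
patterns = enumerate_patterns(toy)
```

Collecting the patterns in a set removes duplicates that arise when two maximal cliques share three or more nodes.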

Criteria for comparison of patterns from two proteins

For the purpose of comparison of structure, each pattern is considered as a geometric object made up of the Cα and Cβ atoms for each amino acid residue, in addition to the functional atoms as described above (Figure 2). Patterns from two proteins are first compared for amino acid composition. The superposition of the two patterns then requires superposition of the respective atoms, which is evaluated by comparing the respective Cα–Cα, Cβ–Cβ, and functional atom–functional atom distances of one pattern with those of the second pattern. The three distance matrices for the Cα, Cβ and functional atoms are pre-computed and stored. The deviation in the respective distances is checked against the user-defined tolerance value. The comparison returns a true value only if all the deviations and the RMSD of the distances of the functional atoms are less than the respective tolerance values. The RMSD calculations were made following the procedure described by Russell.37 The patterns found to be common in two proteins are listed and this exercise is repeated for all the proteins to obtain an all-against-all comparison of patterns from the proteins under consideration.
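A minimal sketch of the comparison criteria is given below. Several points are assumptions rather than the authors' implementation: the pattern layout (a dict of residue types plus three distance matrices), the tolerance values, the brute-force search over residue correspondences, and the RMSD, which is taken here simply as the root mean square of the functional-atom distance deviations instead of following Russell's procedure.

```python
import math
from itertools import combinations, permutations


def patterns_match(p, q, tol, rmsd_cutoff):
    """Return True if patterns p and q agree in content and geometry.

    p, q: dicts with 'types' (list of residue letters) and 'ca', 'cb', 'fa'
    (square distance matrices, same residue order as 'types').
    tol: dict of distance-deviation tolerances keyed by 'ca', 'cb', 'fa'.
    """
    # Content criterion: identical amino acid composition.
    if sorted(p["types"]) != sorted(q["types"]):
        return False
    n = len(p["types"])
    pairs = list(combinations(range(n), 2))
    # Try every residue correspondence that preserves residue type.
    for perm in permutations(range(n)):
        if any(p["types"][i] != q["types"][perm[i]] for i in range(n)):
            continue
        within_tol = all(
            abs(p[key][i][j] - q[key][perm[i]][perm[j]]) < tol[key]
            for key in ("ca", "cb", "fa")
            for i, j in pairs
        )
        if not within_tol:
            continue
        squared = [(p["fa"][i][j] - q["fa"][perm[i]][perm[j]]) ** 2 for i, j in pairs]
        if math.sqrt(sum(squared) / len(squared)) < rmsd_cutoff:
            return True
    return False


# Made-up distances; q lists the same residues in a different order.
m1 = [[0, 5, 7], [5, 0, 6], [7, 6, 0]]
m2 = [[0, 7.2, 6.1], [7.2, 0, 5.0], [6.1, 5.0, 0]]
p = {"types": ["S", "H", "D"], "ca": m1, "cb": m1, "fa": m1}
q = {"types": ["D", "S", "H"], "ca": m2, "cb": m2, "fa": m2}
loose = {"ca": 0.5, "cb": 0.5, "fa": 0.5}
same = patterns_match(p, q, loose, rmsd_cutoff=1.0)
```

Because patterns hold at most six residues, the exhaustive search over correspondences stays small (at most 720 permutations), although an actual implementation would likely avoid it.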

Figure 2. Schematic for comparison of two patterns.

Detection of recurring pattern

A recurring pattern is defined as a pattern that is present in at least k (an input parameter) proteins such that, for any pair of these proteins, the respective patterns match based on the criteria of the previous section. Our goal is to find all such recurring patterns in the given set of n proteins. This problem can be handled readily using a graph theoretic representation of the patterns. A graph F(C, W) is constructed with the patterns found in each protein as the vertices, C (Figure 3). An edge, W, exists between two patterns only if they are found similar using the criteria described above. The graph F(C, W) is then subjected to another run of the clique detection algorithm.56 The output of this algorithm is all maximal cliques in the graph. The cliques that contain at least k vertices are those that correspond to a recurring pattern.

This operation can be computationally very intensive if carried out as stated here. Our implementation uses a number of techniques to make it efficient. We describe some of these: (1) The maximal complete subgraph detection algorithm is executed on each type of pattern separately. For example, the patterns containing Ser/His/Asp are compared only with patterns of identical content. Thus, an adjacency matrix is created for each type of pattern (say Cys/Cys/His, Asp/His/Ser, Cys/His/Asn, etc.) (Figure 4(B)). The total number of pattern types is 2197 (13³) for patterns containing three amino acid residues and even larger for patterns containing four to six residues; hence the backtracking algorithm may still require significant computational power. (2) We use the value of k, which dictates the minimum size of a clique, to make the search for the maximal cliques efficient. A given graph F(C, W) can contain a complete subgraph of at least k nodes only if it contains at least k nodes of degree at least k − 1 (or k, if each node is counted as adjacent to itself). If this criterion is not satisfied, the graph can be omitted from the analysis and one can go to the next pattern type. Further pruning can be done by isolating and eliminating the nodes that are not likely to be part of a complete subgraph of k nodes, then recounting the degrees of the remaining nodes and applying the above pruning criterion repeatedly. The backtracking algorithm is applied only if a satisfactory graph is left after the above pruning steps. The algorithm returns the pattern type (SST, DHS, etc.) and the identities of the proteins in which the pattern recurs, along with the pattern indices.
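The degree-based pruning can be sketched as follows. This is an illustration, not the authors' code: the function name is hypothetical, and the threshold assumes that degree counts only edges to other nodes, so a member of a k-node complete subgraph must have degree of at least k − 1.

```python
def prune_for_k_clique(adj, k):
    """Repeatedly drop nodes that cannot belong to a complete subgraph of k nodes.

    adj maps each node to the set of its neighbours (no self-loops assumed).
    Returns the pruned graph, or an empty dict if fewer than k nodes survive,
    in which case clique detection can be skipped for this pattern type.
    """
    adj = {v: set(nb) for v, nb in adj.items()}  # work on a copy
    changed = True
    while changed:
        weak = [v for v, nb in adj.items() if len(nb) < k - 1]
        changed = bool(weak)
        for v in weak:
            # Removing v lowers the degree of its neighbours, so the loop repeats.
            for u in adj.pop(v):
                if u in adj:
                    adj[u].discard(v)
    return adj if len(adj) >= k else {}


# Toy pattern graph: a triangle a-b-c with a tail c-d-e.
toy = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e"},
    "e": {"d"},
}
kept = prune_for_k_clique(toy, k=3)
```

For k = 3 the tail nodes are removed over two passes and the triangle is retained; for k = 4 nothing survives, so the backtracking search would not be run.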

Protein set

A protein set comprises non-redundant representatives from a SCOP superfamily.29,30 The members of a SCOP superfamily are thought to be evolutionarily/functionally related. A total of 128 such sets were created, with a minimum of ten PDB structures per set.

Estimation of statistical significance of a recurring pattern

The statistical significance is estimated by using n proteins chosen at random from a protein set and counting the number of distinct patterns, E, recurring in k proteins. An average value of E, taken over a large enough number of runs (typically between 20 and 50), each with a randomly selected protein set, is plotted against k. The statistical significance is dependent on the number of proteins in the input, n, the recurrence frequency, k, and the size of the pattern, s.

Results

Detection of patterns

As a first step in the DRESPAT (detection of recurring side-chain patterns) program, ca 3000 PDB structure files of non-redundant representatives from 128 SCOP superfamilies were processed to extract the patterns from each protein. The steps involved are described in detail in Algorithm. For each PDB file, four processed files were generated to store: (i) the list of patterns detected in the given protein, and the atom-to-atom distance matrices for (ii) the functional atoms, (iii) the Cα atoms and (iv) the Cβ atoms. Between 2000 and 50,000 patterns were detected in each protein. Examination of the plot of the number of patterns versus the size of the protein (Figure 4) reveals that certain proteins contain a substantially larger (or smaller) number of patterns for their size compared to an average protein of the same size. Clearly, the number of patterns detected is dependent on the quality of protein packing and the amino acid composition in addition to the protein size. A discussion on the dependence of the number of patterns on various parameters of the protein is beyond the scope of this paper. For proteins containing multiple chains, a fraction of the patterns were found to reside at the interface of two chains. For example, the pattern 3SC2 DB338 HB397 SA146 contains Asp and His from chain B, and Ser from chain A. Thus, the program can detect patterns at the protein–protein interface.

Figure 3. Algorithm for detection of recurring patterns in a protein family. A protein family is represented as a graph F(C, W) where the vertices are the patterns, C, from all the proteins under consideration and an edge, W, exists between two patterns if they are found equivalent/similar. The set of vertices, C, can be partitioned into a number of subsets, which is equal to the number of proteins in the family. Detection of the most frequently recurring pattern is achieved by identifying the maximal complete subgraph(s) from the graph F(C, W) by applying a backtracking algorithm employing a branch and bound technique.56

Figure 4. Number of patterns detected versus the number of amino acid residues in the protein sequence. The results are based on an analysis of 1094 PDB files and are restricted to the X-ray crystal structures.

Relative location of the patterns compared to the protein centroid

Hydrophobic amino acid residues are ignored in the present pattern detection algorithm. Thus, our method is biased toward detecting patterns of polar/charged residues, which typically are found closer to the protein surface. As an example, for protein structure 1CUS, our analysis of the distribution of the distance between the pattern centroid and the protein centroid indicates that the majority of the patterns are 8–20 Å from the protein centroid (Figure 5). A similar trend is observed for other proteins. The protein centroid was chosen as an approximate location of the hydrophobic core, as it is difficult to pinpoint the exact location of the hydrophobic core without manual intervention. In fact, no pattern was detected within 5 Å of the protein centroid.

Pair-wise comparison of proteins, false positives and false negatives

An analysis of pair-wise pattern comparison data of evolutionarily related protein pairs was carried out. As an example, we show the results for all possible protein pairs from the zinc-finger superfamily of the SCOP database,29,30 where the characteristic metal-coordination site is a Cys/Cys/His/His tetrad. In addition to the Cys/Cys/His/His tetrad, a large number of tetrads appear common to a protein pair at RMSD cutoffs of 1.0 Å (Figure 6) as well as 0.1 Å (data not shown). In fact, no protein pair returned the functional site pattern as the only pattern common to the pair. To analyze the success rate of pair-wise comparison results in extracting the functional site pattern as the sole common pattern, we categorized the results as: (i) true positives (a pattern pair that is reported to be similar and is detected as similar in the pattern comparison); (ii) false negatives (a pattern pair that is known to be similar but is not detected as similar in the pattern comparison); and (iii) false positives (any pattern pair other than the reported functional site pattern that is detected as similar in the results). Our results show a very high false positive rate but a low false negative rate at an RMSD cutoff of 1.0 Å (Figure 7(a)). Clearly, the large number of detected similar patterns in an evolutionarily related protein pair is a consequence of the sequence/structure similarities between the protein pair. In other words, it would not be possible to select the functional site pattern from the large number of detected patterns in an objective manner. At an RMSD cutoff of 0.1 Å, a high false negative rate is observed in addition to a high false positive rate. Thus, many of the reported similar patterns are not detected, probably because the experimentally determined structures typically have a resolution range of ca 1.0–3.0 Å. Similar results are shown for the cupredoxin (Figure 7(b)) and α-amylase (Figure 7(c)) superfamilies with the functional site patterns of Cys/His/His/Met and Asp/Asp/Glu/His/Tyr, respectively. Similar results were observed for other protein sets (data not shown).

Figure 5. Distribution of the distance between the COM of patterns and the COM of the protein for six representative proteins.

Figure 6. Distribution of number of patterns detected as common to an evolutionarily related protein pair. The number of tetrads that were detected as common to two proteins chosen from the non-redundant representative members of a given SCOP superfamily was counted. The distribution is generated by repeating this exercise for all possible protein pairs of that superfamily. The results are shown for SCOP superfamilies of zinc finger and cupredoxin with 91 (14C2) and 465 (31C2) protein pairs, respectively.

Figure 7. Analysis of pair-wise comparison data of evolutionarily related protein pairs. For a given protein set, common patterns detected from all possible protein pairs were categorized into true positives (known functional site patterns detected as common in protein pair), false negatives (known functional site patterns but not detected as common in protein pair) and false positives (other patterns detected as common in protein pair). The data are shown for (a) zinc finger, (b) cupredoxin and (c) α-amylase protein sets. The expected true positives in these sets are 406 (29C2), 666 (37C2) and 190 (20C2), respectively. Based on reports in the literature,54,85,104 the known functional site patterns considered are Cys/Cys/His/His (zinc finger), Cys/His/His/Met (cupredoxin) and Asp/Asp/Glu/His/Tyr (α-amylase).

The golden pattern

It is logical to expect that while there are a large number of common patterns in a protein pair, fewer patterns could be common in a larger protein set. Thus, it is of interest to determine the minimum number of PDB structures that are required in a protein set to extract the golden pattern, the functional site pattern, in an objective manner. Although a precise answer depends on the size and divergence of proteins, we have attempted to obtain an empirical estimate by submitting smaller sets, with three to nine proteins, to the DRESPAT program. For example, between two and seven tetrads are detected for sets comprising three or four proteins from the zinc finger superfamily (Table 1). However, only one tetrad, Cys/Cys/His/His, is detected when five or more proteins were submitted. For the cupredoxin superfamily, the Cys/His/His/Met/Asn pattern was detected as the sole pattern even with as few as three proteins in the input. The results for these two families, as well as other families (data not shown), indicate that the DRESPAT program can provide useful results with sets consisting of five or more proteins.

Statistical significance of the recurrence frequency of a pattern

It is important to collect the statistics of recurring patterns, since the protein sets are derived from SCOP superfamilies, whose members are thought to be evolutionarily/functionally related with significant structural similarity. Toward this end, n proteins, chosen at random from a SCOP superfamily, were used as input for the DRESPAT program to estimate the number of distinct patterns, E, with a frequency of recurrence k. A total of 128 SCOP superfamilies were used. An average value of E was obtained by repeating the runs for randomly chosen SCOP superfamilies.

The E value is dependent on the number of proteins in the input (n) and the size of the pattern (triad, tetrad, pentad, etc.). Thus, to exemplify, the dependence of E on k (recurrence frequency) is shown for different values of n for tetrads (Figure 8(a)) and pentads (Figure 8(b)). In all the cases considered, the E value decreases exponentially with increasing k, with R² values greater than 0.98 for the exponential fit. The statistical data collected for different values of n were fit to obtain an empirical correlation between the E value and the values of n and k for tetrads (equation (1)), pentads (equation (2)) and hexads (equation (3)):

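$$E(\text{tetrads}) = (3.9 \times 10^{3}\,n + 5.1 \times 10^{4})\,\exp\{(0.0261\,n - 1.28)\,k\} \tag{1}$$

$$E(\text{pentads}) = (3.4 \times 10^{3}\,n - 1.53 \times 10^{4})\,\exp\{(0.037\,n - 1.78)\,k\} \tag{2}$$

$$E(\text{hexads}) = (1.04 \times 10^{3}\,n - 3.4 \times 10^{3})\,\exp\{(0.032\,n - 2.01)\,k\} \tag{3}$$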

Table 1. Dependence of DRESPAT results on the size of the protein set

No. of proteins (n) | Protein set used as input for DRESPAT (PDB codes) | Patterns of highest recurrence frequency in the protein set

A. Zinc finger
3 | 1bhi, 1zaa, 1ubd | CCHH, CHHT
3 | 1rmd, 1tf6, 1yui | CCHH, CCHR
3 | 1zfd, 1ard, 4znf | CCHH, CHHS, CCHS
3 | 1sp2, 2drp, 2gli | CCHH, CCHK, CCHS, CCHT, CHHK, CHHM, CHHT, CHKM, CHMT, HHKM
4 | 1ard, 1sp2, 1ubd, 2gli | CCHH, CCHT, CHHT
4 | 1bhi, 1tf6, 2drp, 5znf | CCHH, CCHK, CHHT
4 | 1sp2, 1ard, 1zfd, 1zaa | CCHH
4 | 1rmd, 1tf6, 1yui, 4znf | CCHH
5 | 1ard, 1tf6, 1yui, 1zfd, 2gli | CCHH
5 | 1ubd, 1rmd, 2drp, 1sp2, 4znf | CCHH
5 | 1tf6, 1bhi, 4znf, 1zfd, 1zaa | CCHH
5 | 1sp2, 1zfd, 1rmd, 2gli, 1ard | CCHH
6 | 1rmd, 1sp2, 1tf6, 1ubd, 1yui, 1zaa | CCHH
6 | 1yui, 1zaa, 2drp, 5znf, 1ard, 1zfd | CCHH
6 | 2drp, 5znf, 2gli, 1tf6, 1ard, 1zfd | CCHH, CCHS
7 | 2drp, 5znf, 2gli, 1tf6, 1ard, 1zfd, 1ubd | CCHH
7 | 1ard, 1bhi, 1zaa, 1zfd, 4znf, 5znf, 1rmd | CCHH
8 | 1ard, 1bhi, 1zaa, 1zfd, 2drp, 4znf, 5znf, 1rmd | CCHH
8 | 1ubd, 1yui, 1zaa, 2gli, 4znf, 5znf, 1sp2, 1tf6 | CCHH
9 | 1rmd, 1sp2, 1tf6, 1ubd, 2drp, 2gli, 4znf, 5znf, 1bhi | CCHH

B. Cupredoxin
3 | 1aac, 1bqr, 1iuz | CHHMN
4 | 1aac, 1azc, 1azu, 1joi | CHHMM
4 | 1aac, 1azc, 1azu, 1joi | CHMNT
5 | 1bxv, 1dyz, 1iuz, 1plc, 2plt | CHHMN
6 | 1azu, 1bq5, 1bxv, 1dyz, 1iuz, 1joi | CHHMN
7 | 1aac, 1azc, 1bq5, 1bxv, 1cuo, 1dyz, 1iuz | CHHMN
8 | 1aac, 1bq5, 1bqr, 1bxv, 1byp, 1dyz, 1iuz, 1joi | CHHMN

From the zinc finger protein set, smaller sets were constructed by randomly selecting from three to nine PDB structures. Patterns of highest recurrence frequency are reported for each protein set. A similar procedure was followed for the cupredoxin protein set. For RMSD and deviation cutoffs, refer to the footnote to Table 2.

These equations can be used to obtain the statistical significance of recurring patterns in test families. Note that the empirical equations (1)–(3) were fitted using n values between 10 and 70 and k values greater than 2. Hence, we caution against their use outside this range.
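For convenience, the empirical correlations can be evaluated with a few lines of code. The sketch below transcribes the coefficients of equations (1)–(3) and refuses inputs outside the fitted range stated above. The names `COEFFS` and `expected_patterns` are hypothetical, and treating the range limits as hard errors is an assumption of this sketch rather than something prescribed in the text.

```python
import math

# Coefficients (a1, a0, b1, b0) of E = (a1*n + a0) * exp((b1*n + b0) * k),
# transcribed from equations (1)-(3).
COEFFS = {
    "tetrad": (3.9e3, 5.1e4, 0.0261, -1.28),
    "pentad": (3.4e3, -1.53e4, 0.037, -1.78),
    "hexad": (1.04e3, -3.4e3, 0.032, -2.01),
}

def expected_patterns(size, n, k):
    """Return the empirical E value for a pattern size, set size n and recurrence k."""
    # The equations were fitted for n between 10 and 70 and k greater than 2.
    if not 10 <= n <= 70 or k <= 2:
        raise ValueError("outside the fitted range (10 <= n <= 70, k > 2)")
    a1, a0, b1, b0 = COEFFS[size]
    return (a1 * n + a0) * math.exp((b1 * n + b0) * k)

# Example call: a tetrad recurring in 10 of 20 proteins.
e_value = expected_patterns("tetrad", 20, 10)
```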

Results for functional role families

A total of 128 protein sets were created by selecting non-redundant representatives from SCOP superfamilies. We selected 17 protein sets of varying size in this initial study. Results of the DRESPAT program and their relevance to the known functional sites are presented below. The results have been broken into three artificial categories on the basis of the size of the protein set.

Large protein sets (more than 50 proteins) (Table 2)

Ser/His/Asp/Thr tetrad in serine proteases

Pattern search programs have now detected the Ser/His/Asp triad in all serine hydrolases.35,36 Additional catalytic machinery involves alignment of one or two hydrogen bond-donating groups towards the likely position of the negatively charged oxygen atom (oxyanion hole) for stabilization of the tetrahedral transition state. In serine hydrolases, these hydrogen bond-donating groups are generally peptide nitrogen atoms, one of which belongs to the catalytic Ser residue.57 Thus, it is logical to expect that the program DRESPAT would return the Ser/His/Asp catalytic triad as the most frequently recurring structural pattern. A Ser/His/Asp/Ser/Thr pentad was detected in 42 proteins, while a subset of the pentad, a Ser/His/Asp/Thr tetrad, was found to recur in all 60 proteins, suggesting potential biological significance for the Thr residue detected in addition to the well-known triad. The side-chain of the Thr residue is within hydrogen bonding distance of the oxyanion hole. Figure 9(a) shows the Ser/His/Asp/Ser/Thr pentad superimposed from three representative structures, with the additional Ser and Thr residues within hydrogen bonding distance of the putative oxyanion. Catalytic tetrads composed of Ser/His/Asp/Ser or Ser/His/Asp/Cys have been reported in serine proteases.35,37,58–60 However, we detected a Ser/His/Asp/Thr tetrad with a greater recurrence frequency.
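The relationship between a pattern and its sub-patterns can be illustrated with a content-only sketch: a tetrad that is a sub-multiset of a pentad recurs in at least as many proteins as the pentad. The code below ignores the geometry criteria that DRESPAT also applies, and the function names and toy input are hypothetical.

```python
from collections import Counter

def is_subpattern(small, large):
    """True if every residue type in `small` occurs at least as often in `large`."""
    small_counts, large_counts = Counter(small), Counter(large)
    return all(large_counts[res] >= cnt for res, cnt in small_counts.items())

def recurrence(pattern, detected):
    """Count proteins whose detected pattern contains `pattern` as a sub-pattern."""
    return sum(is_subpattern(pattern, found) for found in detected.values())

# Toy input with hypothetical protein names (not the 60-protein set).
detected = {
    "protA": ["Ser", "His", "Asp", "Ser", "Thr"],
    "protB": ["Ser", "His", "Asp", "Thr"],
    "protC": ["Ser", "His", "Asp", "Ser", "Thr"],
}
pentad = ["Ser", "His", "Asp", "Ser", "Thr"]
tetrad = ["Ser", "His", "Asp", "Thr"]

pentad_k = recurrence(pentad, detected)  # proteins containing the full pentad
tetrad_k = recurrence(tetrad, detected)  # proteins containing the tetrad subset
```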

EF-hand

These calcium-binding proteins share a helix-loop-helix motif and have a seven oxygen atom calcium co-ordination sphere61 with an approximate pentagonal bipyramidal geometry employing either amino acid side-chain carboxylate, amide and/or hydroxyl groups, a main-chain carbonyl oxygen atom and/or a water molecule.61 We detected an Asp/Asp/Asp/Glu/Glu/Lys/Asn/Ser pattern in 42 of the 65 proteins, with residue positions matching some of those reported for the calcium co-ordination sphere.61 The pattern detected, along with the bound metal ion, is shown for two representative proteins (Figure 9(b)). Our results reflect accurately the number of calcium-binding sites in each protein. For example, PDB structure 1jba has four EF-like motifs but only two calcium-binding sites,62 and we detect these two sites accurately. Also, several proteins have lost the ability to bind calcium despite the presence of an EF-like motif. For example, Cdc4p (PDB code 1ggw) contains four EF-hand motifs but does not bind calcium.63 We did not detect the above pattern in several such proteins that do not bind Ca2+, including Cdc4p.
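The per-protein site counts discussed above can be read off the Table 2 entries mechanically. The sketch below assumes the notation used in that table: occurrences of the pattern within one PDB structure are separated by commas, and residue positions are separated by "/". It further assumes that "–" marks a pattern residue with no match in that occurrence. The function name `parse_entry` is hypothetical.

```python
def parse_entry(entry):
    """Split 'pdb:pos/pos/..., pos/pos/...' into (pdb_code, list of sites).

    Each site is a list of residue-position strings. None stands for a dash,
    assumed here to mark a pattern residue with no match in that site.
    """
    pdb_code, _, rest = entry.partition(":")
    sites = []
    for occurrence in rest.split(","):
        positions = [p.strip() for p in occurrence.strip().split("/")]
        sites.append([None if p in ("–", "-") else p for p in positions])
    return pdb_code.strip(), sites

# Entry for 1jba as listed in Table 2B.
code, sites = parse_entry(
    "1jba:105/107/–/116/–/106/109/–, 151/162/–/169/–/–/160/166"
)
n_sites = len(sites)  # number of pattern occurrences reported for 1jba
```

For 1jba, the two parsed occurrences correspond to the two calcium-binding sites mentioned in the text.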

Figure 8. A plot showing the statistical significance (E, or expectation value) of a recurring pattern. The plot is generated by running the program DRESPAT on a set of evolutionarily related proteins. E is the number of distinct patterns expected to recur in k proteins when n proteins, chosen at random from the non-redundant representative members of a SCOP superfamily, are used as input. The E value is obtained as an average over 20 runs, with a randomly selected SCOP superfamily for each run. (a) Tetrads; (b) pentads.


Table 2. Recurring patterns observed in large protein sets with more than 50 proteins

Structural pattern [a] | PDB identification numbers with the residue positions for the detected pattern [b]

A. Serine protease [c]

Asp/His/Ser/Ser/Thr (k = 60, n = 60)
1a0j-a, 1gj7-b, 1h4w-a, 1a5i-a, 1f5k-u, 1ij7, 3ptb-a, 1ane-a, 1a5h-a: 102/57/195/214/229; 3sgb-e, 1p02-a, 2alp-a, 3lpr-a, 1ao5, 2hlc-a, 1gmc-a, 2sga-a, 1sgc-a, 5sga-e, 5cha, 4cha, 1azz-a,c, 2kai, 2pka, 1ab9, 1acb, 1bbr, 1sgt-a, 3hat-h, 1euf-a, 1fuj-a, 1klt-a, 1thr-a, 1npm-a, 3est-a, 1qnj-a: 102/57/195/214/54; 1a1r-a: 107/83/165/–/80; 1elc-a, 1esa-a: 108/60/203/222/57; 1arc-a, 1arb-a: 113/57/194/–/54; 1agj-a: 120/72/195/211/69; 1st3-a, 1svn-a: 32/62/215/–/33; 1gci-a, 1sca-a, 1cse-e: 32/64/125/–/33 (71); 2sec, 1dui-a: 32/64/221/–/71; 1sup-a: 32/64/63/–/66; 1thm-a, 1tec: 38/71/225/–/228; 2pkc-a, 2prk-a: 39/69/224/–/227; 1dbi-a: 39/72/226/–/40; 1a0h-b: 419/363/525/546/360; 1elv-a: 514/460/617/639/457; 2sfa-a: 65/35/147/162/32; 1bef-a: 75/51/135/–/48; 1dxp-a: 81/57/139/–/54

B. EF-hand

Asp/Asp/Asp/Glu/Glu/Lys/Asn/Ser (k = 42, n = 65)
1jfj:10/14/–/21/–/–/12/18, 117/121/–/128/–/–/119/–, 46/48/–/57/–/–/50/–; 1aui-b:101/103/99/

110/–/–/–/107, 140/142/144/151/–/141/–/148, 62/64/70/73/68/72/66/–; 1hqv:103/105/111/114/–/–/–/–, 36/38/–/47/–/37/–/44; 1tn4:103/107/111/114/–/–/105/–, 139/143/147/150/–/140/141/–, 27/29/33/38/–/–/–/–, 56/63/65(71)/64(74)/–/–/–/–; 2scp:104/108/–/115/–/–/106/112,138/142/–/149/–/–/140/146; 1jba:105/107/–/116/–/106/109/–, 151/162/–/169/–/–/160/166;1fi5:105/113(109)/145/116/–/106/107/–, 141/145/149/152/–/–/143/–; 1ncx:106/110/114/117/–/–/108/–, 142/146/150/153/–/143/144/–; 5tnc:106/110/114/117/–/107/108/–, 142/146/150/153/–/143/144/–; 1g8i:109/111/113/120/–/163/–/–, 73/77/–/84/81/76/75/–; 1rec:110/112/–/121/–/–/114/118;2sas:115/119/123/126/–/–/–/–, 70/74/–/81/–/73(66)/72/78; 1ahr:129/131/133/139/140/–/–/–, 20/22/24/31/–/21(30)/–/–, 56/58/64/67/–/–/60, 93/95/–/104/–/94/97/101; 1alv:150/152/–/161/–/156/–/151, 180/182/–/191/–/–/–/190; 1dgu:153/155/157/164/–/–/161/152; 1dvi:154/155/–/165/–/160/–/155; 1fpw:157/161/–/168/–/158/159/–, 73/75/–/84/–/–/77/–, 109/113/–/120/–/–/111/–;3cln:20/22/–/31/–/21/–/–, 131/133/–/140/–/21/–/–, 58/56/64/67/–/–/60/–, 93/95/–/104/–/–/97/–; 1sra:257/259/267(261)/268/–/–/–/–,222/227/–/234/–/–/–/231; 1rro:51/53/59/62/–/69/–/–,90/92/94/101/–/–/91/–; 1cdp:51/53/61/54/62/96/–/91, 90/92/94/101/–/96/–/91; 1pvb:51/53/61/59/62/–/–/–, 90/92/94/101/–/91/–/–; 1bu3:51/53/61/59/62/54/–/–, 90/92/94/101/–/–/–/91;1rtp:511/531/611/591/621/521/–/–, 901/921/941/1011/–/911(961)/–/–; 4icb:54/58/–/65/–/55/56/62;1lin:56/58/64/67/–/–/60/–, 129/131/133/140/–/–/–/–, 20/22/24/31/–/21(30)/–/–, 93/95/–/104/–/94/97/101; 4cln:56/58/64/67/–/–/60/–, 129/131/133/140/–/–/137/–, 20/22/24/31/–/21/–/–, 93/95/–/104/–/–/97/101; 1cll:56/58/64/67/–/–/60/–, 129/131/133/140/139/–/137/–, 20/22/24/31/–/30/–/–; 1exr:56/58/64/67/–/–/63/–, 129/131/133/140/–/–/–/–, 20/22/24/31/–/21/–/–;1eh2:58/60/62/69/–/99/–/57; 1df0:585/587/–/547/596/591(595)/–/–; 1mr8:59/63/–/70/–/–/–/61/–; 1e8a:61/65/69/72/–/–/63/–; 1tco:62/64/70/73/–/72/66/36, 101/103/99/110/–/–/–/107, 140/142/144/151/–/141/–/148; 
1psr:62/66/70/73/–/68/64/–; 1ap4:65/67/–/76/–/–/–/–/–; 1iq3:66/68/70/77/–/–/–/65; 1qls:66/68/70(74)/77/–/–/–/–; 1irj:67/71/–/78/–/–/69/75; 1bjf:73/77/81/84/–/–/75/–, 157/161/–/167/168/–/159/165, 109/111/–/120/–/163/113/117;1a75-a:90/92/94/100/101/–/–/91, 51/53/61/62/59/–/–/–; 5pal:90/92/94/101/–/96(104)/–/91,51/53/59/62/96/–/–/–; Pattern not found in: 1a4p, 1b47, 1cb1, 1cfd, 1cfp, 1cnp, 1eg3, 1ej3,1el4, 1f4o, 1ggw,1h8b, 1j7q, 1jf0,1juo,1k2h, 1kfu,1qas, 1qjt, 1sym, 1uwo, 1wdc, 2mys

C. Concanavalin A-like lectins/glucanases

Asp/Asp/Asp/Glu/His/Asn/Ser/Thr (k = 22, n = 63)
1azd-a, 1h9p-a, 1nls-a:10/19/28/8/24/14/34/37; 1les-a, 1lgc-a, 1loe-a, 2ltn-a:121/129/140/119/136/125/

146/149; 2pel-a:123/132/141/121/137/127/147/150; 1wbl-a:124/131/140/122/136/128/146/149; 1g8w-a:124/132/141/122/137/128/147/150; 1f9k-a:125/132/141/123/137/129/147/150; 1lul-a:125/133/142/123/138/129/148/151; 1fny-a:127/135/–/125/140/131/150/153; 1fx5-a:128/138/147/126/143/135/153/–;1qnw-a:128/139/148/126/144/136/154/157; 1lte-a:129/136/146/127/142/133/152/155; 1dbn-a:129/140/149/127/147/–/155/158; 1led-a:131/140/149/119/145/135/155/158; 1sbd-a:133/126/–/124/138/130/148/151; 1qmo-a:140/149/158/138/154/144/164/167; 1dgl-a:19/28/–/8/24/14/34/–; 1gpi-a:246/248/251/212/–/–/171/–; Pattern not found in: 1a3k, 1af9, 1axk, 1b09, 1bk1, 1c1f, 1c4r, 1cpn, 1d2s, 1eg1, 1epw,1f5j, 1gan, 1gbg, 1gnz, 1h8v, 1hix, 1hlc, 1ikq, 1ioa, 1jhn, 1kit, 1nlr, 1ovw, 1pvx, 1qh6, 1qkq, 1qmj, 1qu0, 1sac,1sll, 1slt, 1ukr, 1xnb, 1xnd, 1xyn, 1xyo, 1yna, 2a39, 2ayh, 3bta

Non-redundant representative members of a superfamily were chosen from the SCOP database29,30 to create a protein set unless mentioned otherwise. Recurring patterns of highest statistical significance from each protein set are reported without using any bias about the known functional sites in the protein set. The deviation and RMSD cutoffs used during the DRESPAT run are as follows: Cα and Cβ deviation cutoffs of 6.5 Å and 5.5 Å, respectively, and an RMSD cutoff of 1.0 Å.

[a] k and n are the recurrence frequency and the protein set size (number of proteins), respectively.

[b] When several non-redundant PDB structures share the residue positions of the recurring pattern, their PDB codes are separated by commas and followed by the residue positions. Likewise, multiple patterns, when present in one PDB structure, are separated by commas. The PDB structures in which a given pattern was not detected are reported against each pattern in order to complete the list of PDB structures used in a given protein set.

[c] The serine protease protein set was constructed using representative members from the trypsin-like superfamily and the subtilisin-like superfamily.


Concanavalin A-like lectins/glucanases

Lectins are a class of proteins with polyvalent affinity for oligosaccharides and are present as homodimers/tetramers. Their carbohydrate specificity is similar to antigenic recognition by antibodies. Lectins are important due to their application in identifying blood group-determining oligosaccharides. Lectins typically bind two metal ions per subunit:64 one Ca2+ and one Mn2+. From the superfamily of 65 legume lectins, animal galectins and glucanases, we detected an Asp/Asp/Asp/Glu/His/Asn/Ser/Thr pattern in 22 legume lectins. The patterns match the proposed Ca2+ and Mn2+ co-ordination sites in lectins (Figure 9(c)), where the two ions are reported to be 4.8 Å apart in lectin I from Ulex europaeus.64–66 This pattern was not detected in any of the glucanases or any “non-legume lectin” member of the superfamily. This is consistent with the reported lack of Ca2+ and Mn2+ ligand-binding sites in glucanases as well as galectins (mammalian lectins).67,68

Medium-size protein sets (between 16 and 50 proteins) (Table 3)

Ser/His/Asp/Thr tetrad in lipases

Lipases (triacylglycerol hydrolase; EC 3.1.1.3) are used widely to catalyze hydrolysis and esterification reactions of a wide variety of substrates. In addition to the Ser/His/Asp(Glu) catalytic triad, data have been accumulated on the oxyanion hole, the lid and other residues involved in catalytic activity and selectivity of lipases.69,70 In lipases from filamentous fungi, a Ser or Thr side-chain is postulated to contribute a third hydrogen bond to the stabilization of the oxyanion.71 Our results confirm Ser/His/Asp as the most frequently recurring pattern in lipase structures, with an additional Thr detected as a recurring residue in the vicinity of the catalytic triad. The Thr side-chain is within hydrogen bonding distance of the oxyanion and potentially plays a role in the transition state stabilization (Figure 9(d)).

Cys/His/Asn(Asp)/Gln tetrad in cysteine proteases

In the majority of cases, the catalytic triads of cysteine proteases include an Asn rather than an Asp residue. An Asp residue has been found in the triad in arylamine N-acetyltransferase72 and a human deubiquitinating enzyme, UCH-L3.73 Based on structural and mutational studies, the oxyanion is proposed to be stabilized by hydrogen bonding with the peptide nitrogen atom of the catalytic cysteine and the amide nitrogen atom of a Gln residue in papain as well as in human cathepsin K.74,75 However, an equivalent Gln residue is missing from some members of the cysteine protease superfamily, such as the arylamine N-acetyltransferase enzymes from Salmonella typhimurium and Mycobacterium smegmatis.72,76 In addition, several Trp residues are conserved across various cysteine proteases. Recently, one of the Trp residues has been shown by site-directed mutagenesis studies to be vital for activity in transglutaminase,77 and has been proposed to play a role in stabilizing the transition state. Our results show a Cys/His/Asn(Asp)/Gln tetrad recurring in 18 of the cysteine proteases, while a larger pattern containing an additional Trp recurred in nine proteins. Of these, the catalytic triad residues are in agreement with those reported in the literature.73,76,78,79 Moreover, the sequence position of the recurring Gln residue matches that of the oxyanion Gln residue of papain as proposed by Menard et al.75 In other structures, the Gln is within hydrogen bonding distance of the oxyanion and appears to contribute to the stabilization of the transition state (Figure 9(e)). This Gln was not detected in the arylamine N-acetyltransferase family (PDB codes 1gx3 and 1et2). The conserved Trp appears only in proteins of the papain-like family.

Cys/His/His/Met/Asn copper coordination site

Blue copper proteins are electron transfer proteins with unusual spectroscopic properties (UV–visible and electron paramagnetic resonance (EPR)) and an exceptionally high redox potential. Blue copper proteins, including multicopper oxidases, feature a number of copper centers as part of their overall catalytic apparatus.54 The active site has two His residues and one Cys residue as the strong equatorial ligands and Met, Leu, Phe, or Gln as a weak axial ligand. Thus, the four ligating residues form a distorted tetrahedron. One of the His residues is located on a β strand, while the other three are close together in the C-terminally located loop region. DRESPAT analysis of the 39 representative PDB structures

Figure 9. The important recurring patterns observed in the protein families. Two or three representative structures, superimposed on each other, are shown for each family. Separate panels are used for distinct patterns observed in a given family. Metal ions are shown in spacefill mode wherever they are present within hydrogen bonding distance of the detected patterns. (a) Serine protease: 4cha, blue; 2sfa, red; 1azz, green; (b) EF-hand: 1tn4, red; 5tnc, blue; (c) concanavalin A-like lectins: 1azd, red; 1les, blue; 1lul, green; (d) lipases: 4lip, red; 1cvl, blue; 1ex9, green; (e) cysteine protease: 1atk, red; 1gec, blue; 1fh0, green; (f) cupredoxin: 1aac, red; 1aq8, blue; 1ag6, green; (g) ferritin: 1xsm, blue; 1xik, green; (h) cytochrome c: 1jju, red; 1jmx, blue; 1cyi, green; (i) α-amylase: 1amy, red; 1aqh, blue; 1bf2, green; (j) aspartoyl protease: 1sme, red; 1htr, blue; 1b5f, green; (k) phospholipase A2: 1ijl, blue; 1ae7, green; (l) phospholipase A2: 1cl5, red; 1gmz, blue; (m) glutathione-S-transferase: 1aw9, red; 1e6b, blue; 1hna, green; (n) glutathione-S-transferase: 1fhe, red; 1gta, blue; (o) glutathione-S-transferase: 1gsy, red; 1f3a, blue; 1hna, green; (p) SH3-domain: 1ad5, red; 1qwe, blue; (q) zinc finger: 1sp2, red; 1zaa, blue; 1yui, green; (r) ferredoxin: 1fb3, red; 1fnb, blue; 1jb9n, green; (s) peroxidase: 1bgp, red; 2atj, blue; 1qgj, green; (t) peroxidase: 1bgp, red; 2atj, blue.


Table 3. Recurring patterns observed in medium-size protein sets with between 16 and 50 proteins

Structural pattern PDB identification numbers with the residue positions for the detected pattern

A. Lipases (α/β hydrolases)

Asp/His/Ser/Thr (k = 25, n = 27)
1bu8-a, 1gpl-a, 1hpl-a, 1rp1-a: 105/75/110/36; 1eth-a: 106/76/111/37; 1i6w-a: 133/156/77/–; 1jfr-a: 177/

209/131/178 (184); 1tca-a: 187/224/105/138; 1tia-a: 199/259/145/–; 1qge-d: 2/40/39/–, 36/40/39/–; 1tib-a: 201/258/146/–; 3tgl-a: 203/257/144/265 (173); 1lgy-a: 204/257/145/174; 1ex9-a: 229/251/82/114; 1evq-a:252/282/155/186: 1jji-a: 255/285/160/–; 1k8q-a: 257/6/254/8; 1cvl-a: 263/285/87/250; 4lip-a: 264/286/87/251; 1jkm-a: 308/338/202/–; 2bce-a: 320/435/194/397; 1hlg-a: 324/353/153/188; 1f6w-a: 438/435/194/316;1cle-a: 452/449/209/–; 1crl-a: 452/449/209/416; 1lpb, 1tic: pattern not detected

B. Cysteine proteinases

Cys/Asp/His/Asn/Gln/Trp (k = 19, n = 19)
1avp-a: 122/–/54/–/115; 1cv8-a: 24/–/120/141/18/143; 8pch-a: 25/–/159/–/19/177; 1f2a-a: 25/158/159/

175/19/177; 1gec-e, 1ppo-a: 25/158/159/179/19/181; 2act-a: 25/161/162/182/19/184; 1fh0-a: 25/162/163/187/19/189; 1atk-a: 25/–/162/182/19/184; 1cqd-a: 27/–/161/181/21/183; 1qdq-a, 1the-a: 29/–/199/219/23/221; 1euv-a: 580/531/514/–/574/448; 1e2t-a: 69/122/107/–/–/–: 1gx3-a: 70/127/110/–/–/–; 1cb5-a:73/–/372/396/67/398; 1gcb-a: 73/–/369/392/67/394; 1cmx-a: 90/181/166/–/84/–; 1uch-a: 95/184/169/–/89/–

C. Cupredoxin

Cys/His/His/Met/Asn (k = 28, n = 39)
1azu-a, 1dyz-a, 1joi-a, 1nwo-a, 1azc-a, 1cuo-a: 112/117/46/121/47; 1qhq-a: 122/127/57/132/58; 1kbw-a:

135/143/94/148/95; 1aq8-a, 1bq5-a: 136/145/95/150/96; 1f56-a: 74/34/79/84/35; 1pmy-a, 1adw-a, 1paz-a1bqk-a, 1bqr-a: 78/40/81/86/41; 2cbp-a: 79/39/84/89/40; 1ag6-a,1bxv-a,1byp-a, 1iuz-a, 1pcs-a, 1plc-a, 2plt-a, 7pcy-a: 84/37/87/92/38; 1kdj-a: 87/37/90/95/38; 1baw-a: 89/39/92/97/40; 1aac-a: 92/53/95/98/54;1a65, 1bq5, 1cyw, 1ehk, 1fwx, 1hfu, 1ibz, 1jer, 1qle, 1qni, 1rcy: Pattern not detected.

D. Ferritin

Asp/Glu/Glu/Glu/His/His (k = 13, n = 20)
1jgc-a, 1bcf-a: 126/127/51/94/130/54; 1ryt-a: –/128/53/20/131/56; 1eum-a: –/130/49/50/46/53; 1mhy-d,

1mty-d 143/114/144/209(243)/147/246; 1r2f-a: 191/192/158/98/101/195; 1kgn-a: 201/108/168/202/111/205; 1afr-a: 228/143/196/229/146/232(203); 1xik-a: 237/115/204/238/118/241; 1xsm-a: 266/170/233/267/173/270; 1jk0-a: 272/176/239/273/179/276;1krq-a: 52/17/49/50/53/–; Pattern not found in: 1aew, 1dps,1exs, 1h96, 1mfr, 1qgh, 2fha

E. Cytochrome c

Cys/Cys/His/Met/Tyr (k = 50, n = 50)
1fcd-c:101/104/105/147/–, 11/14/15/54/–; 2dvh-a:10/13/14/57/–; 1a56-a, 1ayg-a:10/13/14/59/–;

1ezv:101D/104D/105D/225D/–, 133C/134C/183C/–/–; 1jju-a:11/14/15/43/–, 100/103/104/–/–; 1kx2-a:11/14/15/53/–; 1c52-a:11/14/15/69/–; 451c-a:12/15/16/22/–, 12/15/16/61/–; 1jmx-a:12/15/16/44/–,100/103/104/–/–; 1cor-a:12/15/16/61/–; 1hh7-a:13/16/17/14/–; 1cry-a:13/16/17/79/66; 1c2r-a:13/16/17/96/75; 1gks-a:14/17/18/55/–, 14/17/18/74/–; 1gdv-a:14/17/18/58/–; 1c6s-a:14/17/18/58/–, 14/17/18/19/–; 1cyi-a:14/17/18/60/–, 14/17/18/19/–; 1f1f-a:14/17/18/62/–;1qn2-a:14/17/18/78/–; 1hrc-a,1ycc-a, 5cyt-r:14/17/18/80/–; 1cyc-a:14/17/18/80/67; 3c2c-a:14/17/18/91/–; 1cno-a:14/17/18/60/–;1ql3-a:14/17/18/78/–; 1cot-a, 1cxc-a:15/18/19/100/79; 1c6o-a, 1ctj-a:15/18/19/61/–; 1jdl-a:15/18/19/98/–,15/18/19/16/–; 1iqc-a:183/186/187/258/–, 39/42/43/–/–; 1cc5-a:19/22/23/63/–, 19/22/23/84/–; 1hro-a:19/22/23/84/–; 1eb7-a:197/200/201/275/–, 51/54/116/–/–, 51/54/55/–/–; 1ccr-a:22/25/26/88/75;1c75-a:32/35/36/71/–; 1e29-a:37/40/41/–/–; 1bcc-d, 1be3-d:37/40/41/160/134; 1f1c-a:37/40/41/–/–,37/40/92/–/–; 1dw0-a:43/46/47/–/–; 1nir-a:47/50/51/88/–; 1e8e-a:49/52/53/–/–; 2mta-c:57/60/61/101/–; 1kb0-a:604/607/608/647/–; 1dii-c:615/618/619/650/–, 615/618/617/–/–; 1qks-a:65/68/69/–/–;1aof-a:65/68/69/106/–

F. α-Amylase

Asp/Asp/Asp/Glu/His/Arg/Tyr (k = 21, n = 21)
1gju-a: –/147/167/107/540/–/170; 1hvx-a: 101/331/234/264/106(330)/232/57; 1gcy-a: 112/294/193/219/

293/191/78; 2aaa-a, 6taa-a 117/206/297/230/296(122)/204(344)/82; 1qho-a: 127/329/228/256/132(328)/226(376)/92; 1cyg-a: 131/225/324/253/323(136)/223(371)/96; 1cdg - a, 1pam -a 135/229/328/257/327(140)/227/100;1ciu-a: 136/230/329/258/328(141)/228(375)/101; 1aqh-a: 174/264/84/200/263(89)/172/50; 1bag-a:176/269/97/–/102/–/62; 1amy-a: 179/289/87/204/92(288)/177/51; 1g5a-a: 182/393/286/328/392(187)/284/147; 1jae-a: 185/287/94/222/99(286)/183/60; 1eh9-a: 187/377/252/283/376(192)/250/152; 1pig - a,1smd - a 197/300/96/233/101(299)/195/62; 1uok-a: 199/329/98/255/103(328)/197(415)/63;1bvz-a: 239/421/325/354/420(244)/323/191(204); 1sma-a: 242/424/328/357/247(423)/326(472)/207; 1bf2-a: 292/510/375/435/297(509)/373/250


G. Acid protease

Asp/Asp/Thr/Thr (k = 21, n = 21)
1sme-a, 1qs8-a: 214/34/217/35; 2jxr-a, 1epn-e, 1am5-a, 1mpp-a, 1b5f-a, 4cms-a, 1pso-e, 3psg-a: 215/32/218/

33; 1hrn-a, 1smr-a: 215/32/216/33; 1htr-b:217/32/220/33; 1zap-a:218/32/221/222; 2apr-a:218/35/221/36;1j71-a: 218/32/221/33; 1qdm-a:223/36/226/37; 1fkn-a:228/32/231/33; 1lyb: 69A 74B 71A 76B; 2asi-a:237/38/240/39; 1dif:25A/25B/26A/26B; 1ida:25A/25B/26A/26B; 1siv:25A/25B/26A/26B; 1ppm-e:213/33/216/34

H. Phospholipase A2

Asp/His/Met/Gln/Tyr/Tyr (k = 19, n = 20)
1ae7, 1aok: 99/48/8/4/52/73; 1bun, 1cl5, 1clp:99/48/8/–/52/73; 1dpy:94/48/–/4/52/68, 89/47/8/–/51/

64; 1gmz:89/47/–/4/51/64, 99/48/8/–/52/73;1god:99/48/–/4/52/73;1ijl:89/47/–/4/51/64, 91/47/8/–/51/66;1kvo:93/47/8/–/51/67, 91/47/–/–/51/66;1poa:93/47/–/4/51/67; 1pp2:99R/48R/–/4R/52R/73R, 93A/47A/8A/–/51A/67A; 1psh:93/47/–/4/51/67; 1psj:99/48/–/4/52/73, 89/47/8/–/51/64;1qll:89/47/–/–/51/64;1vap:89/47/–/4/51/64, 99/48/8/–/52/73;1vip:99/48/–/4/52/73;2not:99/48/8/4/52/73;4bp2:99/48/–/4/52/73; Pattern not found in: 1poc

Asp/Asp/Asn/Thr/Tyr/Tyr/Tyr (k = 20, n = 20)
1ae7: 39/42/115/–/111/22/25; 1aok, 1clp, 1god, 1psj, 1vip, 1pp2: 39/42/109/41/113/22/25; 1bun:39/

42/–/–/106/22/25; 1cl5:39/42/109/41/113/22/25; 1dpy:39/42/110/–/106/22/25; 1gmz:38/41/99/40/103/21/25;: 1ijl, 1qll, 1vap:38/41/99/40/103/21/24; 1kvo:38/41/101/40/105/21/24; 1poa:38/41/108/–/105/–/24; 1poc:–/–/73/–/134/68; 1psh:38/41/–/–/105/–/24, 2not, 4bp2:39/42/115/–/111/22/25

I. Restriction endonuclease-like

Asp/Glu/Lys/Ser (k = 19, n = 21)
1f1z-a:114/63/132/60, 114/63/132/112; 1dmu-a:142/87/144/145(92); 1avq-a:119/85/131/117, 21/18/82/

117; 1kc6:127A/38A/129A/–, 111D/107D/108D/–, 114A/38A/129A/–, 165C/189C/–/2C, 216B/197B/–/229B, 165C/189C/–/193C, 216B/194B/–/229B, 172B/170B/–/136B; 1cfr-a:134/204/190/131, 116/112/120/–, 68/71/64/–, 276/80/–/85; 1fiu:140/70(201)/187/183; 1azo-a:151/109/118/–; 1dc1-a:177/181/173/148; 1rva: 90A/45A/92A/41A; 1d02-a:202/90/198(93)/7; 1fok:347/339/394/395, 421/425/469/–; 467/484/–/446, 85/48/–/446, 211/214/–/446; 1knv-a:48/51/44/72, 146/212/198/143, 117/119/187/190; 1hh1-a:61/63/60/117; 1gef:61A/46A/4B/29A, 33B/9B/48B/–; 1fzr:74A/71A/65A/17D/, 55C/20D/65C/17D;1d2i-a:84/93/–/97; 1ev7:86A/70A/59B/111A/,146B/110A/148A/111A; 1bam-a:94/111/61 (113)/123;1eri-a:99/103/98/159, 247/245/29/–, 91/111/–/146(39); Pattern not found in:1eyu,1vsr

J. Glutathione-S-transferase

Asp/Arg/Ser (k = 14, n = 27)
1k0m-a: 141/208/211; 1hqo-a: 145/120/121; 2gsr-a: 150/180/182, 150/184/182; 1gsy-a, 3gss-a: 152/182/184, 152/186/184; 1f3a-a: 156/186/188, 100/68/95; 1gsd-a: 157/187/189; 1guk-a, 1ev4-a: 157/187/189, 101/69/18(96); 1eem-a: 238/30/28; 1a0f-a: 24/19/21; 1gsu: 55B 139A 138A; 1fhe-a: 76/88/92; 1gta-a: 77/89/93; Pattern not found in: 1aw9-a, 1axd-a, 1e6b-a, 1f2e-a, 1fw1-a, 1g7o-a, 1gnw-a, 1hna-a, 1ljr, 1pd2, 1pmt, 2gsq-a, 3fyg

Asp/Lys/Tyr (k = 13, n = 27)
1g7o-a: 150/92/96; 1fhe-a: 159/179/155; 1gta-a: 160/180/156, 60/78/74; 1axd-a, 1ljr-a: 59/77/73(78); 1f3a-a:

60/77/73; 1aw9-a: 60/78/74; 1ev4-a, 1gsd-a,1guk-a: 61/78/74(79); 1gsu-a, 1hna-a: 64/82/78; 1e6b-a: 65/83/79(84); Pattern not found in: 1a0f-a, 1eem-a, 1f2e-a, 1fw1-a, 1gnw-a, 1gsy-a, 1hqo-a, 1k0m-a, 1pd2, 1pmt,2gsq-a, 2gsr-a, 3fyg, 3gss-a

Glu/Arg/Tyr (k = 12, n = 27)
1aw9-a:103/69/108(172); 1f2e-a:139/143/181; 2gsq-a:15/197/157; 1guk-a:17/20/9; 1hna-a:29/17/6; 1gsy-a,

2gsr-a, 3gss-a::30/18/7; 1f3a-a:31/19/8; 1ev4-a, 1gsd-a::32/20/9; 1ljr:97A 94A 73B Pattern not found in:1a0f-a, 1axd-a, 1e6b-a, 1eem-a, 1fhe-a, 1fw1-a, 1g7o-a, 1gnw-a, 1gsu-a, 1gta-a, 1hqo-a, 1k0m-a, 1pd2, 1pmt, 3fyg

K. SH3 domain

Asp/Asn/Ser/Trp/Trp/Tyr/Tyr (k = 19, n = 27)
1neb-a:11/–/–/–/–/12/56; 1lck-a:–/114/–/97/98/–/–; 1pht-a:13/–/–/–/–/12/73; 2pni-a:13/–/–/–/–/14/73; 1shg-a:14/–/–/–/–/13/57; 1qwe-a:15/59/58/42/43/14/60; 1awx-a:16/59/–/43/44/15/60; 1sem-a:–/206/205/191/192/–/–; 1ycs-a:–/513/–/498/499/–/–; 1i0c-a:–/57/–/40/41/–/–; 1hsq-a:–/57/–/41/42/–/–; 1gcp-a:–/653/–/636/637/–/–; 1abo-a:71/114/113/99/–/115/118; 1gri-a:8/208/–/193/194/–/–; 2abl-a:90/133/132/118/–/89/134; 1ad5-a, 1fmk-a:91/135/134/118/119/136/90; 2ptk-a:91/135/134/118/–/136/90; 1shf-a:92/136/135/119/120/137/91; Pattern not found in: 1awj, 1bb9, 1cka, 1gbr, 1gl5, 1i1j, 1jeg, 1kjw

Refer to the footnote to Table 2 for RMSD and deviation cutoffs, reporting conventions and protein sets.


from the cupredoxins superfamily from the SCOP database30 shows Cys/His/His/Met/Asn as the most frequently recurring pattern, being present in 28 proteins. This reveals Asn as an additional residue in the copper-binding site, beyond the reported copper coordination site; the Asn takes up an axial position on the opposite side of the known axial ligand Met (Figure 9(f)). It is possible that the Asn residue plays a role as a weak axial ligand of the copper. In the primary sequence, the Cys/His/His/Met/Asn pattern follows the motif H–N–X(38–44)–C–X(2–4)–H–X(2–4)–M, with the Asn lying on the β strand. It is possible that some proteins contain a Cys/His/His/Leu or Cys/His/His/Phe pattern, with Leu or Phe in the axial ligand position. However, since the DRESPAT program ignores hydrophobic residues, these potential patterns are not detected.
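The sequence motif quoted above can be written as a regular expression over one-letter amino acid codes. The sketch below reads each X term as a run of arbitrary residues whose length lies in the stated range, which is the usual convention but is an assumption here. The names `CUPREDOXIN_MOTIF` and `find_motif` are hypothetical, and the test sequence is synthetic rather than a real protein.

```python
import re

# H-N-X(38-44)-C-X(2-4)-H-X(2-4)-M as a regular expression,
# where "." stands for any residue.
CUPREDOXIN_MOTIF = re.compile(r"HN.{38,44}C.{2,4}H.{2,4}M")

def find_motif(sequence):
    """Return (start, end) spans of non-overlapping motif matches."""
    return [m.span() for m in CUPREDOXIN_MOTIF.finditer(sequence)]

# Synthetic sequence (not a real protein): spacers of 40, 3 and 3 residues.
toy = "GG" + "HN" + "A" * 40 + "C" + "A" * 3 + "H" + "A" * 3 + "M" + "GG"
spans = find_motif(toy)

# A spacer of only 10 residues between Asn and Cys falls outside the motif.
too_short = find_motif("HN" + "A" * 10 + "CAAAHAAAM")
```

A sequence-level match of this kind says nothing about geometry, so it complements rather than replaces the structural pattern search.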

Asp/Glu/Glu/Glu/His/His metal co-ordination site in ferritin

The ferritin-like superfamily contains the ferritin and ribonucleotide reductase families (SCOP). Ferritins are iron storage proteins that oxidize Fe2+ to Fe3+,80 while ribonucleotide reductases catalyze reduction of all four ribonucleotides and contain a dinuclear iron center as a part of the catalytic machinery. There is an increasing interest in the development of inhibitors of ribonucleotide reductases as anticancer, antibacterial and antiviral drugs. Our results show an Asp/Glu/Glu/Glu/His/His pattern conserved in 13 of the 20 proteins in this superfamily. The amino acid positions match the reported iron ligands (Figure 9(g)).80–83 Interestingly, the pattern was observed only in the structures that are known to bind iron. Other members of the superfamily, such as 1mfr or 1exs, do not contain this pattern and were probably assigned to this SCOP superfamily on the basis of overall fold similarity rather than “iron-binding” function.

Cys/Cys/His/Met/Tyr heme co-ordination site in cytochrome c

Class I c-type cytochromes, functionally diverse redox proteins of 80–130 amino acid residues, are characterized by a common N-terminal heme co-ordination sequence of C–X2–C–H, with His and Met residues as extra-planar ligands.84 In both prokaryotes and eukaryotes, cytochrome c

Table 4. Recurring patterns observed in small protein sets with less than 15 proteins

Structural pattern PDB identification numbers with the residue positions for the detected pattern

A. Zinc finger

Cys/Cys/His/His (k = 14, n = 14)
1ard-a: 106/109/122/126; 1znf-a: 3/6/19/23; 1yui-a: 36/39/52/57; 1zfd-a: 44/49/62/66; 1sp2-a: 5/

10/23/27; 5znf-a: 5/8/21/26; 4znf-a: 5/8/21/27; 1bhi-a: 9/14/27/31; 1rmd-a: 91/96/108/112; 2drp-a:113/116/129/134, 143/146/159/164: 1zaa-c: 7/12/25/29, 37/40/53/57, 65/68/81/85; 1ubd-c: 298/303/316/320, 327/330/343/347, 355/360/373/377, 385/390/403/407; 2gli-a: 106/111/124/129, 139/144/160/164, 172/177/190/194, 202/207/220/225, 233/238/251/256: 1tf6-a: 15/20/33/37, 45/50/63/67, 75/80/93/98, 107/112/125/129, 137/142/155/159, 164/170/183/188

B. Ferredoxin-like

Cys/Met/Ser/Thr/Tyr (k = 12, n = 14)
1ndh-a:175/148/145/142(173); 1i7p-a:203/176/173/201/–; 1a8p-a:219/224/221(114)/117/–, 219/

217/54/98/–, 229/224/223/193/–; 1que-a:261/266/80/157/303; 1qg0-a:266/271/90/166/308;1ep1:26A/247A/24A/198A/–, 67B/33B/59B/66B/–, 67B/23B/12B/9B/–; 1fnb-a, 1gaq-a:272/277/96/172(170)/314; 1jb9-a:274/279/94/175/316; 2pia-a:280/306/279(271)/276/–, 308/306/270(271)(279)/–; 1fb3-a:320/325/144/220/362; 1cqx-a:368/281(284)/209(273)/279/–; 1amo-a:445/346/347/366 (369)/–; 1i8d:48C/1C/146A(41C)/3A(148A); 1ddg-a:552/558/389/462/599; Pattern notfound in: 1qfj, 2cnd

Cys/Asp/Glu/His/Arg/Tyr (k = 7, n = 14)
1que-a: 98/88/204/42/163/201; 1amo-a: 472/–/578/302/290/575; 1fb3-a: 162/153/263/107/226/

260; 1fnb-a: 114/104/215/59/178/212; 1gaq-a: 114/–/215/59/178/212; 1jb9-a: 112/102/218/55/181/215; 1qg0-a: 108/98/209/53/172/206/, Pattern not found in: 1a8p, 1cqx, 1ddg, 1ep1, 1i7p, 1i8d, 1ndh,1qfj, 2cnd, 2pia

C. Peroxidase

Asp/His/Ser/Thr/Tyr (k = 7, n = 14)
1bgp-a: 250/179/176/180/236; 2atj-a: 247/170/167/171/233; 1fhf, 1qo4: 246/169/166/170/232; 1qgj:

242/165/162/166/228; 1sch: 239/169/166/170/225; 1apx: 208/163/160/164/190; Pattern not foundin: 1aru, 1cx2, 1llp, 1mhl, 1mnp, 1myp, 1pth, 2cyp

Asp/Asp/His/Arg/Ser (k = 6, n = 14)
1bgp-a: 50/57/47/132/105; 1fhf-a, 1qo4-a, 1sch-a, 2atj-a, 1qgj-a: 43/50/40/123/96; Pattern not found in: 1apx, 1aru, 1cx2, 1llp, 1mhl, 1mnp, 1myp, 1pth, 2cyp

Refer to footnote to Table 2 for RMSD and deviation cutoffs, reporting conventions and protein sets.


functions to transport electrons between two integral membrane energy-transducing complexes. The amino acid positions of the observed recurring pattern match the reported heme co-ordinating site (Figure 9(h)). However, a role for the Tyr residue has not been reported.

Asp/Asp/Asp/Glu/His/Arg/Tyr active site in α-amylases

Despite differences in their amino acid sequences, members of this superfamily have similar 3D structures, with three domains.85 α-Amylases have two conserved Asp residues and one Glu residue that are considered to be the catalytic residues on the basis of structural and site-directed mutagenesis studies. We detected the Asp/Asp/Asp/Glu/His/Arg/Tyr pattern, in which the residue positions match those reported for the active-site Asp, Glu and His residues.85–87 Of these, Glu acts as a general acid, while one of the Asp residues acts as a general base (Figure 9(i)). The pattern detected in the PDB structure 1bag, which is the X-ray crystal structure of a catalytic-site mutant (Glu208 → Gln) of α-amylase from Bacillus subtilis, lacked the Glu residue.88 Two His residues are reported to be located around the active site, with at least one of them involved in hydrogen bonding with the catalytic Asp.88 The Tyr has been implicated in stacking interactions with the glucose moiety adjacent to the scissile glucosidic bond,88 while the roles of the other active-site residues are not well defined.

Asp/Asp/Thr/Thr catalytic tetrad in aspartoyl proteases

Acid proteases or aspartoyl proteinases are characterized by the presence of two Asp residues at the active site in a complex network of hydrogen bonds with other residues. Acid proteases can be broadly divided into two groups: the pepsin-like and the retroviral enzymes.53 The pepsin-like enzymes are monomers containing between 320 and 360 amino acid residues, with a 2-fold symmetry between the two lobes of the molecule.53 The retroviral enzymes have between 99 and 150 amino acid residues and exist as dimers in their active form.53 The retroviral enzymes are of significant interest as therapeutic targets. The Asp/Asp/Thr/Thr pattern was detected in 24 of the 28 proteins. This tetrad is composed of Asp(x)-Thr(x+1) and Asp(y)-Thr(y+3) dyads separated by ca 183 residues in pepsin-like enzymes (Figure 9(j)). In retroviral enzymes, the two dyads are from two different monomers. This active-site structure composed of two dyads has been well documented.53,89 These results show that DRESPAT is able to extract the functionally relevant pattern as the most frequently recurring pattern even in a highly conserved family of aspartoyl proteinases.

An active site and a structural site in phospholipase A2

Phospholipases A2 (PLA2s) are low molecular mass enzymes containing between 110 and 125 amino acid residues. They are distributed widely in the animal world and have been classified into groups I and II.90–92 The enzymes from the second group have a C-terminal extension of five to seven amino acid residues. PLA2s vary considerably in their primary and quaternary structure, in their toxicity and in their enzyme activity. PLA2s share a common qualitative catalytic property but differ greatly in their pharmacological properties. Yet, a significant degree of homology is observed between tertiary structures of various PLA2s, which typically are made of three helices, two short helices, a β-wing and six to seven loops.90 The catalytic site has been reported to comprise one Asp, one His and two Tyr residues.93 We observed an Asp/His/Met/Gln/Tyr/Tyr pattern recurring in 19 of the 20 proteins of this superfamily, with residue positions matching the reported active site.89–91 The conserved Gln has been implicated in hydrogen bonding with the catalytic Tyr (Figure 9(k)).94 The role of the detected Met residue has not been reported. This pattern was not detected in the insect PLA2 (PDB code 1poc). In addition, we detected an Asp/Asp/Asn/Thr/Tyr/Tyr/Tyr pattern that lies away from both the active site and the calcium-binding site and is conserved in all 20 of the proteins (Figure 9(l)). The pattern draws three residues from helix 2 (Asp38, Thr40 and Asp42), two residues from loop 2 (Tyr22 and Tyr25), one residue from short helix 5 (Tyr113) and one residue from the C-terminal end (Tyr118) (numberings for 1vip). Such an arrangement, with possible stacking interactions among the Tyr residues, is potentially important for maintaining the structure of PLA2s.

Asp/Glu/Lys/Ser active site in restriction endonucleases

Type II restriction endonucleases usually recognize a short palindromic sequence between 4 bp and 8 bp in length. In the presence of Mg2+, hydrolysis of a phosphodiester bond is catalyzed at a precise location within or close to this sequence.95 Structural studies have revealed the surprising fact that, despite the absence of readily detectable sequence similarity, all type II enzymes fold into one of two topologies defined by the original examples, Eco RI and Eco RV.96 On the basis of the structures of enzyme–DNA complexes and protein primary sequence, a conserved cluster of Asp, Glu and Lys is thought to be involved in the active site. These residues come from the loosely conserved DXn(E/D)ZK motif (where Z is a hydrophobic residue). We detect an Asp/Glu/Lys/Ser pattern in 18 of the 21 structures. The residue positions match with the reported active-site residues for several enzymes, such as Bgl I (PDB code 1dmu95), Bam HI (PDB code 1bam97), Fok I (PDB code 1fok), Eco RV (PDB code 1rva) and Bse634I (PDB code 1knv98). However, the detected positions disagreed completely with the reported active-site residue positions for some proteins, such as Bso BI (PDB code 1dc196) and Mun I (PDB code 1d02). The reported sites for Bso BI and Mun I are based on a putative DXn(E/D)ZK ion-binding motif and superimposition of overall structures rather than experimental detection of a Mg2+-binding site. Our detection, on the other hand, is based on structural superposition of the residues that comprise the pattern.
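For contrast with the structure-based detection, the sequence-level DXn(E/D)ZK motif mentioned above can be scanned for with a regular expression. The sketch below is only illustrative: the spacer bounds `n_min` and `n_max`, the residue set taken for the hydrophobic position Z, and the toy sequence are all assumptions, since the text does not specify them.

```python
import re

# Assumption: the hydrophobic position "Z" is taken to be one of these residues.
HYDROPHOBIC = "AVLIMFWC"


def find_motif(sequence, n_min=5, n_max=30):
    """Return (start, matched_text) for each D-X(n)-(E/D)-Z-K match.

    n_min and n_max are illustrative bounds on the variable spacer Xn;
    the motif as described in the text leaves n unspecified.
    """
    pattern = re.compile(r"D.{%d,%d}[ED][%s]K" % (n_min, n_max, HYDROPHOBIC))
    return [(m.start(), m.group()) for m in pattern.finditer(sequence)]


# Hypothetical toy sequence, not taken from any PDB entry.
toy = "MK" + "D" + "A" * 10 + "ELK" + "G"
hits = find_motif(toy)
```

A sequence scan of this kind reports only where the motif could occur in the primary sequence. It carries no information about whether the matched residues are spatially clustered, which is what the structural pattern comparison addresses.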

Novel conserved sites in glutathione-S-transferase

Glutathione-S-transferases (GST) are a family of multifunctional enzymes that metabolize a wide variety of electrophilic compounds via GSH conjugation.99 GSTs have been grouped into several classes, namely Alpha, Mu, Pi, Theta, Sigma, Kappa and Zeta, on the basis of their physical, chemical, immunological and structural properties. Despite the low level of sequence identity between classes (often less than 20%), the 3D structures of enzymes from various classes are very similar and are typically present as homodimers. The active site of a Mu class GST is reported to consist of Asp105, Ser72 and Asn58,100 while that of a Beta class GST comprises Cys10, His105 and Lys107.99 We did not detect any of the reported active sites in the recurring patterns. Three distinct recurring patterns were observed among the 27 proteins; namely, Asp/Arg/Ser in 14 proteins, Asp/Lys/Tyr in 13 proteins and Glu/Arg/Tyr in 12 proteins. The Asp/Lys/Tyr pattern (Figure 9(m)) appears at the protein–protein interface of the two dimers and could be important in maintaining the inter-subunit contacts. The Asp/Arg/Ser (Figure 9(n)) and the Glu/Arg/Tyr (Figure 9(o)) patterns are away from the protein–protein interface and could be examined as further potential functional sites.

Ligand-binding sites in the SH3 domain

The SH3 domain, a ubiquitous protein module with approximately 60 residues, is present in many signaling proteins, regulatory proteins and structural proteins.101 Through binding to Arg, Leu or Pro-rich motifs in target proteins, SH3 proteins are able to assemble multimeric signaling complexes. The NMR structure of a complex between the SH3 domain and a model peptide reveals the peptide-binding site of SH3 to be composed of two Tyr residues, one Trp residue and one Phe residue.102 We detect the Asp/Asn/Ser/Trp/Trp/Tyr/Tyr patterns in 19 of the 27 proteins. The residues of the recurring patterns match with the reported ligand-binding subsites of the SH3 domain (Figure 9(p)).101–103 In fact, the detected patterns span three to four subsites of the proposed ligand-binding sites.

Small protein sets (with 15 or fewer proteins) (Table 4)

Cys/Cys/His/His metal coordination sites in zinc finger proteins

Zinc finger proteins constitute one of the biggest families of transcription factors and can be divided into many subclasses on the basis of, among other things, the number and type of zinc fingers they contain. In proteins containing zinc finger motifs, the cysteine and/or histidine metal ligands within the motif fold in a tetrahedral configuration around a central zinc ion.104,105 Members of the classical zinc finger superfamily (SCOP nomenclature30) each contain one to six tandem Cys2-His2 zinc finger motifs.106–110 The DRESPAT program detected the Cys/Cys/His/His structural pattern in all 14 representative PDB structures, which is the zinc coordination site known to be specific to the classical zinc finger (Figure 9(q)).111,112 In fact, the pattern recurs at one to six distinct places within a chain of zinc finger protein. The number of recurrences of the pattern was found to be equal to the number of zinc ions bound to the given chain. For example, the PDB file 2GLI contains the Cys/Cys/His/His pattern in five places and has five zinc-binding sites.

Cys/Met/Ser/Thr/Tyr active site in ferredoxin

Ferredoxin (flavodoxin)-NADP+ oxidoreductase catalyzes the reversible transfer of electrons between pyridine nucleotides and flavodoxins or ferredoxins.113 The monomeric molecule is divided into two distinct domains, the FAD-binding and NADP-binding domains, which are connected by only one covalent connection,113,114 and this is true of all ferredoxins. Members of this superfamily share a low level of sequence similarity but a high level of structural similarity. Certain Arg and Glu residues are proposed to be part of the FAD-binding site, while Cys and Ser residues are reported to be in the active site. We observe a Cys/Met/Ser/Thr/Tyr pattern in seven proteins, while a subset pattern, Cys/Met/Ser/Thr, occurs in 12 of the 14 proteins in the input set (Figure 9(r)). Of these, the Cys, Ser and Thr residue positions match with the proposed active site of spinach ferredoxin reductase.115 The Met and Tyr residues have not been reported to participate in the active site. An independent pattern, Cys/Asp/Glu/His/Arg/Tyr, was present in seven proteins. The amino acid residues of this pattern lie in the groove between the two domains of the monomeric protein. This region could be important in maintaining interactions between the two domains and potentially critical for activity. These residues have not been reported to participate in either the catalytic machinery or co-factor binding. On the other hand, the reported co-factor binding residues, such as Tyr235, Gln237, Ser223, Arg224, Arg233 and Tyr235 (1que numbering114), have not been found to be part of any of the significant recurring patterns in our studies.

The proximal heme-binding and distal cation-binding sites in peroxidases

The majority of known peroxidases belong to the plant peroxidase superfamily, which is characterized by a central heme group sandwiched between a distal and a proximal protein domain. The proximal heme side of peroxidases typically has His as a proximal ligand of the heme iron, where the His is hydrogen bonded to the invariant proximal Asp.116 The Asp is, in turn, hydrogen bonded to a Ser residue and a Thr residue. In addition, two cation-binding sites are found in peroxidases: the proximal cation-binding site and a distal cation-binding site. The latter site is characterized by two Asp residues. We detect two recurring patterns. The Asp/His/Gln/Ser/Thr/Tyr pattern is located in the proximal heme-binding site, with the residue positions of Asp, His, Ser and Tyr matching with the reported active-site residues (Figure 9(s)),116,117 while the Asp/Asp/His/Arg/Ser pattern is located in the distal cation-binding site, with the two Asp residue positions in agreement with the reported distal cation ligands (Figure 9(t)).

Discussion

This is the first report of an objective analysis of recurring structural patterns in protein families. The attempt to enumerate all patterns of a protein is a novel feature of the approach. The fact that the method does not require a user-defined pattern, location or even size of the pattern is a significant improvement over the methods reported to date. Contrary to the claims made by Russell,37 we show clearly that it is not feasible to extract biologically meaningful patterns from an evolutionarily related protein pair. Comparison of a single pair results in a high false positive rate at RMSD cut-off values of ca 1 Å, with a large number of patterns detected as common to the protein pair. On the other hand, a high false negative rate is observed at RMSD cut-off values of ca 0.1 Å, as this RMSD tolerance is far below the typical resolution of an X-ray crystallographic or NMR structure. A correlation between the recurring pattern and biological site was observed for protein sets with five or more proteins.
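The effect of the cut-off on a pair-wise comparison can be illustrated with a small sketch. It uses a distance-based RMSD (the RMSD between the two sets of inter-residue distances) as a superposition-free stand-in, which is an assumption made here for brevity and may differ from the RMSD calculation used in DRESPAT. The function names and coordinates are hypothetical.

```python
import math
from itertools import combinations


def pairwise_distances(coords):
    """All inter-point distances for a pattern, in a fixed pair order."""
    return [math.dist(a, b) for a, b in combinations(coords, 2)]


def distance_rmsd(pattern_a, pattern_b):
    """RMSD between the inter-residue distance lists of two patterns.

    Assumes the residues of the two patterns are already listed in
    corresponding order (one representative point per residue).
    """
    da = pairwise_distances(pattern_a)
    db = pairwise_distances(pattern_b)
    if len(da) != len(db) or not da:
        raise ValueError("patterns must have the same size (at least 2 residues)")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(da, db)) / len(da))


def is_similar(pattern_a, pattern_b, cutoff):
    """Declare two patterns similar when their distance RMSD is within cutoff (in Å)."""
    return distance_rmsd(pattern_a, pattern_b) <= cutoff


# Hypothetical triads: one point is displaced by 0.5 Å in the second triad.
triad_a = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 4.0, 0.0)]
triad_b = [(0.0, 0.0, 0.0), (3.5, 0.0, 0.0), (0.0, 4.0, 0.0)]

loose = is_similar(triad_a, triad_b, cutoff=1.0)   # accepted at ca 1 Å
tight = is_similar(triad_a, triad_b, cutoff=0.1)   # rejected at ca 0.1 Å
```

In this toy case a displacement well within typical coordinate error is accepted by the loose cut-off and rejected by the tight one, which mirrors the false positive and false negative behaviour described above.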

Our detection of patterns is biased, as we have used 13 charged/polar amino acids, Asp, Asn, Arg, Glu, Lys, Ser, Trp, Cys, Gln, His, Met, Thr, Tyr, and ignored the hydrophobic amino acids. Thus, we envisage the number of patterns detected to be strongly dependent on the amino acid composition of the protein in addition to the size of the protein. This is evident from the wide scatter observed while plotting the number of patterns against the number of amino acid residues in the protein (Figure 5). In addition, our bias in selection of amino acids has led to detection of patterns close to the protein surface as we ignore the hydrophobic amino acids, which typically make up the hydrophobic core of the protein.
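The residue selection described above amounts to a simple filter applied before patterns are enumerated. A minimal sketch follows, assuming residues are given as (three-letter name, residue number) pairs; this data layout and the example fragment are hypothetical and are not taken from the DRESPAT implementation.

```python
# The 13 charged/polar residue types listed in the text.
POLAR_OR_CHARGED = frozenset({
    "ASP", "ASN", "ARG", "GLU", "LYS", "SER", "TRP",
    "CYS", "GLN", "HIS", "MET", "THR", "TYR",
})


def candidate_residues(residues):
    """Keep only residues whose type may take part in a pattern."""
    return [(name, number) for name, number in residues
            if name.upper() in POLAR_OR_CHARGED]


# Hypothetical fragment: hydrophobic residues (Leu, Val, Ala) are dropped.
fragment = [("LEU", 10), ("SER", 11), ("VAL", 12), ("HIS", 13), ("ALA", 14), ("ASP", 15)]
kept = candidate_residues(fragment)
```

Because only the retained residues can become graph nodes, a protein rich in hydrophobic residues contributes fewer candidate patterns than a polar protein of the same length, consistent with the scatter noted above.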

The plot of statistical significance of a recurrence frequency of a pattern (Figure 7) provides a basis for selecting the patterns for further analysis. Varying the RMSD cutoff value between 0.5 Å and 1.3 Å did not affect the results significantly (data not shown). With higher RMSD cutoff values, a larger number of recurring patterns was detected, which leads to lower statistical significance for a given recurrence frequency of the pattern. Thus, the statistical significance of a given recurring pattern did not change noticeably upon changing the RMSD cutoff between 0.5 Å and 1.3 Å, although the recurrence frequency changed slightly. However, at RMSD cut-off values of less than 0.5 Å, a significant false negative rate was observed with several known similar pattern pairs returned with a “not similar” value. Also, in some cases, a reported active site is shared by a small number of PDB entries belonging to a larger family. Due to our strategy of reporting patterns with recurrence frequency above a certain threshold, biologically significant patterns with a low frequency of recurrence are not reported. An example of a missed pattern is the Ser/His/Glu active-site pattern shared by a relatively small number of lipases, while we detect the Ser/His/Asp pattern, which is present in a majority of the lipases. This can be viewed as a potential drawback of the method.
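The consequence of reporting only patterns above a recurrence threshold can be shown with a toy count. The protein identifiers, the pattern assignments and the threshold value below are all hypothetical, and the sketch treats "above a certain threshold" as "at or above", which is an assumption.

```python
from collections import Counter


def recurrence_frequency(patterns_by_protein):
    """Count, for each pattern, the number of proteins in which it occurs."""
    counts = Counter()
    for patterns in patterns_by_protein.values():
        counts.update(set(patterns))  # a pattern counts once per protein
    return counts


def reported_patterns(counts, threshold):
    """Patterns whose recurrence frequency reaches the threshold."""
    return {pattern: n for pattern, n in counts.items() if n >= threshold}


# Hypothetical lipase-like set: four members share Ser/His/Asp, one has Ser/His/Glu.
toy_set = {
    "lipA": [("Asp", "His", "Ser")],
    "lipB": [("Asp", "His", "Ser")],
    "lipC": [("Asp", "His", "Ser")],
    "lipD": [("Asp", "His", "Ser")],
    "lipE": [("Glu", "His", "Ser")],
}
counts = recurrence_frequency(toy_set)
reported = reported_patterns(counts, threshold=3)
```

Here the minority Ser/His/Glu triad falls below the threshold and is not reported, even though it is a genuine active site in the protein that carries it.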

The results have been broken into three artificial categories on the basis of the number of non-redundant PDB structures available in each SCOP superfamily. We have detected biologically relevant patterns in all 17 of the protein sets, irrespective of the size of the protein set. We present an analysis of the minimum number of PDB structures required in the protein set to detect the golden pattern. From three to five PDB structures enable retrieval of the golden pattern in most cases (data shown for two protein sets). Distinguishing between patterns of biological significance and random occurrences has been relatively straightforward for the test families used here. We observe that the statistically significant recurring tetrads, pentads and hexads typically correspond to the functionally relevant sites. This was not necessarily true about the triads. For example, in cysteine proteases, several triads were observed with recurrence frequency greater than that for the active site Cys/His/Asn(Asp). On the other hand, the biologically relevant Cys/His/Asn(Asp)/Gln/Trp and Cys/His/Asn(Asp)/Gln patterns were the most recurring pentads and tetrads, respectively. Similarly, in zinc finger proteins, the Cys/Cys/His/His pattern was the only tetrad recurring in all of the 14 representative members of the family. Likewise, Cys/His/His/Met/Asn was the only pentad with a recurrence frequency much above the threshold value in the cupredoxins. This is true of all the protein sets presented here. The ferritin, EF-hand and lectin superfamilies of the SCOP database are characterized by their metal ion-binding abilities. Several proteins from each of these superfamilies have lost the ion-binding capability despite overall structural similarities. Based on the recurring patterns, we could distinguish proteins or domains that bind metal ions from those that do not.

Due to our strategy of evaluating all possible patterns of three to six amino acid residues, when a recurring hexad or pentad is observed, all possible subsets of this pattern are detected as recurring patterns. Therefore, we chose to analyze the DRESPAT results in descending order of the size of the pattern. As an example, the reported functional site pattern in cysteine proteases is a triad, but we detected a tetrad and pentad with much greater statistical significance.
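The largest-first reading of the results can be sketched as follows. This is a simplified illustration that compares patterns by residue content only, treating each pattern as a multiset of residue types; the actual analysis would also need the residue positions and geometry to agree, and the function names are hypothetical.

```python
from collections import Counter


def is_sub_pattern(small, large):
    """True if `small` is a proper sub-multiset of `large` (content only)."""
    need, have = Counter(small), Counter(large)
    return len(small) < len(large) and all(have[r] >= n for r, n in need.items())


def largest_first(patterns):
    """Visit patterns in descending size, skipping subsets of patterns already kept."""
    kept = []
    for pattern in sorted(patterns, key=len, reverse=True):
        if not any(is_sub_pattern(pattern, k) for k in kept):
            kept.append(pattern)
    return kept


# Cysteine protease example from the text plus one unrelated triad.
recurring = [
    ("Cys", "His", "Asn"),
    ("Cys", "His", "Asn", "Gln"),
    ("Cys", "His", "Asn", "Gln", "Trp"),
    ("Asp", "Lys", "Tyr"),
]
summary = largest_first(recurring)
```

With this ordering the pentad is examined first, and the tetrad and triad contained in it are not listed again as separate findings.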

In all of the 17 protein families considered here, the known functional sites have been detected as the most frequently recurring patterns. The DRESPAT method has uncovered additional recurring residues in the vicinity of the known functional sites (Figure 8). In serine proteases and lipases, the Thr residue detected is within hydrogen bonding distance of the oxyanion hole and probably plays a role in transition-state stabilization. In cysteine proteases, the results show a conserved Gln residue in the vicinity of the catalytic triad. This residue has been implicated in transition-state stabilization in some of the reported structures.75 Similarly, in blue copper proteins, the Asn residue appears to take up an axial position in the copper coordination site and probably serves as the fifth ligand for the copper ion. It is possible that the Asn residue plays a role in the long-range electron transfer, which is a characteristic feature of cupredoxins. Neither the Thr residue detected in serine proteases and lipases nor the Asn residue detected in copper proteins has been implicated in function. Energy calculations, theoretical studies of the patterns and mutagenesis studies of the newly uncovered recurring residues would provide more insights into the functional role of these residues.

With the exceptions of the zinc finger, cupredoxin, EF-hand, and cytochrome c, the detected recurring structural patterns do not follow any sequence motif at the level of protein primary amino acid sequence. Thus, it would not be possible to detect such patterns on the basis of sequence alignment alone. Also, the patterns reported here cannot be detected using the algorithms that are meant for overall structure comparison and alignment. The method can be applied in identifying functional sites and predicting function, as well as in classification of proteins based on recurring patterns. Future improvements of the method could include detection of patterns composed of main-chain atoms in addition to the side-chain atoms as described here, as several main-chain nitrogen or oxygen atoms are known to play a role in catalysis.

Availability of the computer program

A C++ implementation of the DRESPAT program can be obtained from the authors.

References

1. Terwilliger, T. C. (2000). Structural genomics in North America. Nature Struct. Biol. 7, 935–939.

2. Grishaev, A. & Llinas, M. (2002). Protein structure elucidation from NMR proton densities. Proc. Natl Acad. Sci. USA, 99, 6713–6718.

3. Hus, J.-C., Marion, D. & Blackledge, M. (2000). De novo determination of protein structure by NMR using orientational and long-range order restraints. J. Mol. Biol. 298, 927–936.

4. Ohki, S., Eto, M., Kariya, E., Hayano, T., Hayashi, Y., Yazawa, M. et al. (2001). Solution NMR structure of the myosin phosphatase inhibitor protein CPI-17 shows phosphorylation-induced conformational changes responsible for activation. J. Mol. Biol. 314, 839–849.

5. Zheng, W. & Doniach, S. (2002). Protein structure prediction constrained by solution X-ray scattering data and structural homology identification. J. Mol. Biol. 316, 173–187.

6. Panchenko, A., Marchler-Bauer, A. & Bryant, S. H. (1999). Threading with explicit models for evolutionary conservation of structure and sequence. Proteins: Struct. Funct. Genet., 133–140.

7. Lee, M. R., Tsai, J., Baker, D. & Kollman, P. A. (2001). Molecular dynamics in the endgame of protein structure prediction. J. Mol. Biol. 313, 417–430.

8. de la Cruz, X., Sillitoe, I. & Orengo, C. (2002). Use of structure comparison methods for the refinement of protein structure predictions. I. Identifying the structural family of a protein from low-resolution models. Proteins: Struct. Funct. Genet. 46, 72–84.

9. Borchert, T. V., Abagyan, R. A., Kishan, K. V. R., Zeelen, J. Ph. & Wierenga, R. K. (1993). The crystal structure of an engineered monomeric triosephosphate isomerase, monoTIM: The correct modeling of an eight-residue loop. Structure, 1, 205–213.

10. Baker, D. & Sali, A. (2001). Protein structure prediction and structural genomics. Science, 294, 93–96.

11. Kihara, D., Zhang, Y., Lu, H., Kolinski, A. & Skolnick, J. (2002). Ab initio protein structure prediction on a genomic scale: application to the Mycoplasma genitalium genome. Proc. Natl Acad. Sci. USA, 99, 5993–5998.

12. Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N., Weissig, H. et al. (2000). The Protein Data Bank. Nucl. Acids Res. 28, 235–242.

13. Bernstein, F. C., Koetzle, T. F., Williams, G. J. B., Meyer, E. F., Brice, M. D., Rodgers, J. R. et al. (1977). The Protein Data Bank: a computer-based archival file for macromolecular structures. J. Mol. Biol. 122, 535–542.

14. Maggio, E. T. & Ramnarayan, K. (2001). Recent developments in computational proteomics. Trends Biotechnol. 19.

15. Kelley, L. A., MacCallum, R. M. & Sternberg, M. J. E. (2000). Enhanced genome annotation using structural profiles in the program 3D-PSSM. J. Mol. Biol. 299, 510–522.

16. Aloy, P., Querol, E., Aviles, F. X. & Sternberg, M. J. E. (2001). Automated structure-based prediction of functional sites in proteins: applications to assessing the validity of inheriting protein function from homology in genome annotation and to protein docking. J. Mol. Biol. 311, 395–408.

17. Elcock, A. H. (2001). Prediction of functionally important residues based solely on the computed energetics of protein structure. J. Mol. Biol. 312, 885–896.

18. Blom, N., Gammeltoft, S. & Brunak, S. (1999). Sequence and structure-based prediction of eukaryotic protein phosphorylation sites. J. Mol. Biol. 294, 1351–1362.

19. Rutenber, E., Fauman, E. B., Keenan, R. J., Fong, S., Furth, P. S. & Ortiz de Montellano, P. R. (1993). Structure of a non-peptide inhibitor complexed with HIV-1 protease. Developing a cycle of structure-based drug design. J. Biol. Chem. 268, 15343–15346.

20. Skolnick, J. & Fetrow, J. S. (2000). From genes to protein structure and function: novel applications of computational approaches in the genomic era. Trends Biotechnol. 18, 34–39.

21. Norwell, J. C. & Machalek, A. Z. (2000). Structural genomics programs at the US National Institute of General Medical Sciences. Nature Struct. Biol. 7, 931.

22. Yokoyama, S., Hirota, H., Kigawa, T., Yabuki, T., Shirouzu, M., Terada, T. et al. (2000). Structural genomics projects in Japan. Nature Struct. Biol. 7, 943–945.

23. Chothia, C. & Finkelstein, A. V. (1990). The classification and origins of protein folding patterns. Annu. Rev. Biochem. 59, 1007–1039.

24. Holm, L. & Sander, C. (1993). Protein structure comparison by alignment of distance matrices. J. Mol. Biol. 233, 123–138.

25. Grindley, H. M., Artymiuk, P. J., Rice, D. W. & Willett, P. (1993). Identification of tertiary structure resemblance in proteins using maximal subgraph isomorphism algorithm. J. Mol. Biol. 229, 707–721.

26. Gibrat, J., Madej, T. & Bryant, S. H. (1996). Surprising similarities in the structure comparison. Curr. Opin. Struct. Biol. 6, 377–385.

27. Murzin, A. G. (1992). Familiar strangers. Nature, 360, 635.

28. Pearl, F. M. G., Lee, D., Bray, J. E., Sillitoe, I., Todd, A. E., Harrison, A. P. et al. (2000). Assigning genomic sequences to CATH. Nucl. Acids Res. 28, 277–282.

29. Murzin, A. G., Brenner, S. E., Hubbard, T. & Chothia, C. (1995). SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247, 536–539.

30. LoConte, L., Ailey, B., Hubbard, T. J. P., Brenner, S. E., Murzin, A. G. & Chothia, C. (2000). SCOP: a structural classification of proteins database. Nucl. Acids Res. 28, 257–259.

31. Holm, L. & Sander, C. (1997). New structure: novel fold? Structure, 5, 165–171.

32. Holm, L. & Sander, C. (1998). Dictionary of recurrent domains in protein structures. Proteins: Struct. Funct. Genet. 33, 88–96.

33. Holm, L. & Sander, C. (1998). Touring protein fold space with DALI/FSSP. Nucl. Acids Res. 26, 316–319.

34. Gregory, D. S., Martin, A. C., Cheetham, J. C. & Rees, A. R. (1993). The prediction and characterization of metal binding sites in proteins. Protein Eng. 6, 29–35.

35. Wallace, A. C., Laskowski, R. A. & Thornton, J. M. (1996). Derivation of 3D coordinate templates for searching structural databases: application to Ser-His-Asp catalytic triads in the serine proteinases and lipases. Protein Sci. 5, 1001–1013.

36. Artymiuk, P. J., Poirette, A. R., Grindley, H. M., Rice, D. W. & Willett, P. (1994). A graph-theoretic approach to the identification of three-dimensional patterns of amino acid side-chains in protein structures. J. Mol. Biol. 243, 327–344.

37. Russell, R. B. (1998). Detection of protein three-dimensional side-chain patterns: new examples of convergent evolution. J. Mol. Biol. 279, 1211–1227.

38. Smith, R. F. & Smith, T. F. (1990). Automatic generation of primary sequence patterns from sets of related protein sequences. Proc. Natl Acad. Sci. USA, 97, 118–122.

39. Lawrence, C. E., Altschul, S. F., Boguski, M. S., Liu, J. S., Neuwald, A. F. & Wootton, J. C. (1993). Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science, 262, 208–214.

40. Rigoutsos, I., Floratos, A., Ouzounis, C., Gao, Y. & Parida, L. (1999). Dictionary building via unsupervised hierarchical motif discovery in the sequence space of natural proteins. Proteins: Struct. Funct. Genet. 37, 264–277.

41. Saqi, M. A. & Sternberg, M. J. E. (1994). Identification of sequence motifs from a set of proteins with related function. Protein Eng. 7, 165–171.

42. Bairoch, A. (1993). The PROSITE dictionary of sites and patterns in proteins, its current status. Nucl. Acids Res. 21, 3097–3103.

43. Bairoch, A., Bucher, P. & Hofmann, K. (1997). The PROSITE database, its status in 1997. Nucl. Acids Res. 25, 217–221.

44. Attwood, T. K., Beck, M. E., Bleasby, A. J. & Parry-Smith, D. J. (1994). PRINTS: a database of protein motif fingerprints. Nucl. Acids Res. 22, 3590–3596.

45. Attwood, T. K., Croning, M. D. R., Flower, D. R., Lewis, A. P., Mabey, J. E. & Scordis, P. (2000). PRINTS-S: the database formerly known as PRINTS. Nucl. Acids Res. 28, 225–227.

46. Henikoff, S., Pietrokovski, S. & Henikoff, J. G. (1998). Superior performance in protein homology detection with the Blocks Database servers. Nucl. Acids Res. 26, 309–312.

47. Kasuya, A. & Thornton, J. M. (1999). Three-dimensional structure analysis of prosite patterns. J. Mol. Biol. 286, 1673–1691.

48. Sigler, P. B., Blow, D. M., Matthews, B. W. & Henderson, R. (1968). Structure of crystalline α-chymotrypsin. II. A preliminary report including a hypothesis for the activation mechanism. J. Mol. Biol. 35, 143–164.

49. Carter, P. & Wells, J. (1988). Dissecting the catalytic triad of a serine protease. Nature, 332, 564–568.

50. Dodson, G. & Wlodawer, A. (1998). Catalytic triads and their relatives. Trends Biochem. Sci. 23, 347–352.

51. Brady, L., Brzozowski, A. M., Derewenda, Z. S., Dodson, E., Dodson, G., Tolley, S. et al. (1990). A serine protease triad forms the catalytic centre of a triacylglycerol lipase. Nature, 343, 767–770.

52. Schrag, J. D., Li, Y. G., Wu, S. & Cygler, M. (1991). Ser-His-Glu triad forms the catalytic site of the lipase from Geotrichum candidum. Nature, 351, 761–764.

53. Newman, M., Watson, F., Roychowdhury, P., Jones, H., Badasso, M., Cleasby, A., Wood, S. P., Tickle, I. J. & Blundell, T. L. (1993). X-ray analysis of aspartic proteinases V: structure and refinement at 2.0 Å resolution of the aspartic proteinase from Mucor pusillus. J. Mol. Biol. 230, 260–283.

54. Solomon, E. I., Sundaram, U. M. & Machonkin, E. T. (1996). Multicopper oxidases and oxygenases. Chem. Rev. 96, 2563–2605.

55. Zhang, F. & Jetten, A. M. (2001). Genomic structure of the gene encoding the human GLI-related, Kruppel-like zinc finger protein GLIS2. Gene, 280, 49–57.

56. Bron, C. & Kerbosch, J. (1971). Algorithm 457: finding all cliques of an undirected graph. Commun. A.C.M. 16, 575–577.

57. Whiting, A. K. & Peticolas, W. L. (1994). Details of the acyl–enzyme intermediate and the oxyanion hole in serine protease catalysis. Biochemistry, 33, 552–561.

58. Qiu, X., Culp, J. S., DiLella, A. G., Hellmig, B., Hoog, S. S., Janson, C. A. et al. (1996). Unique fold and active site in cytomegalovirus protease. Nature, 383, 275–279.

59. Krem, M. M. & Di Cera, E. (2001). Molecular markers of serine protease evolution. EMBO J. 20, 3036–3045.

60. Barth, A., Wahab, M., Brandt, W. & Frost, K. (1993). Classification of serine proteases derived from steric comparisons of their active sites. Drug Des. Discovery, 10, 297–317.

61. Swain, A. L., Kretsinger, R. H. & Amma, E. L. (1989). Restrained least squares refinement of native (calcium) and cadmium-substituted carp parvalbumin using X-ray crystallographic data at 1.6-Å resolution. J. Biol. Chem. 264, 16620–16628.

62. Ames, J. B., Dizhoor, A. M., Ikura, M., Palczewski, K. & Stryer, L. (1999). Three-dimensional structure of guanylyl cyclase activating protein-2, a calcium-sensitive modulator of photoreceptor guanylyl cyclases. J. Biol. Chem. 274, 19329–19337.

63. Slupsky, C. M., Desautels, M., Huebert, T., Zhao, R., Hemmingsen, S. M. & McIntosh, L. P. (2001). Structure of Cdc4p, a contractile ring protein essential for cytokinesis in Schizosaccharomyces pombe. J. Biol. Chem. 276, 5943–5951.

64. Audette, G. F., Vandonselaar, M. & Delbaere, L. T. J. (2000). The 2.2 Å resolution structure of the O(H) blood-group-specific lectin I from Ulex europaeus. J. Mol. Biol. 304, 423–433.

65. Wah, D. A., Romero, A., del Sol, F. G., Cavada, B. S., Ramos, M. V., Grangeiro, T. B. et al. (2001). Crystal structure of native and Cd/Cd-substituted Dioclea guianensis seed lectin. A novel manganese-binding site and structural basis of dimer–tetramer association. J. Mol. Biol. 310, 885–894.

66. Hamelryck, T. W., Moore, J. G., Chrispeels, M. J., Loris, R. & Wyns, L. (2000). The role of weak protein–protein interactions in multivalent lectin–carbohydrate binding: crystal structure of cross-linked FRIL. J. Mol. Biol. 299, 875–883.

67. Varela, P. F., Dolores, S., Díaz-Mauriño, T., Kaltner, H., Gabius, H.-J. & Romero, A. (1999). The 2.15 Å crystal structure of CG-16, the developmentally regulated homodimeric chicken galectin. J. Mol. Biol. 294, 537–549.

68. Ay, J., Götz, F., Borriss, R. & Heinemann, U. (1998). Structure and function of the Bacillus hybrid enzyme GluXyn-1: native-like jellyroll fold preserved after insertion of autonomous globular domain. Proc. Natl Acad. Sci. USA, 95, 6613–6618.

69. Schrag, J. D. & Cygler, M. (1993). 1.8 Å refined structure of the lipase from Geotrichum candidum. J. Mol. Biol. 230, 575–591.

70. Scheib, H., Pleiss, J., Kovac, A., Paltauf, F. & Schmid, R. D. (1999). Stereoselectivity of mucorales lipases toward triradylglycerols: a simple solution to a complex problem. Protein Sci. 8, 215.

71. Derewenda, U., Brzozowski, A. M., Lawson, D. M. & Derewenda, Z. S. (1992). Catalysis at the interface: the anatomy of a conformational change in a triglyceride lipase. Biochemistry, 31, 1532.

72. Sandy, A., Mushtaq, A., Kawamura, A., Sinclair, J., Sim, E. & Noble, M. (2002). The structure of arylamine N-acetyltransferase from Mycobacterium smegmatis: an enzyme which inactivates the antitubercular drug, Isoniazid. J. Mol. Biol. 318, 1071–1083.

73. Johnston, S. C., Larsen, C. N., Cook, W. J.,Wilkinson, K. D. & Hill, C. P. (1997). Crystalstructure of a deubiquitinating enzyme (humanUCH-L3) at 1.8 A resolution. EMBO J. 16,3787–3796.

74. Zhao, B., Janson, C. A., Amegadzie, B. Y., D’Alessio,K., Griffin, C., Hanning, C. R. et al. (1997). Crystalstructure of human osteoclast cathepsin K complexwith E-64. Nature Struct. Biol. 4, 109–111.

75. Menard, R., Carriere, J., Laflamme, P., Plouffe, C.,Khouri, H. E., Vernet, T. et al. (1991). Contributionof the glutamine 19 side chain to transition-statestabili-zation in the oxyanion hole of papain.Biochemistry, 30, 8924–8928.

76. Sinclair, J. C., Sandy, J., Delgoda, R., Sim, E. &Noble, M. E. (2000). Structure of arylamine N-acetyltransferase reveals a catalytic triad. NatureStruct. Biol. 7, 560.

77. Murthy, S. N. P., Iismaa, S., Begg, G., Freymann,D. M., Graham, R. M. & Lorand, L. (2002). Con-served tryptophan in the core domain of transgluta-minase is essential for catalytic activity. Proc. NatlAcad. Sci. USA, 2002, 2738–2742.

78. Weiss, M. S., Metzner, H. J. & Hilgenfeld, R. (1998). Two non-proline cis peptide bonds may be important for factor XIII function. FEBS Letters, 423, 291–296.

79. Noguchi, K., Ishikawa, K., Yokoyama, K., Ohtsuka, T., Nio, N. & Suzuki, E. J. (2001). Crystal structure of red sea bream transglutaminase. J. Biol. Chem. 276, 12055–12059.

80. Cobessi, D., Huang, L.-S., Ban, M., Pon, N. G., Daldal, F. & Berry, E. A. (2002). The 2.6 Å resolution structure of Rhodobacter capsulatus bacterioferritin with metal-free dinuclear site and heme iron in a crystallographic “special position”. Acta Crystallog. D58, 29–38.

81. Kauppi, B., Nielsen, B. B., Ramaswamy, S., Larsen, I. K., Thelander, M., Thelander, L. & Eklund, H. (1996). The three-dimensional structure of mammalian ribonucleotide reductase protein R2 reveals a more-accessible iron-radical site than Escherichia coli R2. J. Mol. Biol. 262, 706–720.

82. Stillman, T. J., Hempstead, P. D., Artymiuk, P. J., Andrews, S. C., Hudson, A. J., Treffry, A. et al. (2001). The high-resolution X-ray crystallographic structure of the ferritin (EcFtnA) of Escherichia coli; comparison with Human H Ferritin (HuHF) and the structures of the Fe3+ and Zn2+ derivatives. J. Mol. Biol. 307, 587–603.

83. Voegtli, W. C., Ge, J., Perlstein, D. L., Stubbe, J. & Rosenzweig, A. C. (2001). Structure of the yeast ribonucleotide reductase Y2Y4 heterodimer. Proc. Natl Acad. Sci. USA, 98, 10073–10078.

84. Kerfeld, C. A., Anwar, H. P., Interrante, R., Merchant, S. & Yeates, T. O. (1995). The structure of chloroplast cytochrome c6 at 1.9 Å resolution: evidence for functional oligomerization. J. Mol. Biol. 250, 627–647.

85. Kadziola, A., Abe, J., Svensson, B. & Haser, R. (1994). Crystal and molecular structure of barley alpha-amylase. J. Mol. Biol. 239, 104–121.

86. Mirza, O., Skov, L. K., Remaud-Simeon, M., de Montalk, G. P., Albenne, C., Monsan, P. & Gajhede, M. (2001). Crystal structures of amylosucrase from Neisseria polysaccharea in complex with D-Glucose and the active site mutant Glu328Gln in complex with the natural substrate sucrose. Biochemistry, 40, 9032–9039.

87. Brayer, G. D., Sidhu, G., Maurus, R., Rydberg, E. H., Braun, C., Wang, Y. et al. (2000). Subsite mapping of the human pancreatic alpha-amylase active site through structural, kinetic and mutagenesis studies. Biochemistry, 39, 4778–4791.

88. Fujimoto, Z., Takase, K., Doui, N., Momma, M., Matsumoto, T. & Mizuno, H. (1998). Crystal structure of a catalytic-site mutant α-amylase from Bacillus subtilis complexed with maltopentaose. J. Mol. Biol. 277, 393–407.

89. Newman, M., Safro, M., Frazao, C., Khan, G., Zdanov, A., Tickle, I. J. et al. (1991). X-ray analysis of aspartic proteinases IV: structure and refinement at 2.2 Å resolution of bovine chymosin. J. Mol. Biol. 221, 1295–1309.

90. Singh, G., Gourinath, S., Sharma, S., Paramasivam, M., Srinivasan, A. & Singh, T. P. (2001). Sequence and crystal structure determination of a basic phospholipase A2 from common krait (Bungarus caeruleus) at 2.4 Å resolution: identification and characterization of its pharmacological sites. J. Mol. Biol. 307, 1049–1059.

91. Fremont, D. H., Anderson, D. H., Wilson, I. A., Dennis, E. A. & Xuong, N.-H. (1993). Crystal structure of phospholipase A2 from Indian cobra reveals a trimeric association. Proc. Natl Acad. Sci. USA, 90, 342–346.

92. Carrendano, E., Westerlund, B., Persson, B., Saarinen, M., Ramaswamy, S., Baker, D. & Eklund, H. (1998). The three dimensional structures of two toxins from snake venom throw light on the anticoagulant and neurotoxic sites of phospholipase A2. Toxicon, 36, 75–92.

93. Brunie, S., Bolin, J., Gewirth, D. & Sigler, P. B. (1985). The refined crystal structure of dimeric phospholipase A2 at 2.5 Å: access to a shielded catalytic center. J. Biol. Chem. 260, 9742–9749.

94. Wang, X., Yang, J., Gui, L., Lin, Z., Chen, Y. & Zhou, Y. (1996). Crystal structure of an acidic phospholipase A2 from the venom of Agkistrodon halys pallas at 2.0 Å resolution. J. Mol. Biol. 255, 669–676.

95. Newman, M., Lunnen, K., Wilson, G., Greci, J., Schildkraut, I. & Phillips, S. E. V. (1998). Crystal structure of restriction endonuclease BglI bound to its interrupted DNA recognition sequence. EMBO J. 17, 5466–5476.

96. van der Woerd, M. J., Pelletier, J. J., Xu, S. & Friedman, A. M. (2001). Restriction enzyme BsoBI–DNA complex: a tunnel for recognition of degenerate DNA sequences and potential histidine catalysis. Structure, 9, 133–144.

97. Ban, C. & Yang, W. (1998). Structural basis for MutH activation in E. coli mismatch repair and relationship of MutH to restriction endonucleases. EMBO J. 17, 1526–1534.

98. Grazulis, S., Deibert, M., Rimseliene, R., Skirgaila, R., Sasnauskas, G. & Lagunavicius, A. (2002). Crystal structure of the Bse634I restriction endonuclease: comparison of the two enzymes recognizing the same DNA sequence. Nucl. Acids Res. 30, 876–885.

99. Allocati, N., Casalone, E., Masulli, M., Polekhina, G., Rossjohn, J., Parker, M. W. & Di Ilio, C. (2000). Evaluation of the role of two conserved active-site residues in beta class glutathione S-transferases. Biochem. J. 351, 341–346.

100. Sun, Y.-J., Kuan, I., Tam, M. F. & Hsiao, C.-D. (1998). The three-dimensional structure of an avian class-mu glutathione S-transferase, cGSTM1-1 at 1.94 Å resolution. J. Mol. Biol. 278, 239–252.

101. Liang, J., Chen, J. K., Schreiber, S. L. & Clardy, J. (1996). Crystal structure of P13K SH3 domain at 2.0 Å resolution. J. Mol. Biol. 257, 632–643.

102. Wittenkind, M., Mapelli, C., Farmer, B. T., II, Suen, K. & Goldfarb, V. (1994). Orientation of peptide fragments from Sos proteins bound to the N-terminal SH3 domain of Grb2 determined by NMR spectroscopy. Biochemistry, 33, 13531–13539.

103. Politou, A. S., Millevoi, S., Gautel, M., Kolmerer, B. & Pastore, A. (1998). SH3 in muscles: solution structure of the SH3 domain from nebulin. J. Mol. Biol. 276, 189–202.

104. Pavletich, N. P. & Pabo, C. O. (1993). Crystal structure of a five-finger GLI-DNA complex: new perspectives on zinc fingers. Science, 261, 1701–1707.

105. Klug, A. & Schwabe, J. W. (1995). Protein motifs 5. Zinc fingers. Fed. Am. Soc. Exp. Biol. J. 9, 597–604.

106. Kinzler, K. W. & Vogelstein, B. (1990). The GLI gene encodes a nuclear protein which binds specific sequences in the human genome. Mol. Cell. Biol. 10, 634–642.

107. Ruppert, J. M., Vogelstein, B., Arheden, K. & Kinzler, K. W. (1990). GLI3 encodes a 190-kilodalton protein with multiple regions of GLI similarity. Mol. Cell. Biol. 10, 5408–5415.

108. Aruga, J., Yokota, N., Hashimoto, M., Furuichi, T., Fukuda, M. & Mikoshiba, K. (1994). A novel zinc finger protein, zic, is involved in neurogenesis, especially in the cell lineage of cerebellar granule cells. J. Neurochem. 63, 1880–1890.

109. Aruga, J., Yozu, A., Hayashizaki, Y., Okazaki, Y., Chapman, V. M. & Mikoshiba, K. (1996). Identification and characterization of Zic4, a new member of the mouse Zic gene family. Gene, 172, 291–294.

110. Nagai, T., Aruga, J., Takada, S., Gunther, T., Sporle, R., Schughart, K. & Mikoshiba, K. (1997). The expression of the mouse Zic1, Zic2, and Zic3 gene suggests an essential role for Zic genes in body pattern formation. Dev. Biol. 182, 299–313.

111. Berg, J. M. (1990). Zinc finger domains: hypotheses and current knowledge. Annu. Rev. Biophys. Biophys. Chem. 19, 405–421.

112. Berg, J. M. & Godwin, H. A. (1997). Lessons from zinc-binding peptides. Annu. Rev. Biophys. Biomol. Struct. 26, 357–371.

113. Ingelman, M., Bianchi, V. & Eklund, H. (1997). The three-dimensional structure of flavodoxin reductase from Escherichia coli at 1.7 Å resolution. J. Mol. Biol. 268, 147–157.

114. Serre, L., Vellieux, F. M. D., Medina, M., Gomez-Moreno, C., Fontecilla-Camps, J. C. & Frey, M. (1996). X-ray structure of the ferredoxin: NADP+ reductase from the cyanobacterium Anabaena PCC 7119 at 1.8 Å resolution, and crystallographic studies of NADP+ binding at 2.25 Å resolution. J. Mol. Biol. 263, 20–39.

115. Bruns, C. M. & Karplus, P. A. (1995). Refined crystal structure of spinach ferredoxin reductase at 1.7 Å resolution: oxidized, reduced and 2′-phospho-5′-AMP bound states. J. Mol. Biol. 247, 125–145.

116. Henriksen, A., Welinde, K. G. & Gajhede, M. (1998). Structure of barley grain peroxidase refined at 1.9-Å resolution. A plant peroxidase reversibly inactivated at neutral pH. J. Biol. Chem. 273, 2241–2248.

117. Choinowski, T., Blodig, W., Winterhalter, K. H. & Piontek, K. (1999). The crystal structure of lignin peroxidase at 1.70 Å resolution reveals a hydroxy group on the Cβ of tryptophan 171: a novel radical site formed during the redox cycle. J. Mol. Biol. 286, 809–827.
