
Quantifying the Similarities within Fold Space


Andrew Harrison1*, Frances Pearl1, Richard Mott2, Janet Thornton1,3 and Christine Orengo1

1Biomolecular Structure and Modelling Unit, Department of Biochemistry and Molecular Biology, University College London, Gower Street, London WC1E 6BT, UK

2Wellcome Trust Centre for Human Genetics, Roosevelt Drive, Oxford OX3 7BN, UK

3European Bioinformatics Institute, Wellcome Trust Genome Campus, Cambridge CB10 1SD, UK

We have used GRATH, a graph-based structure comparison algorithm, to map the similarities between the different folds observed in the CATH domain structure database. Statistical analysis of the distributions of the fold similarities has allowed us to assess the significance of any similarity. We have therefore examined whether it is best to represent folds as discrete entities or whether, in fact, a more accurate model would be a continuum wherein folds overlap via common motifs. To do this we have introduced a new statistical measure of fold similarity, termed gregariousness. For a particular fold, gregariousness measures how many other folds have a significant structural overlap with that fold, typically comprising 40% or more of the larger structure. Gregarious folds often contain commonly occurring super-secondary structural motifs, such as β-meanders, greek keys, α–β plait motifs or α-hairpins, which match similar motifs in other folds. Apart from one example, all the most gregarious folds, those matching 20% or more of the other folds in the database, are α–β proteins. They also occur in highly populated architectural regions of fold space, adopting sandwich-like arrangements containing two or more layers of α-helices and β-strands.

Domains that exhibit low gregariousness are those that have very distinctive folds, with few common motifs or with motifs that are packed in unusual arrangements. Most of the superhelices exhibit low gregariousness despite containing some commonly occurring super-secondary structural motifs. In these folds, the common motifs are combined in an unusual way and represent a small proportion of the fold (<10%). Our results suggest that fold space may be considered as continuous for some architectural arrangements (e.g. α–β sandwiches), in that super-secondary motifs can be used to link neighbouring fold groups. However, in other regions of fold space much more discrete topologies are observed, with little similarity between folds.

© 2002 Elsevier Science Ltd. All rights reserved

Keywords: fold space; GRATH; fold similarity; CATH; gregariousness

*Corresponding author

Introduction

Here we report on significant structural overlaps between folds and how these similarities are distributed across the set of known structures, also described as “fold space”. There is considerable interest in this distribution as highly recurrent motifs may be associated with favourable folding arrangements of secondary structures and similarities between fold groups may reveal evolutionary mechanisms for extending the protein structure repertoire.

Many previous analyses of structural similarity have concentrated mainly on identifying global structural relationships. For example, the CATH1 classification of protein folds gives a discrete description of fold space. Currently approximately 750 folds are identified using a robust structure comparison algorithm. Empirical criteria are used for classifying proteins into these fold groups. Classifications such as SCOP2 and CATH are often used to provide fold libraries for structure prediction algorithms such as threading,3 which attempt to fit sequences to 3D structures by optimising energy profiles. In this context, there has been considerable discussion as to whether it is appropriate to consider folds as discrete entities or whether a continuum of folds exists. In the latter case, information on putative structural neighbours would be of considerable value to threading and other related prediction algorithms.

0022-2836/02/$ - see front matter © 2002 Elsevier Science Ltd. All rights reserved

E-mail address of the corresponding author: [email protected]

Abbreviations used: CASP, competitive assessment of structure prediction.

doi:10.1016/S0022-2836(02)00992-0 available online at http://www.idealibrary.com

J. Mol. Biol. (2002) 323, 909–926

Analyses of similarities in the sets of known sequences and structures have suggested that there are a limited number of protein folds in nature, estimated to be of the order of one to several thousand,4,5 and that not all possible topologies have been sampled. This may be due to constraints on secondary structure packing. Alternatively, the limited repertoire of folds apparently sampled may reflect the preponderance of some evolutionary families, selected for their preferred thermodynamic or functional properties. Whatever the reason, it is very clear from several analyses1,6 that fold groups are not uniformly populated. Some folds are adopted by a single homologous superfamily, whereas others are used by many families. The most frequently used folds have been described as “superfolds”5 or “frequently occurring domains”.6 In the CATH database the five most highly populated fold groups currently account for nearly 20% of the homologous superfamilies.

Increasingly, analyses of protein structural similarity challenge our definitions of the term “fold” and the allocation of a single fold description to a particular evolutionary superfamily. With the huge increases in the number of sequences determined by the international genome initiatives, driving the need to improve prediction of protein structures, it is important to reconsider our views of fold space and the manner in which structural similarities between proteins are assessed.

To date, fewer analyses have been performed to identify local motif-based similarities between folds. Several recurrent super-secondary motifs have been identified, initially from manual analyses of protein structures, e.g. α–β motifs,7 β-hairpins8 and α–β plait motifs.9 A few motif-based databases have been established to record the characteristics of these motifs and where they occur in known protein structures (for example PROMOTIF10).

Larger structural motifs occurring in different protein folds may hint at diverse evolutionary relationships. A number of recent reviews have proposed mechanisms whereby apparently distinct folds possessing common motifs, often of functional importance, may represent distant evolutionary relatives with similar core modules embellished in different ways, thereby giving rise to globally distinct folds.11,12 Similarly, several examples have been cited of homologous proteins whose fold has been effectively changed by local structural rearrangements, for example the flipping of a β-hairpin, which alter the local topology but not the architecture of the protein. Sometimes, protein function is retained, but catalytic residues are now contributed from different regions of the polypeptide, though they may be co-located in 3D.

Previous analysis of the CATH database revealed some highly populated regions of fold space, where there is considerable structural overlap between folds and corresponding difficulties13 in identifying distinct folds. In these regions there appears to be a continuum of fold motifs available to proteins. In fact, it was suggested that for some protein architectures, particularly sandwich-like architectures comprising layers of β-sheets or α-helices and β-sheets, the structural universe should perhaps be viewed more as a continuum.

In these regions of fold space, visual inspection of structural similarities has revealed some motifs common to proteins having distinct overall folds. In many cases, as the size of the protein increases, the repertoire of folds available appears to consist of extensions to existing motifs.13 For instance, Sippl and co-workers14 have shown how it is possible to walk from one α/β sandwich fold to another, through the extension of α/β motifs. Furthermore, certain motifs, described as attractors, occur as the core of a protein’s structure more frequently than others.15 Moreover, fold space shows similarities at granularities different from that usually considered.16

In order to examine these similarities systematically it is necessary to devise a sufficiently rapid method for comparing structures that will allow the relationships between all known structures to be explored. It is also necessary to employ a reliable quantitative method for measuring any structural similarities and assessing their significance. Although many structure comparison methods17 have been developed over the last 30 years, many of the most accurate methods are too slow to allow extensive database comparisons. For example, the residue-based SSAP18 algorithm requires approximately two years to perform an all against all comparison of 1000 average-sized structures with a single processor.

Many structural classifications and structural neighbour lists (e.g. DALI,19 VAST,20 CAMPASS,21 and SSAPc22) employ rapid secondary structure-based methods as a filter for identifying potentially related structures, which are then re-compared using the slow, accurate residue-based methods. Often these methods use simple empirical approaches for measuring structural similarity. For example, the root mean square deviation after superposition is commonly used as a measure of structural similarity, though this is very dependent on the sizes of the proteins being compared and the number of residues superposed. More recently, attempts have been made to devise more rigorous statistical approaches, including the measurement of Z-scores16,20,23 and of probability (P) and expectation (E) values.24,25 However, many of these statistical frameworks have been developed for the much slower residue-based comparison methods.

Although the structure comparison methods24,25 employ sound statistical frameworks for assessing significance of putative matches, their residue-based approaches are still too slow to allow extensive database comparison. Furthermore, unlike the realm of sequence alignment, where there has been extensive benchmarking, to date there has been little assessment of the reliability of structure comparison methods and associated similarity scores in identifying related protein structures. However, notably, recently benchmarked26 Z-scores applied in the DALI domain database showed that at least 77% of the homologous relationships automatically detected agreed with manually validated relationships in the SCOP database.

We have developed a new method, GRATH (A.H., F.P., I. Sillitoe, T. Slidel, R.M., J.T. & C.O., unpublished results), based on comparing secondary structures and their relationships between proteins. GRATH is a rapid graph-based comparison algorithm inspired by the work of Grindley and co-workers27 but which we have extended in a number of ways. The most important development is the design of a rigorous statistical approach for assessing the significance of any similarity detected by the method. This method is based on identifying structural similarities without explicit consideration of sequence and function and so will potentially recognise both analogues and homologues.

Because GRATH compares only secondary structures between proteins it is over three orders of magnitude faster than residue-based methods for aligning protein structures. Therefore it can be used as a front-end for the accurate but computationally expensive algorithm SSAP,18 previously used to classify known structures in the PDB into fold groups and homologous superfamilies in the CATH database. Furthermore, because GRATH is so fast, it is possible to perform extensive all against all comparisons of domain structures in CATH. GRATH is ideally suited to studying the frequency with which motifs are shared amongst folds. Most importantly, analysis of the distributions of comparison scores returned by GRATH has allowed us to develop a statistical measure for assessing the significance of any similarity between two proteins.

We have performed a benchmarking of GRATH, using established structural relationships from the CATH database, from which we have determined that the significance for any fold match is a simple function of the sizes of the proteins compared. This is a major development over the simple algorithm introduced by Grindley and co-workers, which returned a raw score with no estimate of significance. Structural matches had to be manually assessed to reveal any biologically meaningful insights. By calculating a statistical significance we are able to use GRATH to perform extensive database comparisons, extracting only those matches that appear significant.

This considerably extends the use of the method beyond the original Grindley implementation and is analogous to the improvements obtained in the development of fast search methods28 for sequence databases. Those methods also extended existing techniques, in their case based on the analysis of 2D score matrices or dot-plots, and the introduction of a rigorous statistical framework to assess significance meant that these simple approaches could be used for large-scale screening of sequence databases.

Extensive application of GRATH to compare all known protein structures, combined with this robust statistical framework, has allowed us to describe the global properties of fold space in a simple form. This universal perspective of fold space has followed from performing a size calibration to normalise the level of fold similarity across fold space, i.e. analogous to setting the local sea-level. Importantly, this size calibration has allowed us to introduce a measure of the local topology density, termed gregariousness. Gregariousness measures how many other folds have a significant structural similarity to a particular fold, and yet have a different overall topology. It has allowed us to determine the fraction of fold space that any fold matches, and we have used it to detail which regions of fold space are densely and sparsely occupied. Moreover, information on folds possessing statistically significant similarities can be used to rapidly extract lists of structural neighbours for any particular structural family. These may be valuable for assessing the performance of structure prediction algorithms and for identifying common motifs used as building blocks in ab initio methods of structure prediction.

Results

Development of a scoring function for GRATH and assessment of performance

GRATH (A.H., F.P., I. Sillitoe, T. Slidel, R.M., J.T. & C.O., unpublished results) compares the axial vectors of α-helices and β-strands of two proteins, together with the distances, angles and chirality between these vectors. The fold of a protein can thus be readily described by a graph in which the “nodes” correspond to the secondary structures, whilst the “edges” describe the geometric relationships between the secondary structures (see Figure 1). The translation of proteins into graphs in this way readily allows determination of the amount of overlap between the two proteins, expressed as the largest clique† common to both proteins.
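The idea of a largest common clique can be illustrated with a minimal sketch. This is not the GRATH implementation: it assumes that each protein is reduced to a list of secondary structure types plus a single pairwise distance per edge, ignores the angles and chirality mentioned above, and uses a hypothetical tolerance `tol`. Pairs of compatible secondary structures form a correspondence graph, and the basic Bron-Kerbosch procedure (a standard clique-enumeration technique, swapped in here for illustration) finds its largest clique.

```python
from itertools import combinations

def correspondence_graph(nodes1, edges1, nodes2, edges2, tol=2.0):
    """Build a correspondence graph between two secondary-structure graphs.

    nodes*: list of type labels ('H' helix, 'E' strand).
    edges*: dict {(i, j): distance} with i < j (illustrative edge label only).
    tol:    hypothetical distance tolerance, not a GRATH parameter.
    """
    # A vertex pairs node i of protein 1 with node j of protein 2 of the same type.
    verts = [(i, j) for i, a in enumerate(nodes1)
             for j, b in enumerate(nodes2) if a == b]
    adj = {v: set() for v in verts}
    for (i, j), (k, l) in combinations(verts, 2):
        if i == k or j == l:
            continue  # a node may be matched only once
        d1 = edges1[(min(i, k), max(i, k))]
        d2 = edges2[(min(j, l), max(j, l))]
        if abs(d1 - d2) <= tol:  # the two edges are geometrically compatible
            adj[(i, j)].add((k, l))
            adj[(k, l)].add((i, j))
    return adj

def largest_clique(adj):
    """Return one largest clique using basic Bron-Kerbosch (no pivoting)."""
    best = []

    def expand(r, p, x):
        nonlocal best
        if not p and not x:
            if len(r) > len(best):
                best = list(r)
            return
        for v in list(p):
            expand(r + [v], p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand([], set(adj), set())
    return best

# Toy example: a helix plus two strands, embedded in a larger four-element domain.
nodes_a = ['H', 'E', 'E']
edges_a = {(0, 1): 10.0, (0, 2): 12.0, (1, 2): 5.0}
nodes_b = ['E', 'E', 'H', 'H']
edges_b = {(0, 1): 5.0, (0, 2): 10.5, (0, 3): 30.0,
           (1, 2): 12.5, (1, 3): 31.0, (2, 3): 20.0}
clique = largest_clique(correspondence_graph(nodes_a, edges_a, nodes_b, edges_b))
```

In this toy case every secondary structure of the smaller domain finds a partner in the larger one, so the clique size equals the size of the smaller graph. Exhaustive clique enumeration is exponential in the worst case, which is acceptable here only because secondary-structure graphs are small.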

As suggested previously,27 the clique size proved to be the most important discriminator of folds. Initially we attempted to construct a theoretical model for calculating the frequency with which random graphs could be expected to exhibit a similarity, i.e. would share a common clique, by coincidence. This would readily allow us to assess the significance of any matches found between protein structure graphs. Several theoretical approaches were explored. However, as discussed in the Appendix, protein structures and their associated graphs have certain properties that make it very difficult to devise statistical approaches based on simple theoretical models. For example, some connections between nodes in the graphs are simply not allowed as they contradict well-established rules of protein folding, such as chirality of motifs and the rarity of knots in the polypeptide chain.

† A clique is a subgraph in which each node has edges connecting it to all the other nodes in the clique.

Moreover, protein structures are far from random shapes, and the contents of the CATH database indicate that the fold world is far from randomly populated. Some folds, e.g. the TIM barrel fold and especially the Rossmann fold, appear to have been adopted independently by many different evolutionary families. Therefore, despite attempts to employ a theoretically derived statistical model, these considerations led us to initially use a simple empirical scoring function for GRATH, based on the size of the overlap, or clique, identified between two protein structures. This function measures the overlap between any two folds and is scaled so that its scores range from 0 (no overlap) to 1 (perfect overlap).

Several scoring functions were explored and benchmarked using a representative dataset of 1702 proteins, selected from each sequence family in the CATH database (the Sreps data-set). Benchmarking protocols measured the percentage of structures in the data-set for which at least one relative from the same fold group was found in the top ten matches returned by scanning the complete data-set. See Methods for a more detailed description of the selection of the data-set and the benchmarking protocols used. Representatives were selected from each sequence family (Sreps), rather than each fold group, to capture the structural variation that can occur across a fold group. This maximises the probability of a query structure matching at least one relative from its own fold group.

The scoring function shown in equation (1) below gave the best combination of smooth score distributions for a statistical analysis (see below) and coverage:

$$S = \frac{5\sqrt{\dfrac{\mathrm{CS}}{\mathrm{SS}_1}\cdot\dfrac{\mathrm{CS}}{\mathrm{SS}_2}} + \dfrac{\min(R_1,R_2)}{\max(R_1,R_2)} + 2\sqrt{\dfrac{\mathrm{CR}_1}{R_1}\cdot\dfrac{\mathrm{CR}_2}{R_2}}}{8} \qquad (1)$$

Here the two proteins compared have SS1 and SS2 secondary structures and R1 and R2 amino acid residues. Their comparison generates a largest clique of size CS, and the largest clique is produced from a set of secondary structures in protein 1 that contain a total of CR1 residues and from a set of secondary structures in protein 2 that contain a total of CR2 residues.
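Equation (1) translates directly into a few lines of code. The following is a minimal sketch of that formula only; the function and argument names are our own illustrative choices, not part of GRATH.

```python
import math

def grath_score(cs, ss1, ss2, r1, r2, cr1, cr2):
    """Overlap score of equation (1).

    cs:       size of the largest common clique
    ss1, ss2: number of secondary structures in proteins 1 and 2
    r1, r2:   number of residues in proteins 1 and 2
    cr1, cr2: residues in the clique's secondary structures in each protein
    """
    sse_term = math.sqrt((cs / ss1) * (cs / ss2))      # fraction of secondary structures matched
    size_term = min(r1, r2) / max(r1, r2)              # relative size of the two proteins
    residue_term = math.sqrt((cr1 / r1) * (cr2 / r2))  # fraction of residues in the clique
    return (5 * sse_term + size_term + 2 * residue_term) / 8
```

The weights 5, 1 and 2 sum to the denominator 8, so the score reaches its maximum of one only when every secondary structure and every residue of both proteins lies within the clique.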

Coverage, measured as the percentage of structures for which the correct fold group can be found within the top ten matches from scans against the complete data-set, was above 95% in all protein classes. We also found that this function, which includes a normalisation for the clique size as well as information about the number of residues within the cliques, gave a better coverage than that based simply on clique size.27
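The coverage statistic can be sketched as follows, assuming that the ranked hit list for each query (with the query itself excluded) and the fold assignment of every structure are already available. The data structures and names are hypothetical and chosen only for illustration.

```python
def top_n_coverage(ranked_hits, fold_of, n=10):
    """Percentage of queries with at least one same-fold relative in the top n hits.

    ranked_hits: {query_id: [hit_id, ...]} ordered best first, query excluded
    fold_of:     {structure_id: fold label}
    """
    found = 0
    for query, hits in ranked_hits.items():
        if any(fold_of[hit] == fold_of[query] for hit in hits[:n]):
            found += 1
    return 100.0 * found / len(ranked_hits)
```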

Statistical analysis of the significance of similarity scores returned by GRATH

In order to study the statistics of fold comparison, the scores returned by putative matches must be contrasted against those returned from scans against a large data-set of unrelated proteins, in this case the Sreps data-set. To do this, one member of each sequence family was compared against an example fold from every other sequence family. However, because some fold groups contain very many sequence families, which may all yield high scores to the query structure, the resulting score distributions must be adjusted to prevent distortion of the distribution towards these high scores. Therefore, the highest score found amongst the sequence family representatives (Sreps) of each fold was chosen, and the resulting distribution of highest scores was studied.
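Reducing a scan to one score per fold is a simple grouping step. A minimal sketch, assuming the scan result is held as a mapping from each Srep identifier to its score and that a fold label is known for each Srep (both names are hypothetical):

```python
def best_score_per_fold(scores, fold_of):
    """Keep only the highest score among the Sreps of each fold.

    scores:  {srep_id: score against the query}
    fold_of: {srep_id: fold label}
    """
    best = {}
    for srep, score in scores.items():
        fold = fold_of[srep]
        if fold not in best or score > best[fold]:
            best[fold] = score
    return best
```

The values of the returned mapping form the distribution of highest scores whose tail is then examined.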

As an example, Figure 2 shows the distribution of scores obtained from comparing the armadillo repeat region from murine β-catenin (PDB 2bct,29 C = 1, A = 25, T = 30) against the 580 distinct types of fold in the CATH database (version 1.7). The tail of the histogram (i.e. the top 30% of the scores) follows a smooth distribution, in particular that of a decaying exponential. This type of distribution is commonly referred to as an extreme value distribution and is also observed in searches of sequence databases using standard sequence comparison methods (e.g. BLAST, Smith-Waterman). It has also been observed for searches of structural databases, using alignment methods that compare residue properties and employ dynamic programming.24,25

Figure 1. A generic graph of a domain with three secondary structures. The nodes are vectorial representations of the axis through each secondary structure. Node x is labelled by the secondary structure type, Lx (helix or strand). The edges, between vectors x and y, are labelled by: the distance of closest approach, dxy; the dot-product angle, cxy; the dihedral angle, uxy; the chirality, chixy.

Since the majority of the scores in the tail have been returned by folds unrelated to the query structure, by mathematically fitting this tail we can calculate the frequency with which we can expect to observe any particular score by chance, i.e. from an unrelated structure. The tail of the distribution can be described by the following equation:

$$F = I e^{-ks} \qquad (2)$$

where F is the number of matches for each score, s, k is the decay rate and I is a scaling variable. Taking the natural log results in:

$$\log F = \log I - ks \qquad (3)$$

Fitting this expression to the data results in a value for the intercept, which is equal to log I, and a gradient, whose magnitude is equal to the decay rate k. Such a fit can be seen in Figure 3. Once I and k are determined, it is possible to calculate the expected frequency with which any score, for random structures, will be observed.
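The fit of equation (3) is an ordinary straight-line fit of log F against s. The sketch below uses unweighted least squares on binned tail data; this is an assumption for illustration, since the fitting procedure actually used (binning, weighting, outlier handling) is not specified here.

```python
import math

def fit_exponential_tail(scores, counts):
    """Fit log F = log I - k*s by unweighted least squares.

    scores: bin centres of the tail of the score histogram
    counts: number of matches F in each bin (empty bins are skipped)
    Returns (log_I, k).
    """
    pts = [(s, math.log(f)) for s, f in zip(scores, counts) if f > 0]
    n = len(pts)
    mean_s = sum(s for s, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxx = sum((s - mean_s) ** 2 for s, _ in pts)
    sxy = sum((s - mean_s) * (y - mean_y) for s, y in pts)
    slope = sxy / sxx                  # gradient of the line, equal to -k
    log_i = mean_y - slope * mean_s    # intercept, equal to log I
    return log_i, -slope
```

Any obvious outlier, such as the query matching its own fold, would need to be removed from the inputs before fitting.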

Moreover, the expectation value statistic for a score (E-value), which is the number of times a score above a certain value will be observed by chance in a database search, also follows naturally. It is the integral of the distribution from the score to the maximum score possible, which is one in the present case. Importantly, both k (Figure 4) and log I (Figure 5) are dependent on the size of the graph, G (= SS1), the number of secondary structures in the domain to be analysed. Both the gradient and intercept data can be described by an equation of the form:

$$y = a x^{b}$$

Figure 2. A histogram of different fold scores for 2bct00 (C = 1, A = 25, T = 30) using the function in equation (1). The highest score in each fold category is represented. The tail of the distribution is smooth and so an empirical fit can be made.

Figure 3. A fit to the tail of the histogram of 2bct00, which ignores the obvious outlier. The tail is consistent with an extreme-value distribution, as the logarithm of the frequency is fitted well by a straight line.

Figure 4. The gradient is related to the graph size for all types of proteins. The mean and standard deviation of the different gradients at each graph size are displayed. (a) α-Proteins; (b) β-proteins; (c) α/β-proteins.

A fit to such an equation for the data in Figure 4 indicates that the gradient is:

$$k \approx 2.8\,G^{0.75} \qquad (4)$$

irrespective of whether the graph is an α, β or an α/β protein. Similarly, a fit to the data in Figure 5 indicates that:

$$\log I \approx 4\,G^{0.37} \qquad (5)$$

again independent of the class of the protein. From equation (2), it follows that the expected frequency for a random structure, with a score S and graph size G, is:

$$F(S,G) = \exp\left(4\,G^{0.37} - 2.8\,G^{0.75} S\right) \qquad (6)$$

The E-value for the score is:

$$E(S,G) = \int_{S}^{1} F(S,G)\,dS \qquad (7)$$

Equation (6) indicates that the amount of similarity between members of the protein fold world, as viewed through GRATH, is not dependent on the class of the protein but is dependent only on the size of the protein. Once this is accounted for, the fold similarity between any two proteins can be described in a simple and succinct manner.
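Because equation (6) is a simple exponential in S, the integral in equation (7) has a closed form: E(S,G) = [F(S,G) − F(1,G)]/k. The sketch below evaluates equations (4) to (7) on that basis; the closed form is our own elementary integration, and the function names are illustrative.

```python
import math

def decay_rate(g):
    """Equation (4): k as a function of graph size G."""
    return 2.8 * g ** 0.75

def log_intercept(g):
    """Equation (5): log I as a function of graph size G."""
    return 4.0 * g ** 0.37

def expected_frequency(s, g):
    """Equation (6): expected frequency of score s for graph size g."""
    return math.exp(log_intercept(g) - decay_rate(g) * s)

def e_value(s, g):
    """Equation (7): integral of F from s to the maximum score of one."""
    return (expected_frequency(s, g) - expected_frequency(1.0, g)) / decay_rate(g)
```

For a fixed graph size the E-value falls monotonically as the score rises and is zero at S = 1, the upper limit of the integral.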

One implication of this simplicity in the scoring scheme is that it is possible to determine what score is required to reach a given significance. In other words, we can easily calculate the score that will only occur by chance once between any two protein structures of a given graph size (frequency = F(S,G) = 1). Rearranging equation (6) gives:

$$S(F = 1) = 1.43\,G^{-0.38} \qquad (8)$$

Figure 6 shows the scores predicted to occur once by equation (8) compared to the scores obtained once from fits (at frequency F = 1) to each of the score distributions for different representatives in the Sreps library. It can be seen that equation (8) provides an accurate description of the data for all graphs that contain four or more nodes, irrespective of whether GRATH is observing α, β or α/β domains. It is only for the smallest domains that equation (8) breaks down. Inspection of Figures 5 and 6 indicates that it is for the smallest domains that the validity of equation (5), and hence equation (6), becomes questionable.
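For reference, the rearrangement of equation (6) that leads to equation (8) can be written out explicitly. The steps below are our own working from the fitted coefficients quoted in equations (4) and (5), with the final values rounded as in equation (8).

```latex
\begin{align}
F(S,G) = 1
  &\iff 4\,G^{0.37} - 2.8\,G^{0.75} S = 0 \\
  &\iff S = \frac{4}{2.8}\,G^{0.37 - 0.75} \approx 1.43\,G^{-0.38}
\end{align}
```

On these coefficients the predicted score exceeds the maximum of one when G < 1.43^{1/0.38} ≈ 2.6, i.e. for graphs of one or two nodes, which is one way of seeing why the expression cannot hold for the smallest domains.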

Another demonstration of the simplicity of our approach can be seen by determining how the expected frequency varies as a function of score, for a given graph size. For example, how often is the maximum score of one (S = 1) obtained? Since the E-value for a particular score is given by the number of times scores of this value and above are obtained by chance, and since a score of one is the maximum possible score, the E-value is simply the expected frequency at which this score occurs. Rearranging equation (6) gives:

$$\log_{10} F(S = 1, G) = \frac{4\,G^{0.37} - 2.8\,G^{0.75}}{2.3} \qquad (9)$$

Figure 5. The intercept is related to the graph size for all types of proteins. The mean and standard deviation of the different intercepts at each graph size are displayed. (a) α-Proteins; (b) β-proteins; (c) α/β-proteins.

Figure 7 shows frequency values for scores of one predicted from equation (9), compared to the frequency observations obtained from separate fits to each of the distributions for representatives in the Sreps library. The fits are good, again confirming that the simple expression written in equation (6) provides an accurate description of structural similarities across the protein fold world.

Figure 6. The score at which one example is expected to be found. The data points are calculated from fits to each of the unique distributions. The fit to the data is generated from the fits to k and log I. The fit, equation (8), is a simple function of size and is independent of the class of proteins. (a) α-Proteins; (b) β-proteins; (c) α/β-proteins.

Figure 7. The frequency expected for a given graph size of a score of 1. The fit, equation (9), is a simple function of graph size and is independent of the class of protein. (a) α-Proteins; (b) β-proteins; (c) α/β-proteins.


Adjusting the score distributions and statistics for fold size

As discussed above, the parameters for fitting the score distributions for each Srep are dependent on the size of the protein. This means that the significance of the highest scoring matches will not be equivalent for different sized proteins. Each query structure scanned against the Sreps data-set using GRATH results in a distribution of scores that has its own set of E-values. Figure 8 shows the expectation value (E-value) of the top hit obtained, from each Srep query, for the different classes of proteins. The trend with graph size indicates that the top hits for large graphs are of higher significance than the hits for smaller graphs.

The significance of a GRATH score is a function of size; smaller proteins are more likely to have a large overlap score by chance. Simply studying the mean score or statistical significance for a given fold maintains the size dependence. This makes it difficult to compare fold similarities across fold space. However, it is possible to circumvent this dependence by comparing how the score for a given size is related to the score expected for that size.

Equation (8) predicts the score that is expected to occur only once in a distribution. By taking the ratio of the GRATH score, obtained using equation (1), to the score for which one match is expected for that size (equation (8)), the resulting distribution is normalised (Figure 9) and is independent of size. Figure 9 shows the proportion of folds in the Treps library that exhibit a particular score ratio, for a typical distribution.

The size independence only breaks down for small proteins. As seen from equation (8), the expected score for small proteins becomes greater than the allowed maximum value of one. Therefore, in our following analysis of fold space, only distributions for domains that contain four or more nodes were considered. The size independence of the distributions thereby obtained is confirmed by comparing all the distributions.

Figure 8. For each Srep the theoretical expectation value for the most significant match is plotted against the size of the query graph. For a given GRATH overlap score, the folds for larger proteins are predicted with a higher significance than those of small proteins. A Hit represents the fold with the highest score being correctly identified, a Miss represents an incorrect assignment. (a) α-Proteins; (b) β-proteins; (c) α/β-proteins.

Figure 9. A typical distribution obtained from comparing a domain against the Treps library (one example selected for each fold in the CATH database). The Score ratio is derived by dividing the GRATH score observed, from equation (2), by the score for which one fold is expected (equation (8)). The counts are normalised by the number of different folds in the Treps library.


Figure 10 aligns all the histograms in the form of Figure 9, ranking them according to graph size. The ordering by fold number in Figure 10 is not directly proportional to the size of the graph, but it is clear that there is no obvious trend with size across the histograms.

Any signal seen in Figure 10 is caused by constraints on secondary structure packing, not by any biases introduced by GRATH. That is, there may still remain some dependence of score on the fold type or architecture. For example, some folds (e.g. αβ-barrels) have many repetitive super-secondary motifs (e.g. αβ motifs) that recur extensively in other proteins adopting different overall folds (e.g. αβ sandwiches). This shifts the distributions as it results in a greater proportion of higher scores than expected by chance. Normalising by size will therefore not be sufficient to obtain a distribution similar to those given by other folds containing less popular structural motifs. This effect is discussed in more detail below. However, Figure 10 shows that this is a relatively small effect, as there are considerable similarities amongst the majority of distributions.

Analysis of fold gregariousness

Having normalised the score distributions for each fold, we can now consider how similar a given fold is to other folds in the library. In order to quantify how well a fold matches to the rest of the database, the upper tails of the histograms in Figures 9 and 10 were studied (score ratio ≥0.7). In other words, we measured the proportion of folds giving a score ratio ≥0.7. This value was chosen because a score ratio of 0.7 represents a significant structural overlap, as well as containing a reasonable signal. The majority of overlaps between folds detected with this score ratio comprise, on average, four to five secondary structures; for average sized proteins of ~150 residues, this is typically 40% of the secondary structures comprising that fold (Figure 11).

The integral under these tails gives the fraction of reasonable matches to other representatives of the data-set, e.g. a value of 0.1 means 10% of the data-set matches to the fold with a score ratio ≥0.7. It quantifies the number of close neighbours or how fond of company each fold is, and it is therefore termed the gregariousness measure.
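The definition above can be illustrated with a minimal Python sketch. It assumes that the score expected once for a given graph size (equation (8)) has already been evaluated and is simply passed in as a number; the function names and the toy ratios are hypothetical and are not part of GRATH.

```python
from typing import Mapping

SCORE_RATIO_THRESHOLD = 0.7  # cut-off quoted in the text


def score_ratio(grath_score: float, expected_once_score: float) -> float:
    """Observed overlap score divided by the score expected once for that size."""
    if expected_once_score <= 0:
        raise ValueError("expected_once_score must be positive")
    return grath_score / expected_once_score


def gregariousness(ratios_to_other_folds: Mapping[str, float],
                   threshold: float = SCORE_RATIO_THRESHOLD) -> float:
    """Fraction of the other folds whose score ratio reaches the threshold."""
    if not ratios_to_other_folds:
        return 0.0
    hits = sum(1 for r in ratios_to_other_folds.values() if r >= threshold)
    return hits / len(ratios_to_other_folds)


# Toy data (invented): score ratios of one query fold against ten other folds.
toy_ratios = {
    f"fold_{i}": r
    for i, r in enumerate([0.95, 0.72, 0.70, 0.65, 0.40,
                           0.35, 0.30, 0.22, 0.15, 0.10])
}
print(gregariousness(toy_ratios))  # three of the ten folds reach 0.7
```

In this toy case the value of 0.3 would mean that 30% of the other folds match the query with a score ratio of at least 0.7.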

In essence, gregariousness measures how many other folds have a significant structural overlap with a particular fold, and yet have a different overall topology. For example, eukaryotic translation initiation factor eIF4E (PDB-ID 1ap8,30 C = 3, A = 30, T = 760, a two-layer α/β sandwich) is particularly gregarious. This fold contains three α-helices packed against an eight-stranded antiparallel β-sheet (see Figure 12). The centre of this fold comprises a commonly occurring structural motif, comprising three β-strands and one α-helix, which can be described as an αβ-meander. It is found to occur in the core of many αβ folds adopting a two-layer sandwich architecture. However, it also occurs as a core within larger folds, and as such was also named as one of five common structural motifs,15 or attractors.

Figure 11. Using all the significant matches (score ratio ≥0.7) for the 50 most gregarious folds. (a) The mean clique size for gregarious matches is four to five secondary structures. (b) The mean ratio of the clique size to the geometric mean number of secondary structures is ~40%.

Figure 10. The ordering by size of normalised histograms (the grey-scale translates to the y-axis in Figure 9) for all folds whose graphs contain four or more nodes. There is no obvious size dependence, which indicates that the scaling in Figure 9 is consistent across the database. Gregariousness is defined as the sum of the histogram over score ratios from 0.7 upwards.

Several theories suggest themselves as to why this αβ-meander motif is prolific within fold space: the conformation may be the result of a particularly dominant folding pathway or it may represent a particularly stable arrangement of secondary structures. Alternatively, such a motif may have occurred early in evolutionary time and subsequently been embellished in different ways giving rise to different folds. However, it is because eukaryotic translation initiation factor contains this commonly occurring substructure that it is identified as gregarious.

Figure 12 also shows that eukaryotic translation initiation factor has significant structural overlaps with other fold groups, based on similarities in their β-sheets. The structural overlaps shown in the figure are typically 70% of the smaller fold. These matches reflect the tight constraints on the conformations of β-sheets imposed by the physicochemical requirements of hydrogen bonding between the β-strands. As such, they do not reveal any biologically interesting similarity suggestive of a putative evolutionary relationship but merely reflect the constraints imposed on β-sheet formation.

Another highly gregarious fold is the first domain from the envelope glycoprotein from tick-borne encephalitis virus (PDB 1svb,31 C = 2, A = 60, T = 980, a two-layer β-sandwich), containing nine β-strands. This fold contains several frequently occurring supersecondary structures, e.g. a β-hairpin packed against a three-strand β-meander, observed in many of the other two-layer β-sandwiches.

Figure 13 shows significant structural matches observed with the human lysosomal aspartylglucosaminidase (PDB 1apy32 chain B, C = 3, A = 50, T = 11, a three-layer ββα-sandwich). As for eukaryotic translation initiation factor, many of these matches simply involve similarities in the β-sheets. However, the match to tetrahydrobiopterin synthase subunit A (PDB 1b66,33 C = 3, A = 30, T = 479) involves a more unusual structural motif comprising four β-strands and two α-helices. Intriguingly, both proteins contain a ligand-binding site at corresponding locations within this motif, though the molecules they bind (7N-methyl-8-hydroguanosine-5′-diphosphate and biopterin) are quite different. There is no obvious similarity in the functional properties of the two proteins either, although this does not exclude the possibility that the structural similarity may highlight a very ancient evolutionary relationship. During evolution, changes in the protein sequences and associated structural changes could have resulted in modifications to the protein function.34

Figure 12. 1ap800 is gregarious, highlighting that it has a significant overlap with many folds, including five shown here.

Extensive structural changes occur between some very distant homologues.35 The extent of structural variation suggests12 that a fold change has occurred between these relatives. Indeed, there are cases within CATH where the structural similarity between relatives, as measured by our residue-based SSAP algorithm, gives a score (<60) well below the empirical threshold (≥70) used to assign proteins to the same fold group. However, because the CATH database assigns proteins to a structural family using single linkage clustering, i.e. providing they match at least one relative, these proteins have been pulled into the same family because of high scores to other close relatives within the family. Interestingly, the normalised ratios measured by GRATH for the matches between these very distant homologues are very good, and would have meant that the pair would have been selected for further analysis and manual evaluation on the basis of the GRATH significance level. However, the matches constitute less than 60% of the structure of the larger fold. The implications these variations have for the use of rigid fold group assignments within a hierarchical structural family database are discussed further below.

Domains that exhibit a low gregariousness are those that have distinctive folds, with few common motifs or common motifs that are packed in unusual arrangements or comprise a small proportion of the structure (<20%). Folds found in the less populated architectures, and in fact all of the folds in each of the following sparsely populated architectures, have a low gregariousness (<0.1): α-horseshoe; α-solenoid; β-propellers with five or more blades; β-3-solenoids; β-distorted sandwiches; β-clam; α–β horseshoe; α–β box; α–β prism; α–β propeller.

Most of the superhelices exhibit low gregariousness despite containing commonly occurring super-secondary structural motifs. In this case it is because the common motif constitutes a very small proportion of the fold. For instance, the armadillo repeat region from murine β-catenin (PDB 2bct,29 C = 1, A = 25, T = 30) is assembled from repeats of a relatively short sequence motif containing 42 amino acid residues that folds into a repeating structural unit comprising three α-helices (Figure 14). The three-helix motif is often observed in orthogonal α-bundle architectures, where it typically constitutes 50% of the fold. However, in β-catenin this repeat constitutes only 10% of the fold. Furthermore, although contiguous arm repeats create an α/α right-handed superhelix, similar 3D arrangements of these repeats are rarely observed in other globular proteins. Consequently, the small global overlap between these and other globular proteins results in a low gregariousness value.

Figure 13. 1apyB0 is gregarious, highlighting that it has a significant overlap with many folds, including five shown here.

Gregariousness in different structural classes and architectures

In summary, gregarious folds often contain commonly occurring super-secondary structural motifs. Apart from a single example of a two-layer β-sandwich, all the most gregarious folds which match over 20% of the other folds in the database (gregariousness value >0.2) are α–β proteins, and generally occur within highly populated architectural regions of fold space.

However, the gregariousness of folds shows a large diversity within certain architectures (Figure 15). Indeed, most architectures for which there are more than five fold members show such a diversity. For these architectures, there are folds with motifs that match significantly to many other folds, but there are also folds with unusual topologies that have few significant matches to the rest of the fold world, e.g. the β-lactamase fold (PDB 2blt35 chain A, C = 3, A = 40, T = 710, a three-layer αβα-sandwich) seen in Figure 16. Although the three-layer α–β–α core of the β-lactamase fold is not uncommon, and the fold contains several motifs, the nature and degree of embellishment is unusual and extensive, so that any structural motifs overlapping the core constitute a very small proportion of the overall fold.

Figure 14. Superhelices are distinctive even though they contain a repeat. (a) 2bct00 is a superhelix, (b) 1gdtA3 is the motif that is repeated.

Figure 15. Gregariousness measured for the different folds. (a) α-Proteins; (b) β-proteins; (c) α/β-proteins.


The mean gregariousness across the database is 0.1, i.e. on average each fold shows some similarity with 10% of other folds. Although within architectures the different folds exhibit different degrees of gregariousness, by calculating the mean, some architectures can be described as typically gregarious (such as the α–β two-layer sandwich, mean = 0.14), whereas others can be described as not gregarious (such as the β-roll, mean = 0.07). Figure 17 shows how the mean gregariousness varies across architectures. Figure 18 shows gregariousness as a function of class. There is a peaked distribution in gregariousness for α-proteins, indicating that many α-proteins have similar values of gregariousness, which are quite low. This is because α-proteins exhibit similarity to each other, but are distinctly different from the folds in other classes, though there may be small overlap with individual helices and helix hairpins in α–β folds. It may also be due to the fact that there appear to be fewer constraints on helix packing, so that the orientations of α-helices can vary considerably in relatives from the same protein family.36,37

Mainly β-proteins have a more even distribution of gregariousness, caused by the diverse range of architectures for these proteins, some of which overlap while others are more distinctive. α/β-proteins are seen to be the class that has the most gregarious members, reflecting the large number of α/β folds that often re-use similar super-secondary structural motifs. Figure 15 indicates that many of these gregarious folds come from the α/β two-layer sandwich architecture, although other architectures also have examples of gregarious folds.

Discussion

The size calibration used by GRATH has allowed us to introduce the gregariousness measure. This measures the local topology density and should therefore prove useful in quantifying the level of motif sharing amongst folds. The gregariousness measure may also provide a second order correction to the formulation of GRATH's statistics, that is, allowing us to normalise scores for a particular fold according to whether that fold contains a high proportion of commonly occurring motifs. The first order correction, the size dependence, provides a global measure for the significance of fold matches, given the global characteristics of fold space.

However, answering certain scientific questions may need an accurate description of local regions of fold space, i.e. how different is a fold from its peers? Such a measure should be valuable in the Critical Assessment of Structure Prediction (CASP) experiments held every two years in the USA. Structures found to have low gregariousness are likely to be much harder to predict by knowledge-based methods than those with high gregariousness.

Gregariousness can also be used to examine the existence of a fold continuum, i.e. whether existing motifs are added to when constructing new folds13 and whether this continuum contains particularly common motifs.15 Moreover, looking for cliques in a graph of an all versus all comparison of fold space will highlight distinct regions of motif sharing. Our analysis has clearly revealed that not all fold groups and architectures are gregarious and some are very distinct. Thus a fold continuum appears to exist only in certain architectural regions of fold space, where folds are largely linked by common super-secondary motifs.

Figure 16. The embellishment of β-lactamase is unusual and extensive.

Figure 17. The average gregariousness measured for each architecture in CATH.

Figure 18. The average gregariousness measured for each class in CATH.

What are the implications of this result for structural classifications? Is it valuable to cluster proteins into specific “fold groups” as in the CATH and SCOP hierarchies, given the observation that structural relatives even within homologous superfamilies can vary considerably in their structure, to the point of adopting different folds?12

The hierarchical nature of the SCOP and CATH classifications, which results in all proteins within the same evolutionary superfamily automatically being assigned to the same fold group, reflects an emphasis on recognising evolutionary relationships, which may therefore reveal common functional properties and inform our understanding of evolutionary mechanisms. From that point of view the clustering of structures into common homologous superfamilies and, by implication, common fold groups can be very helpful in analysing the structural changes that have occurred during evolution. The assignment to a common fold group should not be viewed too rigidly, or be assumed to have a significance beyond that of helping to rationalise the amount of information in the classification. As such, fold groups should perhaps be more accurately described as fold neighbourhoods, grouping together proteins that share significant structural motifs either for evolutionary reasons or due to constraints on secondary structure matching.

Furthermore, analysis of structural variation across superfamilies using the gregariousness measure will indicate the degree of significant structural overlap occurring within different superfamilies and fold groups in neighbouring fold space. This can be used to establish a threshold E-value for listing all other structural neighbours occupying the same region of fold space but which belong to different fold groups and/or different homologous superfamilies (Figure 19). These links may reveal very distant evolutionary relationships that could be validated by more careful manual evaluation and review of the literature. In the future, proteins in the CATH database linked in this way from different superfamilies but adopting similar folds will be grouped into a new level, the hyperfamily, intermediate between fold group and superfamily, to indicate a putative evolutionary relationship meriting further examination. In addition, a structural neighbours list will be maintained for each Srep in the CATH database, comprising all other Sreps from different fold groups that exhibit a significant structural overlap as measured by GRATH.

Summary

Here we report the mapping of fold space using GRATH. GRATH employs a graph theoretic technique and its fold assignments are based on matches to the set of existing folds within the CATH database.1 As protein structures are not random objects, GRATH was designed to provide an empirical description of the observed similarities in the protein fold world. The upper tail of the GRATH distribution is found to be described by an extreme-value distribution, for any domain of interest. The significance of a match between two folds is dependent only on the overlap score and the number of secondary structures in the domain. This is independent of whether the domain contains a predominance of α-helices, β-strands or a mixture of the two.

The size calibration used by GRATH has allowed us to introduce a measure of the local topology density, termed the gregariousness measure. Gregariousness measures how many other folds have a significant structural similarity to a particular fold, and yet have a different overall topology. Gregarious folds often contain commonly occurring super-secondary structural motifs. Apart from a single example of a two-layer β-sandwich, all the most gregarious folds that match over 20% of the other folds in the database (value >0.2) are α–β-proteins, and generally occur within highly populated architectural regions of fold space.

Figure 19. Use of hyperfamilies to link superfamilies. (a) Significant structural overlap but different folds. (b) Hyperfamily links can be within a fold group, indicating putative homologues.

Domains that exhibit a low gregariousness are those that have distinctive folds, with few common motifs or motifs that are packed in unusual arrangements. Folds found in less populated architectures have a low gregariousness (<0.1). Most of the superhelices exhibit low gregariousness despite containing commonly occurring super-secondary structural motifs.

Analysis using GRATH suggests that most gregarious folds contain structural overlaps comprising common structural motifs such as αβ-meanders, αβ-plaits or extensive similarities in β-sheets. Furthermore, these overlaps typically comprise four to five secondary structures and typically constitute 40% of the larger structure. The sound statistical basis of GRATH's similarity measure can illuminate the structural neighbourhood of a particular fold and suggest whether this fold occupies a “continuous” or “discrete” region of fold space. Mapping these relationships into our CATH structural classification will help in the assessment of structure prediction and also in identifying very distant evolutionary relationships occurring at the level of the subdomain.

Methods

GRATH

A graph is a mathematical description of a system. It represents both the layout of a system, and how the components of the system interact with each other. In graph theory terminology, the components of the system are called nodes and the interactions are named edges. The fold of a protein is readily described by a graph.27 Grindley’s description had the nodes as vector representations of the secondary structures, labelled by the type of secondary structure (α-helix or β-strand). The edges are labelled by the distance and the “dot-product” angle between the two vectors. The nodes of the graphs that GRATH uses have the same labels, and the edges are likewise labelled by distance and dot-product angle. However, GRATH's edges are also labelled with the dihedral angle and a chirality measure.

The translation of structures into graphs allows a determination of the amount of overlap in the geometrical descriptions of two proteins (A.H., F.P., I. Sillitoe, T. Slidel, R.M., J.T. & C.O., unpublished results).27 This is achieved by generating two matrices. The first matrix, called the G1G2 matrix, is generated by finding all the matches in the types of secondary structure (α-helix or β-strand) between two domain graphs. The algorithm proceeds to study pairs of matches in the G1G2 matrix. Each pair of matches in G1G2 corresponds to two nodes in both the first and second protein graphs. The edges between these two nodes have a distance, two angles and chirality associated with them. For every secondary structure label match in G1G2, the edge measurements in the two proteins are checked to see whether they are the same, within a chosen error tolerance. This results in a second matrix, referred to as the correspondence matrix. GRATH then uses the Bron & Kerbosch38 algorithm to find cliques in the correspondence graph; a clique in the correspondence graph indicates a set of geometrically equivalent secondary structures.
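The two-stage procedure described above can be sketched in Python as follows. This is a simplified illustration rather than the GRATH implementation: edges here carry only a distance (the angles and chirality are omitted), matched pairs are required to advance along both chains, the tolerance of 1.0 is arbitrary, and the function names and toy coordinates are hypothetical. The clique search is the basic Bron–Kerbosch recursion without pivoting.

```python
from itertools import combinations


def correspondence_graph(types1, dist1, types2, dist2, tol=1.0):
    """Adjacency sets over matched node pairs (i, j) of two domain graphs."""
    # G1G2 step: every pair of secondary structures of the same type.
    nodes = [(i, j) for i, a in enumerate(types1)
             for j, b in enumerate(types2) if a == b]
    adj = {n: set() for n in nodes}
    for (i, j), (k, l) in combinations(nodes, 2):
        # nodes are in lexicographic order, so i <= k here; demand that both
        # chains advance and that the edge distances agree within tol.
        if i < k and j < l and abs(dist1[i][k] - dist2[j][l]) <= tol:
            adj[(i, j)].add((k, l))
            adj[(k, l)].add((i, j))
    return adj


def bron_kerbosch(adj, r=frozenset(), p=None, x=None):
    """Yield every maximal clique (basic Bron-Kerbosch, no pivoting)."""
    if p is None:
        p, x = set(adj), set()
    if not p and not x:
        yield r
        return
    for v in list(p):
        yield from bron_kerbosch(adj, r | {v}, p & adj[v], x & adj[v])
        p.remove(v)
        x.add(v)


def distance_matrix(positions):
    """Toy distances from 1D positions, standing in for real 3D geometry."""
    return [[abs(a - b) for b in positions] for a in positions]


# Toy domains (invented): H = alpha-helix, E = beta-strand.
types1, dist1 = ["H", "E", "E", "H"], distance_matrix([0, 5, 10, 15])
types2, dist2 = ["E", "H", "E", "E", "H"], distance_matrix([0, 2, 7, 12, 30])

adj = correspondence_graph(types1, dist1, types2, dist2)
best = max(bron_kerbosch(adj), key=len)
print(sorted(best))  # [(0, 1), (1, 2), (2, 3)]
```

In this toy case the largest clique pairs the first three secondary structures of the first domain with structures 1 to 3 of the second; the final helices are not included because their distances to the other matched elements disagree.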

Data set used for benchmarking GRATH

Each fold group in the CATH classification consists of one or more homologous superfamilies. Within each homologous superfamily, proteins are further classified into sequence families, comprising close homologues. Each member of a sequence family must have at least 35% sequence identity with at least one other family member. GRATH was calibrated with version 1.7 of CATH, which contained 1702 different sequence families. One representative structure was selected from each of these sequence families to create a data-set for testing GRATH, hereafter referred to as the Sreps data-set.
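The membership rule for sequence families, in which a domain joins a family if it is sufficiently similar to at least one existing member, is a single linkage criterion. A minimal Python sketch using a union-find structure is given below; the function name, the pairwise identity table and the choice of the first member as representative are all assumptions made for illustration and do not describe how CATH actually selects its Sreps.

```python
def single_linkage_families(n_items, identities, threshold=35.0):
    """Group items so each shares >= threshold identity with some other member."""
    parent = list(range(n_items))

    def find(a):
        # Follow parent links to the root, shortening the path as we go.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for (a, b), identity in identities.items():
        if identity >= threshold:
            parent[find(a)] = find(b)  # merge the two families

    families = {}
    for item in range(n_items):
        families.setdefault(find(item), []).append(item)
    return sorted(families.values())


# Toy pairwise sequence identities in percent (invented values).
toy_identities = {(0, 1): 60.0, (1, 2): 40.0, (0, 2): 20.0,
                  (2, 3): 10.0, (3, 4): 30.0}
families = single_linkage_families(5, toy_identities)
representatives = [family[0] for family in families]
print(families)         # [[0, 1, 2], [3], [4]]
print(representatives)  # [0, 3, 4]
```

Domains 0 and 2 share only 20% identity in this toy table, yet fall into the same family because each is linked to domain 1, which is the behaviour single linkage implies.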

Representatives were selected from each sequence family, rather than each fold group, to capture the structural variation that can occur across a fold group. Traditionally, proteins classified as adopting a common fold may only share a common structure for some two-thirds of their residues, generally in the core of the structures. In some fold groups this common core may only occupy 50% of the structure.39 Outside this common fold core, considerable variation in the degree to which folds can be embellished is also seen across homologous superfamilies39 and in extreme cases is sometimes described as a change of fold.12 However, within a sequence family, a much higher degree of structural similarity is observed. So by selecting representatives from each sequence family we can ensure that all currently known structural embellishments adopted by a particular fold group are represented within the dataset and thereby improve the probability of matching a query structure to its appropriate fold group.

To assess coverage, each member of the Sreps library was scanned against all other Sreps in the library. Coverage was measured as the percentage of Sreps for which an Srep from the same fold group occurred within the top ten matches returned by GRATH.
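The coverage statistic can be written down as a short Python sketch. It assumes that the ranked hit list for every query is already available; the function name, the identifiers and the fold labels in the toy data are hypothetical.

```python
def top_n_coverage(ranked_hits, fold_of, n=10):
    """Percentage of queries with a same-fold hit among their top n matches."""
    if not ranked_hits:
        return 0.0
    covered = 0
    for query, hits in ranked_hits.items():
        # The query itself is excluded before taking the top n matches.
        others = [h for h in hits if h != query][:n]
        if any(fold_of[h] == fold_of[query] for h in others):
            covered += 1
    return 100.0 * covered / len(ranked_hits)


# Toy library of four representatives in three fold groups (invented).
fold_of = {"a": "F1", "b": "F1", "c": "F2", "d": "F3"}
ranked_hits = {
    "a": ["b", "c", "d"],
    "b": ["c", "a", "d"],
    "c": ["a", "b", "d"],
    "d": ["c", "a", "b"],
}
print(top_n_coverage(ranked_hits, fold_of))       # 50.0
print(top_n_coverage(ranked_hits, fold_of, n=1))  # 25.0
```

Representatives that are the sole member of their fold group, such as "c" and "d" here, can never be covered, so in practice one might report coverage only over queries that have at least one fold-mate in the library.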

Acknowledgements

We are very grateful to: Ram Samudrala for providing a copy of the Bron–Kerbosch algorithm in C through the Internet; Jonathan Barker for discussions about the statistical issues related to our use of graph theory; Ian Longden, Andreas Brakoulias & David Gilbert for discussions about aspects of algorithms and improving the efficiency of GRATH; all the members of the CATH team for their help and advice. Finally, we express our thanks to the Medical Research Council for the provision of funds that enabled us to pursue the above research.


References

1. Orengo, C., Michie, A., Jones, S., Swindells, M. & Thornton, J. (1997). CATH – a hierarchic classification of protein domain structures. Structure, 5, 1093.

2. Murzin, A., Brenner, S., Hubbard, T. & Chothia, C. (1995). SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247, 536.

3. Jones, D. T. (2000). Protein structure prediction in the postgenomic era. Curr. Opin. Struct. Biol. 10, 371.

4. Chothia, C. (1992). One thousand families for the molecular biologist. Nature, 357, 543.

5. Orengo, C., Jones, D. & Thornton, J. (1994). Protein super-families and domain super-folds. Nature, 372, 631.

6. Brenner, S., Chothia, C. & Hubbard, T. (1997). Population statistics of protein structures: lessons from structural classifications. Curr. Opin. Struct. Biol. 7(3), 369.

7. Sternberg, M. & Thornton, J. (1976). On the conformation of proteins: the handedness of the β-strand-α-helix-β-strand unit. J. Mol. Biol. 105, 367.

8. Sibanda, B. & Thornton, J. (1985). β-Hairpin families in globular-proteins. Nature, 316, 170.

9. Orengo, C. & Thornton, J. (1993). α/β folds revisited: some favoured motifs. Structure, 1, 105.

10. Hutchinson, E. & Thornton, J. (1996). PROMOTIF – a program to identify and analyse structural motifs in proteins. Protein Sci. 5, 212.

11. Lupas, A., Ponting, C. & Russell, R. (2001). On the evolution of protein folds: are similar motifs in different protein folds the result of convergence, insertion or relics of an ancient peptide world. J. Struct. Biol. 134, 191.

12. Grishin, N. (2001). Fold change in evolution of protein structure. J. Struct. Biol. 134, 167.

13. Orengo, C., Flores, T., Taylor, J. & Thornton, J. (1993). Identification and classification of protein fold families. Protein Eng. 6, 485.

14. Koppensteiner, W., Lackner, P., Wiederstein, M. & Sippl, M. (2000). Characterization of novel proteins based on known protein structures. J. Mol. Biol. 296, 1139.

15. Holm, L. & Sander, C. (1996). Mapping the protein universe. Science, 273, 595.

16. Shindyalov, I. & Bourne, P. (1998). Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng. 11, 739.

17. Orengo, C. (1994). Classification of protein folds. Curr. Opin. Struct. Biol. 4, 429.

18. Taylor, W. & Orengo, C. (1989). Protein structure alignment. J. Mol. Biol. 208, 1.

19. Holm, L. & Sander, C. (1993). Protein structure comparison by alignment of distance matrices. J. Mol. Biol. 233, 123.

20. Madej, T., Gibrat, J.-F. & Bryant, S. (1995). Threading a database of protein cores. Proteins: Struct. Funct. Genet. 23, 356.

21. Sowdhamini, R., Burke, D., Huang, J., Mizuguchi, K., Nagarajaram, H., Srinivasan, N. et al. (1998). CAMPASS: a database of structurally aligned protein superfamilies. Structure, 6, 1087.

22. Orengo, C., Brown, N. & Taylor, W. (1992). Fast structure alignment for protein databank searching. Proteins: Struct. Funct. Genet. 14, 139.

23. Holm, L. & Sander, C. (1998). Dictionary of recurrent domains in protein structures. Proteins: Struct. Funct. Genet. 33, 88.

24. Levitt, M. & Gerstein, M. (1998). A unified statistical framework for sequence comparison and structure comparison. Proc. Natl Acad. Sci. 95, 5913.

25. An-Suei, Y. & Honig, B. (2000). An integrated approach to the analysis and modeling of protein sequences and structures. II. On the relationship between sequence and structural similarity for proteins that are not obviously related in sequence. J. Mol. Biol. 301, 679.

26. Dietmann, S. & Holm, L. (2001). Identification of homology in protein structure classification. Nature Struct. Biol. 8, 953.

27. Grindley, H., Artymiuk, P., Rice, D. & Willet, P. (1993). Identification of tertiary structure resemblance in proteins using a maximal common subgraph isomorphism algorithm. J. Mol. Biol. 229, 707.

28. Altschul, S., Gish, W., Miller, W., Myers, E. & Lipman, D. (1990). Basic local alignment search tool. J. Mol. Biol. 215, 403.

29. Huber, A., Nelson, W. & Weis, W. (1997). Three-dimensional structure of the armadillo repeat region of beta-catenin. Cell, 90, 871.

30. Matsuo, H., Li, H., McGuire, A., Fletcher, C., Gingras, A., Sonenberg, N. & Wagner, G. (1997). Structure of translation factor eIF4E bound to m7GDP and interaction with 4E-binding protein. Nature Struct. Biol. 4, 717.

31. Rey, F., Heinz, F., Mandl, C., Kunz, C. & Harrison, S. (1995). The envelope glycoprotein from tick-borne encephalitis virus at 2 Å resolution. Nature, 375, 291.

32. Oinonen, C., Tikkanen, R., Rouvinen, J. & Peltonen, L. (1995). Three-dimensional structure of human lysosomal aspartylglucosaminidase. Nature Struct. Biol. 2, 1102.

33. Ploom, T., Thony, B., Yim, J., Lee, S., Nar, H., Leimbacher, W. et al. (1999). Crystallographic and kinetic investigations on the mechanism of 6-pyruvoyl tetrahydropterin synthase. J. Mol. Biol. 286, 851.

34. Todd, A., Orengo, C. & Thornton, J. (2001). Evolution of function in protein superfamilies, from a structural perspective. J. Mol. Biol. 307, 1113.

35. Kinch, L. & Grishin, N. (2002). Evolution of protein structures and functions. Curr. Opin. Struct. Biol. 12, 400.

36. Lobkovsky, E., Moews, P., Liu, H., Zhao, H., Frere, J. & Knox, J. (1993). Evolution of an enzyme activity: crystallographic structure at 2 Å resolution of cephalosporinase from the ampC gene of Enterobacter cloacae P99 and comparison with a class A penicillinase. Proc. Natl Acad. Sci. USA, 90, 11257.

37. Chothia, C. & Lesk, A. (1986). The relation between the divergence and structure in proteins. EMBO J. 5, 823.

38. Bron, C. & Kerbosch, J. (1973). Algorithm 457: finding all cliques of an undirected graph. Commun. Assoc. Comput. Mach. 16, 575.

39. Orengo, C., Sillitoe, I., Reeves, G. & Pearl, F. (2001). What can structural classifications reveal about protein evolution. J. Struct. Biol. 134, 145.

Appendix

During the development of GRATH several theoretical approaches for assigning significance to matches between proteins were explored. However, as this section reviews, the graphs of proteins have properties that result in problems for each of the theoretical avenues explored. Below we describe the problems encountered when attempting to develop a theoretical framework, and this is presented as a justification for adopting an empirical approach for assessing the statistical significance of a GRATH match.

GRATH searches for cliques within a correspondence graph. The correspondence graph results from the product of two domain graphs and records whether the geometric information in the two graphs, between pairs of secondary structures, is the same within a given tolerance. The correspondence graph has topological constraints imposed upon it because proteins have a direction, from the N terminus to the C terminus.

The topological constraint allows certain edges to be part of cliques of a certain size, while ruling out their existence in larger cliques. For example, the correspondence edge which translates to the edge between the first and second secondary structures in the first graph, and the penultimate and final secondary structures in the second graph, cannot possibly be in any clique of size greater than two. Moreover, because the cliques are ordered by topology, edges in the correspondence graph can only correspond to a limited set of nodes in the graphs of the two comparison proteins. To illustrate this point, a graph of five nodes (all α) is compared with a graph of six nodes (all α), which results in the G1G2 matrix seen in Table A1. The matrix contains 75 sequences of length 4 that are topologically correct. Examples include [1, 8, 15, 22], [1, 14, 22, 30] and [3, 10, 17, 30] but not [1, 8, 14, 22] (8 and 14 correspond to the same secondary structure in the graph with six nodes) or [2, 7, 15, 22] (going from 2 to 7 results in a move in different directions for the two graphs). The combination of G1G2 positions 1 and 8 appears in 18 of the 75, whereas the combination 7 and 28 appears only once and the combination 19 and 26 does not appear at all. Furthermore, the combination 1 and 8 always appears first (node 1 is 1, node 2 is 8), whereas the combination 8 and 15 can appear second and third (node 1 is 1, node 2 is 8, node 3 is 15) or first and second (node 1 is 8, node 2 is 15). The discounting of edges for topological reasons means that the substantial theory of random graphs (ref. A1) is not directly applicable to the present situation, because the correspondence graphs that GRATH generates are non-random.
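The counts quoted for the five-by-six example can be checked by brute force. The following is a minimal sketch, not GRATH itself: it assumes that a topologically correct sequence is one in which the nodes of both graphs are visited in strictly increasing order, and the function names are hypothetical.

```python
from itertools import combinations


def ordered_sequences(rows, cols, size):
    """List every sequence of G1G2 matrix positions (1-based, row-major)
    that visits the nodes of both graphs in strictly increasing order."""
    return [
        tuple((r - 1) * cols + c for r, c in zip(rs, cs))
        for rs in combinations(range(1, rows + 1), size)
        for cs in combinations(range(1, cols + 1), size)
    ]


def count_containing(seqs, a, b):
    """Number of sequences that contain both matrix positions a and b."""
    return sum(1 for s in seqs if a in s and b in s)


seqs = ordered_sequences(rows=5, cols=6, size=4)
print(len(seqs))                       # 75
print(count_containing(seqs, 1, 8))    # 18
print(count_containing(seqs, 7, 28))   # 1
print(count_containing(seqs, 19, 26))  # 0

# Where position 8 sits (0-based) within sequences containing 8 and 15.
print(sorted({s.index(8) for s in seqs if 8 in s and 15 in s}))  # [0, 1]
```

The uneven counts make the point of the paragraph concrete: edges of the correspondence graph are far from interchangeable, so a single edge probability cannot describe them.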

The number of cliques, $N_c$, of size $C$ in a protein domain graph of size $G$ is:

$$N_c = \frac{G!}{C!\,(G - C)!} \qquad \text{(A1)}$$

It follows that the number of cliques, $N_{cg}$, of size $C$ in the correspondence graph, which is the product of two graphs of size $G_1$ and $G_2$, is:

$$N_{cg} = \frac{G_1!\,G_2!}{C!\,(G_1 - C)!\;C!\,(G_2 - C)!} \qquad \text{(A2)}$$

If the probability of an edge being found is $q$, and $q$ is constant across the graph, any clique of size $C$ will be found with a probability $P_f$ where:

$$P_f = q^{C(C-1)/2} \qquad \text{(A3)}$$

A realistic estimate of $q$ is expected to be of the order of $C^{-1}$. It follows that $P_f$ is very small ($\ll 1$), and so $P_{alo}$, the probability that at least one clique is found, can be written as:

$$P_{alo} \approx N_{cg}\, q^{C(C-1)/2} \qquad \text{(A4)}$$
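Equations (A1) to (A4) translate directly into a few lines of code. The sketch below is illustrative only: the function names are hypothetical, the edge probability is assumed constant (the assumption examined next), and the value q = 0.25 is simply 1/C for the five-by-six example with C = 4.

```python
from math import comb


def n_cliques(g, c):
    """Equation (A1): number of node subsets of size c in a graph of size g."""
    return comb(g, c)


def n_correspondence_cliques(g1, g2, c):
    """Equation (A2): topologically ordered cliques of size c in the product graph."""
    return comb(g1, c) * comb(g2, c)


def p_clique(q, c):
    """Equation (A3): probability that all c(c-1)/2 edges of one clique are present."""
    return q ** (c * (c - 1) // 2)


def p_at_least_one(g1, g2, c, q):
    """Equation (A4): approximation that holds only while the result is much less than 1."""
    return n_correspondence_cliques(g1, g2, c) * p_clique(q, c)


# Five-node graph against six-node graph, clique size 4, assumed q = 1/C.
print(n_correspondence_cliques(5, 6, 4))  # 75
print(p_at_least_one(5, 6, 4, q=0.25))    # roughly 0.018
```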

The expression for $P_{alo}$ is based on the assumption that the probability per edge, $q$, is constant across the correspondence graph. However, this is not the case in the present situation. As discussed above, certain edges can only occur at certain positions in the clique. Therefore, $P_f$ should be formally written as:

$$P_f = \prod_{i=1}^{N_e} q_i \qquad \text{(A5)}$$

where the product runs over the $N_e = C(C-1)/2$ edges of the clique. $q_i$ can be found by sampling the correspondence graph in the relevant positions and weighting the edges by the frequency with which they are observed within a clique.

As an example, the probability is estimated for a comparison between a graph with ten nodes and another with ten nodes, from which a clique of size five is found. For any comparison between any one pair of secondary structures in one domain with the pairs in another domain, it is unlikely that more than two to three pairs in the second domain will match. As many pair comparisons are made, and only a few match, the probability per edge, q, is small, typically 0.1–0.2. For this example, an uncertainty of a factor of two in the probability per edge results in an uncertainty in the final probability estimate of three orders of magnitude. This large uncertainty results from raising an uncertain number, q, to a large power. Deriving qi involves a weighted mean and so the existence or otherwise of only a few edges can radically alter the value for qi. Ultimately, such sensitivity to q in the probability estimate means that reliable expectation values cannot be determined.
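The three orders of magnitude quoted for this example can be reproduced with a short worked calculation. It assumes the constant-q approximation of equation (A4) and takes the two ends of the quoted range, q = 0.1 and q = 0.2, as the factor-of-two uncertainty.

```latex
\begin{align*}
N_e &= \frac{C(C-1)}{2} = \frac{5 \cdot 4}{2} = 10, \\
N_{cg} &= \binom{10}{5}^{2} = 252^{2} = 63504, \\
P_{alo}(q = 0.1) &\approx 63504 \times 0.1^{10} \approx 6.4 \times 10^{-6}, \\
P_{alo}(q = 0.2) &\approx 63504 \times 0.2^{10} \approx 6.5 \times 10^{-3}, \\
\frac{P_{alo}(q = 0.2)}{P_{alo}(q = 0.1)} &= 2^{10} = 1024 \approx 10^{3}.
\end{align*}
```

The ratio depends only on the exponent: a clique of size five has ten edges, so any factor-of-two error in q is raised to the tenth power.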

Table A1. G1G2 matrix

      α    α    α    α    α    α
α     1    2    3    4    5    6
α     7    8    9   10   11   12
α    13   14   15   16   17   18
α    19   20   21   22   23   24
α    25   26   27   28   29   30

Protein structures actually contain distinctive node types, α-helices and β-strands, rather than a single node type. In this case, the number of possible cliques of size C is not given by equation (A2), but by a list of all the topologically correct patterns of size C that are common to both graphs being compared. The probability per edge, q, is found by looking at the frequency with which individual edges occur within these patterns. There is no simple analytical expression for this frequency; it would have to be determined by a brute force search. Such a search rapidly becomes computationally expensive as the clique and graph sizes are increased.
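A brute force search of the kind described can be sketched as follows. This is an illustration under stated assumptions rather than the procedure used in GRATH: each domain is reduced to a hypothetical string of node types ("H" for helix, "E" for strand), geometric tolerances are ignored, and an edge is taken to be any pair of matched positions that co-occur in a pattern.

```python
from collections import Counter
from itertools import combinations


def common_patterns(types1, types2, size):
    """Topologically ordered pairings of `size` nodes whose node types agree.
    Each pattern is a tuple of (index in graph 1, index in graph 2) pairs."""
    patterns = []
    for rs in combinations(range(len(types1)), size):
        for cs in combinations(range(len(types2)), size):
            if all(types1[r] == types2[c] for r, c in zip(rs, cs)):
                patterns.append(tuple(zip(rs, cs)))
    return patterns


def edge_frequencies(patterns):
    """Fraction of patterns in which each pair of matched positions co-occurs."""
    counts = Counter()
    for pattern in patterns:
        counts.update(combinations(pattern, 2))
    total = len(patterns)
    return {edge: n / total for edge, n in counts.items()}


# With a single node type the search reduces to the all-α case of Table A1.
all_alpha = common_patterns("HHHHH", "HHHHHH", 4)
print(len(all_alpha))  # 75

# Hypothetical mixed domains: far fewer patterns survive the type check.
mixed = common_patterns("HEEHE", "HEHEEH", 3)
freqs = edge_frequencies(mixed)
```

The two nested loops visit every choice of `size` nodes from each graph, so the work grows with the product of the two binomial coefficients in equation (A2), which is why such a search becomes expensive for larger cliques and graphs.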

References

A1. Bollobas, B. (1985). Random Graphs, Academic Press, New York.

Edited by B. Honig

(Received 10 October 2001; received in revised form 6 September 2002; accepted 10 September 2002)
