Maps of protein structure space reveal a fundamental relationship between protein structure and function


Margarita Osadchy and Rachel Kolodny1

Department of Computer Science, University of Haifa, Mount Carmel, Haifa 31905, Israel

Edited by Sung-Hou Kim, University of California, Berkeley, CA, and approved June 7, 2011 (received for review February 20, 2011)

To study the protein structure-function relationship, we propose a method to efficiently create three-dimensional maps of structure space using a very large dataset of >30,000 Structural Classification of Proteins (SCOP) domains. In our maps, each domain is represented by a point, and the distance between any two points approximates the structural distance between their corresponding domains. We use these maps to study the spatial distributions of properties of proteins, and in particular those of local vicinities in structure space such as structural density and functional diversity. These maps provide a unique broad view of protein space and thus reveal previously undescribed fundamental properties thereof. At the same time, the maps are consistent with previous knowledge (e.g., domains cluster by their SCOP class) and organize in a unified, coherent representation previous observations concerning specific protein folds. To investigate the function-structure relationship, we measure the functional diversity (using the Gene Ontology controlled vocabulary) in local structural vicinities. Our most striking finding is that functional diversity varies considerably across structure space: The space has a highly diverse region, and diversity abates when moving away from it. Interestingly, the domains in this region are mostly alpha/beta structures, which are known to be the most ancient proteins. We believe that our unique perspective of structure space will open previously undescribed ways of studying proteins, their evolution, and the relationship between their structure and function.

global map of protein universe | protein function prediction | protein structure universe

Investigating protein structure space and its relationship to function space is a fundamental scientific challenge. Characterizing this relationship may also carry practical implications for protein function prediction, whereby one wishes to infer the biological role of a protein from its structure [as is the case with many of the structures solved in the high-throughput pipeline of the Structural Genomics projects (1, 2)]. One way to approach this challenge is to represent protein structure space by three-dimensional maps. Maps of structure space were first introduced by Holm and Sander (3) and were later used by Kim and colleagues (4-6). To calculate their maps, they first calculate the structural similarity between all pairs of protein structures. Then, they use multidimensional scaling (MDS) to find a collection of points in three dimensions, each of which corresponds to a protein, and where the distance between any two points depends on the structural similarity of the proteins they represent. Such a representation provides a comprehensive visual view of structure space, which is not constrained by a hierarchical system such as the Structural Classification of Proteins (SCOP) (7).

We propose an efficient way to calculate maps of protein structure space, using the recently introduced FragBag model (8). Using FragBag, we represent each structure as a point in a high-dimensional space and project these points to three dimensions. It was recently shown that the similarity between the FragBag vectors, or the points in the high-dimensional space, can identify near structural neighbors as accurately as the state-of-the-art structural aligners STRUCTAL and CE, for several definitions of near structural neighbors (8). Because FragBag models structures as fixed-size vectors, we can replace MDS with a more efficient procedure, Principal Component Analysis (PCA) (9). Thus, we can map a very large set of >30,000 protein structures. Rather than studying single structures, we study properties such as structural density and functional diversity, which are defined at each point of structure space through a whole collection of structures in the vicinity of that point. By coloring the maps according to the values of these properties, we are able to visualize their distribution across structure space. This way we discover that structure space has a region of high functional diversity and that this region consists mainly of alpha/beta structures, which are known to be the most ancient proteins (10). We believe that studying such maps holds great promise for revealing important properties of protein structure space, its relation to function, and perhaps even to sequence.

Results

Constructing Functional Diversity Maps of Protein Structure Space. To study protein structure space we analyze a set of 31,155 SCOP v1.71 (7) domains. We initially represent each such domain by a 400-long FragBag vector, which may be thought of as a point in 400-dimensional space. In the FragBag model, a protein structure is represented by a count vector of backbone fragments taken from a library of 400 commonly occurring 12-residue fragments. For each contiguous (and overlapping) 12-residue segment along the protein backbone, we identify the library fragment that fits it best in terms of RMSD after optimal superposition. The ith entry in the FragBag vector is the number of times the library's ith fragment was found to be the best fit. The FragBag distance between two domains is the distance between their FragBag vectors. We have recently shown that this distance is a good approximation of the structural distance, as quantified by structural alignment (8). Using principal component analysis (PCA) (11), we then project the points to three-dimensional space. The eigenvalues of the resulting data covariance matrix (Fig. S1) drop sharply, and the fourth largest eigenvalue (0.0106) is 8% of the largest eigenvalue (0.1326); this indicates that three dimensions can adequately represent the essential features of protein structure space. Fig. 1 B-D shows a three-dimensional map of protein structure space, in which each domain is colored by its SCOP class (7); we show three views of the map from three angles, to get a better sense of it. As expected, the domains cluster by their SCOP class.

The density of protein structure space is uneven, i.e., certain regions have more domains per unit volume than others. This can be seen in Fig. 1 F-H, which shows again the three views of the map, now colored according to the density score of each domain: the number of domains that are within a 0.005 distance

Author contributions: M.O. and R.K. designed research; R.K. performed research; R.K. analyzed data; and M.O. and R.K. wrote the paper.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

1To whom correspondence may be addressed. E-mail: [email protected] or trachel@cs.haifa.ac.il.

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1102727108/-/DCSupplemental.

www.pnas.org/cgi/doi/10.1073/pnas.1102727108 PNAS July 26, 2011 vol. 108 no. 30 12301-12306

from it. Certain proteins are more studied than others, and as a result, more variants thereof are included in our dataset. To rule out this bias as the source of the observed uneven density, we prepared similar density maps, based on 40% and 95% sequence nonredundant subsets of the original data (containing 2,517 and 4,238 domains, respectively). The results, shown in Fig. S2, are qualitatively identical to the original density map, and the correlations between the original density scores and those based on the restricted sets are very high (r = 0.945 and r = 0.960 for the 40% and the 95% sets, respectively). In the remainder of this study, we use the full dataset.
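The density score and the correlation check described above can be sketched in a few lines of Python. This is an illustrative sketch only: the function names `density_scores` and `pearson_r` are hypothetical, the brute-force neighbor scan is our choice, and excluding a domain from its own count is an assumption, since the text does not say whether a domain counts itself.

```python
import math

def density_scores(points, radius=0.005):
    """Density score of each point: number of other points within `radius`.

    Brute-force O(N^2) scan. Excluding the point itself is an assumption.
    """
    return [
        sum(1 for j, q in enumerate(points)
            if j != i and math.dist(p, q) <= radius)
        for i, p in enumerate(points)
    ]

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

# Toy data (not from the paper): three tightly packed points, one isolated.
toy_points = [(0.0, 0.0, 0.0), (0.001, 0.0, 0.0),
              (0.002, 0.0, 0.0), (1.0, 1.0, 1.0)]
toy_density = density_scores(toy_points)
```

For a dataset of tens of thousands of points, a spatial index would be a more practical choice than the quadratic scan shown here.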

Upon inspecting Fig. 1, one can see that there is a relation between the SCOP class and the density score of a domain. Fig. 2A, which is a histogram of the density scores of the domains, color-coded by their SCOP class, shows this more clearly: The alpha+beta (yellow) and the all-beta (red) domains tend to reside in low-density regions, whereas the all-alpha (blue) domains constitute the vast majority in the very high-density regions.

Next, we investigate how functional diversity varies across structure space; for this, we quantify the functional diversity in the vicinity of each domain in our dataset. We consider three definitions for the vicinity of a domain d: (i) Vfn is a fixed number (100) of the nearest structural neighbors of d, (ii) Vsamp is a sample of fixed size (100) from the domains that lie within a fixed structural distance (0.005) from d, and (iii) Vfd is the collection of all domains that are within some fixed structural distance (0.005) from d. Although Vfd is perhaps the most natural definition, it makes the vicinities of domains in denser regions contain far more members, which may bias the results.

Our measure for the functional diversity in a vicinity of a protein, however vicinity is defined, is the number of distinct functions that the domains within this vicinity possess. To determine function, we use the functional annotations of the proteins from the Gene Ontology molecular function (GO-MF) controlled vocabulary (12), and the mapping of terms to SCOP domains calculated by Lopez and Pazos (13). When a single domain is annotated as having more than one function, we include all its functions toward the count.
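The three vicinity definitions and the diversity count can be illustrated with a short Python sketch. The function names are hypothetical, the points are assumed to be the 3D map coordinates, and leaving domain d out of its own vicinity is an assumption that the text does not settle.

```python
import math
import random

def vicinity_fn(points, i, k=100):
    """Vfn: indices of the k nearest neighbors of point i."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return others[:k]

def vicinity_fd(points, i, radius=0.005):
    """Vfd: indices of all points within `radius` of point i."""
    return [j for j in range(len(points))
            if j != i and math.dist(points[i], points[j]) <= radius]

def vicinity_samp(points, i, k=100, radius=0.005, rng=None):
    """Vsamp: a sample of at most k points from Vfd (all of them if fewer)."""
    rng = rng or random.Random(0)  # fixed seed only to keep the sketch repeatable
    within = vicinity_fd(points, i, radius)
    return within if len(within) <= k else rng.sample(within, k)

def functional_diversity(vicinity, annotations):
    """Number of distinct function terms annotating the domains in a vicinity.

    `annotations[j]` is the set of terms of domain j; a multi-function
    domain contributes all of its terms.
    """
    terms = set()
    for j in vicinity:
        terms.update(annotations[j])
    return len(terms)
```

Because Vsamp and Vfn cap the vicinity size at a fixed number, their diversity scores are less sensitive to local density than those based on Vfd.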

Structure Space Has a Core of High Functional Diversity. Fig. 3 B-D shows a functional diversity map of protein structure space. The domains in the map are color-coded according to the functional diversity of their vicinities (red for the most diverse ones; blue for the least diverse), and vicinity is defined to be Vsamp (when there were fewer than 100 domains within this distance, all were included). This map shows a striking pattern: Protein space has a highly diverse core, and diversity drops gradually toward its periphery (we denote the high-diversity region "core" because of its location in our maps). Figs. S3 and S4 show the maps constructed using the two alternative definitions of a vicinity, Vfn and Vfd; the results are very similar.

As a control for the validity of our finding, we re-created the diversity map (using Vsamp again) after randomly permuting the functional annotations across all domains (i.e., the set of functional annotations originally associated with each domain was associated with a different, randomly chosen domain). If our finding were merely an artifact of the projection to three dimensions, or of some feature of protein structure space (say, the uneven density), the resulting diversity map would show again a highly diverse core. Fig. 3 F-H shows that this is not the case: Under the

Fig. 1. Maps of protein structure space. Each point represents a SCOP domain, and the distance between any two points approximates the structural distance between their corresponding domains. B-D show the map of the SCOP classes: As expected, the points are clustered. F-H show the structural density map, where the color of each point indicates the number of domains that lie in its vicinity of fixed distance (denoted Vfd). We see that the highest density is within the regions of the all-alpha domains, followed by a region in the alpha/beta domain and in the all-beta domain. Fig. S2 shows a similar density map when considering sequence nonredundant samples of the protein world.


Fig. 2. Structural density and functional diversity by SCOP class. We calculate the separate histograms of structural density (A) and functional diversity (B) of each of the SCOP classes and stack them one on top of the other. We see that the densest regions are populated by all-alpha domains, and the most functionally diverse regions by the alpha/beta domains. See Table S2 (listing the exact proportions of each of the SCOP classes, among the top 10%/20% most dense/functionally diverse domains) and Fig. S12 for supporting evidence.


random permutation, the diversity score of almost all domains is very high (colored in orange and red), and the map has no prominent diverse core; the relatively few domains with low diversity scores (colored in blue) are mostly isolated domains, having fewer than 100 neighbors within a 0.005 distance, and thus necessarily less diverse vicinities (there are 5,356 such domains). The existence of the diverse core is indeed a statistically significant finding (p < 0.005; see Methods for details). When using Vfn, the results are very similar (Fig. S3); as expected, when using Vfd, diversity is highly correlated (r = 0.953) with density, because domains in denser regions now have more members in their vicinities, and thus more functional annotations (Fig. S4).
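The permutation control can be sketched as follows. This is a minimal illustration rather than the procedure behind the reported p value, which is described in Methods: the helper names are hypothetical, and a plain random shuffle may occasionally leave a domain with its own annotations, which we accept here as a simplification.

```python
import random

def permute_annotations(annotations, seed=0):
    """Reassign the annotation sets to domains at random.

    Structures (the points) stay fixed; only the functions move. A plain
    shuffle can leave a few domains unchanged (a simplifying assumption).
    """
    shuffled = list(annotations)
    random.Random(seed).shuffle(shuffled)
    return shuffled

def diversity(vicinity, annotations):
    """Number of distinct terms among the domains indexed by `vicinity`."""
    return len(set().union(*(annotations[j] for j in vicinity)))
```

Recomputing `diversity` for every vicinity with the permuted annotations gives the "random world" map; because each annotation set is only moved and never altered, the overall pool of functions is identical in both maps.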

We can reliably predict the functional diversity of structures in a randomly chosen test set, using the mapping calculated for a training set. Our test set consists of 250 randomly chosen structures from the sequence nonredundant set (using a 40% sequence identity threshold); it has 52, 40, 92, 52, and 14 domains of the SCOP classes all-alpha, all-beta, alpha/beta, alpha+beta, and others, respectively. The training set has the 29,014 domains that share no sequence similarity with the test set proteins (BLAST E-value threshold of 10^-3 and sequence identity of 40%). Using PCA of the training set FragBag data, we calculate the projection Ptrain to R^3. For each test set protein p, we calculate Ptrain(p) and identify the structures in p's training-set vicinity. The predicted functional diversity score is the number of unique GO-MF terms within this vicinity. Fig. S5 plots the predicted functional diversity scores vs. the ones calculated using the complete dataset for the three definitions of vicinity, Vsamp, Vfn, and Vfd, and shows that these scores are highly correlated (r > 0.96).
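The train/test procedure can be sketched with NumPy. This is an illustrative sketch under stated assumptions: the function names are hypothetical, the data matrix is assumed to hold one normalized FragBag vector per column, and only the fixed-distance (Vfd-style) vicinity is shown.

```python
import numpy as np

def fit_projection(X_train, k=3):
    """Fit the PCA projection on training data (L x N, one column per domain).

    Returns the k x L matrix of top eigenvectors and the L x 1 training mean.
    """
    mean = X_train.mean(axis=1, keepdims=True)
    Xc = X_train - mean
    cov = Xc @ Xc.T / X_train.shape[1]
    _, vecs = np.linalg.eigh(cov)          # eigenvalues in ascending order
    return vecs[:, ::-1][:, :k].T, mean    # keep the top-k eigenvectors

def project(P, mean, x):
    """Map one L-dimensional vector into the training-set map."""
    return (P @ (x.reshape(-1, 1) - mean)).ravel()

def predicted_diversity(p3, train3, train_terms, radius=0.005):
    """Count distinct terms among training domains within `radius` of p3."""
    dists = np.linalg.norm(train3 - p3.reshape(-1, 1), axis=0)
    terms = set()
    for j in np.flatnonzero(dists <= radius):
        terms |= train_terms[j]
    return len(terms)
```

The test protein is centered with the training mean and projected with the training eigenvectors, so no information from the test set enters the map.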

A potential explanation for the high functional diversity in the core is that the core contains a high proportion of multiple-function domains, compared to the periphery (recall that multiple-function domains contribute all their functions toward the diversity). This is not the case: Fig. S6 shows a functional multiplicity map of structure space, i.e., a map in which each point is colored according to the number of GO-MF annotations of the domain it represents. The high functional diversity core seen in Fig. 3 and Fig. S3 does not overlap with a region of high functional multiplicity. Further, we see the highly diverse core even after reconstructing the functional diversity maps using only domains annotated by only one function (61% of the data); see Fig. S7.

Another potential, yet invalid, explanation for the high functional diversity in the core is related to the uneven degree of detail in the GO-MF vocabulary. The GO is implemented as a hierarchical directed graph, in which the terms are placed at the nodes and the edges direct from the general to the specific. The level of detail in the GO-MF graph is uneven: Some areas are better studied and correspond to subgraphs of the GO-MF graph that have more levels and, ultimately, more functional annotations. In addition, proteins of the same function sometimes have annotations at different levels (14). One could argue that perhaps the proteins that lie in the core happen to have functions that are described in finer detail, and the apparent high diversity of this core is merely an artifact of the uneven level of detail in the GO-MF graph. To demonstrate that this, again, is not the case, we create functional diversity maps based on Watson et al.'s GO-slim controlled vocabulary (14). GO slim is a trimmed variant of GO-MF in which function is defined more broadly, by only 190 terms (out of >7,800); in particular, GO slim targets a level of detail in which neighboring proteins in structure space have similar functions. Fig. S8 shows a map that was constructed similarly to the one in Fig. 3 and Fig. S3, except that the function annotations are replaced by their more general terms in the GO-slim graph. Once again we see the same phenomenon: a diverse core and more homogeneous periphery. Indeed, these alternative scores are highly correlated with the original diversity score (r > 0.895); see Table S1.

We also consider three alternatives to the functional diversity score used above. Two of these alternatives are based on a weighted count of distinct GO-MF terms within a vicinity, rather than on a simple count. In the first, commonly occurring terms have a lower weight, and in the second, more specific terms (i.e., ones that are farther from the root in the GO-MF graph) have a lower weight. In the third alternative, the score is based on the coherence measure proposed in refs. 15 and 16, which quantifies the contribution of a functional annotation term to a vicinity based on statistical tests. When using vicinity definitions Vsamp and Vfn, these alternative scores are correlated with the original diversity score (r > 0.79); see Table S1. Indeed, the functional diversity maps under each of the three alternative scores, shown in Figs. S9-S11, look similar to the one in Fig. 3.
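The two weighted counts can be illustrated in Python. This sketch rests on assumptions: the function names are hypothetical, the frequency weight is read from Methods as 1 - 10 × (fraction of domains annotated by the term), and the depth weight as the inverse of the term's depth in the GO-MF graph; the coherence-based score of refs. 15 and 16 is not covered.

```python
from collections import Counter

def frequency_weights(annotations, scale=10.0):
    """Weight per term: 1 - scale * (fraction of domains annotated by it).

    Common terms get lower weights. `scale=10` follows our reading of the
    Methods formula, which is an assumption.
    """
    n = len(annotations)
    counts = Counter(t for terms in annotations for t in set(terms))
    return {t: 1.0 - scale * c / n for t, c in counts.items()}

def depth_weights(depths):
    """Weight per term: inverse of its depth (steps to the root, must be > 0)."""
    return {t: 1.0 / d for t, d in depths.items()}

def weighted_diversity(vicinity, annotations, weights):
    """Sum of weights over the distinct terms found in a vicinity."""
    terms = set()
    for j in vicinity:
        terms.update(annotations[j])
    return sum(weights[t] for t in terms)
```

With all weights set to 1, `weighted_diversity` reduces to the simple count of distinct terms used in the main analysis.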

Characterizing the Core's Structures. A comparison of the functional diversity maps (Fig. 3 B-D) and the SCOP-class maps

Fig. 3. Functional-diversity map of protein structure space. The color of a point indicates the degree of functional diversity measured by the number of distinct GO-MF terms annotating the domains in its vicinity. Here, we use the Vsamp definition for a vicinity of a protein: a sample of fixed size from all domains that fall within a fixed distance from it. A-D show the functional diversity for the true data; E-H show the functional diversity of a random world, in which the proteins have the same structures, yet their functions are assigned at random. We see that when using the true functional annotations, there is a core of high functional diversity, and that functional diversity drops toward the periphery. Alternatively, when the functions are assigned at random, there is no such core, and functional diversity is uniformly high. The figures in SI Appendix, and Table S1, show that the results are qualitatively similar when using alternative datasets, scoring functions, and the more uniform (coarser) annotation graph GO slim.


(Fig. 1 B-D) reveals that the core of high functional diversity consists mainly of alpha/beta domains (colored in green). Fig. 2B and Fig. S13 A and B show this finding in another way, via histograms detailing the contribution of each SCOP class to the functional diversity scores. Table S2 lists the exact proportions of the various SCOP classes among the domains with the top 10% and 20% functional diversity scores; in all cases (including when considering the diversity scores only within the sequence nonredundant sets), the majority of the high functional diversity domains are alpha/beta proteins.

Fig. 4 highlights several SCOP folds that lie in the diverse core of structure space. The full dataset is shown in Fig. 4A with a black outline; Fig. 4 B-F show specific SCOP folds within this outline, alongside the histograms of their functional diversity scores. The most obvious candidate for the SCOP fold whose structures lie in the core is the TIM barrels (c.1), which are well known to accommodate many functions (2). Indeed, these lie in the core, and their functional diversity scores are clearly higher compared with the full dataset (Fig. 4B). We see, however, that the TIM barrels are only a part of the picture, as the core also contains many other domains. Fig. 4C shows the SCOP fold adenine nucleotide alpha hydrolase-like (c.26), which was also noted as accommodating many functions (1) and also lies within the core.

To identify more SCOP folds in the core, we search for folds with (more than 25) domains that lie in functionally diverse vicinities. We quantify the diversity of a SCOP fold by the average and the median of the diversity scores of its domains, using the diversity scores based on the three definitions of vicinity. Table S3 lists the 20 most diverse folds under these measures: Each of the resulting six measures identifies different SCOP folds as the most diverse. To identify SCOP folds that are truly diverse, we consider folds that are among the 20 most diverse folds under all six measures. Nine folds satisfy this condition: 7-stranded beta/alpha barrel (c.6), ClpP/crotonase (c.14), methylglyoxal synthase-like (c.24), arginase/deacetylase (c.42), phosphorylase/hydrolase-like (c.56), alpha/beta-hydrolases (c.69), AraD-like aldolase/epimerase (c.74), amidase signature enzymes (c.117), protein kinase-like (PK-like) (d.144). As expected, the domains in these folds are indeed located in the core; Fig. 4 D-F shows three examples.
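The consensus selection over six measures (mean and median, for each of three vicinity definitions) can be sketched as below. The function names and data layout are hypothetical, and ties in the ranking are broken arbitrarily, which the text does not specify.

```python
from statistics import mean, median

def fold_scores(fold_of, scores, stat, min_domains=25):
    """Summarize per-domain diversity scores per fold with `stat`.

    Only folds with more than `min_domains` domains are kept.
    """
    by_fold = {}
    for fold, score in zip(fold_of, scores):
        by_fold.setdefault(fold, []).append(score)
    return {f: stat(v) for f, v in by_fold.items() if len(v) > min_domains}

def consensus_top_folds(fold_of, scores_by_vicinity, top=20, min_domains=25):
    """Folds ranked among the `top` most diverse under every measure.

    `scores_by_vicinity` maps a vicinity name (e.g. "Vfn") to the list of
    per-domain diversity scores computed with that vicinity.
    """
    consensus = None
    for scores in scores_by_vicinity.values():
        for stat in (mean, median):
            ranked = fold_scores(fold_of, scores, stat, min_domains)
            best = set(sorted(ranked, key=ranked.get, reverse=True)[:top])
            consensus = best if consensus is None else consensus & best
    return consensus or set()
```

Taking the intersection rather than the union makes the selection conservative: a fold must rank highly regardless of how vicinity is defined and of whether the mean or the median is used.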

Better Predicting of Function from Structure in Regions of Low Functional Diversity. We use the set of 90 proteins* studied by Watson et al. (14) to assess if one can indeed better predict function for proteins in regions of structure space having low functional diversity. Watson et al. predicted function using global structural similarity [as detected by secondary-structure matching (SSM) (17)] and evaluated the correctness of their predictions. Fig. S13 maps the protein structures used in their experiment: on the right, these structures within our dataset, and on the left, the same structures with markers indicating if the prediction was correct. We see that Watson et al. better succeed in predicting the function of proteins that lie in regions of low functional diversity.

Fig. 4. SCOP folds that lie in the functionally diverse core. We highlight the location in structure space of specific SCOP folds and show histograms of the diversity of the domains of these folds; for comparison, A shows the full dataset (a copy of Fig. 3 A and B) outlined in black. B and C show two SCOP folds that are known to be functionally diverse, the TIM barrel fold (c.1) and the adenine nucleotide alpha hydrolase-like fold (c.26). Indeed, the domains of these two folds are located in the highly diverse core of structure space. There are, however, many other domains in the core. D-F show three more examples of SCOP folds that lie in the highly diverse core: phosphorylase/hydrolase-like (c.56), alpha/beta-hydrolases (c.69), and protein kinase-like (PK-like, d.144), respectively. Table S3 lists the mean and median functional diversity scores for several SCOP folds that lie in the core.

*Denoted the known-function dataset; 1nrh and 1tea were removed because they are obsolete.


We quantify this difference by separating the proteins into two sets, according to their functional diversity, and comparing the success rate in these sets. The first set consists of 35 proteins having high diversity (45) vicinities, and the second consists of 55 proteins having low diversity (



This is a generalization of a call for caution recently made with respect to function prediction for TIM barrels (24). Indeed, our analysis of Watson et al.'s data (14) shows that they were more successful in predicting function from structure for proteins lying in less diverse regions of structure space. Thus, it seems that one could use the functional diversity maps to better choose the parameters of structure-based function prediction, according to the location of the target protein in structure space, and perhaps even to assign confidence levels for the prediction.

Materials and Methods

Representing Protein Domains in 400-Dimensional Space. For each domain in the dataset, we calculate FragBag (8) description vectors of length L = 400 based on a library of 400 12-mer fragments (http://cs.haifa.ac.il/~ibudowsk/libraries/centers400_12.txt); each entry in the vector is the number of times the corresponding library fragment was the best approximation of any of the 12-mer fragments in the backbone of the represented protein. A list of one or more GO-MF annotations is associated with each domain. Our dataset includes N = 31,155 SCOP v1.71 (7) domains for which Lopez and Pazos (13) provide a GO annotation. We have previously shown that the cosine distance between two FragBag vectors best approximates the structural alignment score (SAS) (25) between their corresponding structures (8). Notice that the squared Euclidean (norm 2) distance between two FragBag vectors that were normalized to length 1 is exactly twice their cosine distance. To see this, consider two FragBag vectors $p_1$ and $p_2$, and let $\hat{p}_1 = p_1/\|p_1\|$ and $\hat{p}_2 = p_2/\|p_2\|$ be the normalized vectors. The cosine distance between $p_1$ and $p_2$ is $1 - \cos(p_1, p_2) = 1 - \hat{p}_1^T \hat{p}_2$; the squared Euclidean distance between the normalized vectors is

\[
(\hat{p}_1 - \hat{p}_2)^T (\hat{p}_1 - \hat{p}_2) = \hat{p}_1^T \hat{p}_1 + \hat{p}_2^T \hat{p}_2 - 2\hat{p}_1^T \hat{p}_2 = 2 - 2\hat{p}_1^T \hat{p}_2 = 2(1 - \hat{p}_1^T \hat{p}_2).
\]

Thus, we normalize all FragBag vectors and consider the Euclidean (norm 2) distance; because all distances are relative, the uniform factor 2 is of no consequence.
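The count-vector representation and the relation between the two distances can be checked numerically with a short Python sketch. The function names are hypothetical, and the step that finds the best-fitting library fragment for each backbone segment (RMSD after superposition) is assumed to have been done already, so the input is simply a list of best-fit fragment indices.

```python
import math
from collections import Counter

def fragbag_vector(best_fit_indices, library_size=400):
    """Count vector: entry i is how often library fragment i was the best fit."""
    counts = Counter(best_fit_indices)
    return [counts.get(i, 0) for i in range(library_size)]

def normalize(v):
    """Scale a nonzero vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_distance(u, v):
    """1 - cos(u, v) for nonzero vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

def squared_euclidean(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

# Toy example with an 8-fragment library (the paper uses 400 fragments).
p1 = fragbag_vector([0, 0, 3, 7], library_size=8)
p2 = fragbag_vector([0, 3, 3, 5], library_size=8)
gap = squared_euclidean(normalize(p1), normalize(p2)) - 2 * cosine_distance(p1, p2)
```

Up to floating-point rounding, `gap` is zero, which is the identity derived above; since the square root is monotonic, ranking neighbors by Euclidean distance between normalized vectors gives the same order as ranking by cosine distance.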

Projecting to Three Dimensions. We store the normalized descriptions of length $L = 400$ of the $N$ structures in our dataset in an $L \times N$ matrix and project it to three dimensions using PCA. Namely, after centering the $L \times N$ coordinates about the origin (by subtracting their mean), we calculate the $L \times L$ covariance matrix (normalized by $N$) and find the eigenvectors corresponding to its three largest eigenvalues. By multiplying these eigenvectors (a $3 \times L$ matrix) by the $L \times N$ data matrix, we find the $3 \times N$ matrix that is the projection of our data to three dimensions. There, the Euclidean (norm 2) distance between two 3D vectors is an approximation of their Euclidean (norm 2) distance in $L$ dimensions. We emphasize that this requires only the easy computation of finding the top three eigenvalues and eigenvectors of the relatively small $L \times L$ matrix. This is in contrast to the slightly different calculation done in previous studies: Given $N$ structures, they calculate a symmetric matrix $D$ of size $N \times N$ of all pairwise structural distances and use MDS to find the coordinates of the points representing these $N$ structures in three (or two) dimensions (3, 5). The technical bottleneck in the MDS calculation is finding the top three (or two) eigenvalues and eigenvectors of an $N \times N$ matrix derived from $D$ (26); it is a challenging computation for datasets of several tens of thousands of proteins. Indeed, the datasets in previous studies were smaller (e.g., fewer than 1,900 structures in ref. 6).
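The projection described above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function name `pca_project` is hypothetical, the data are random stand-ins for normalized FragBag vectors, and the centered matrix is the one multiplied by the eigenvectors.

```python
import numpy as np

def pca_project(X, n_components=3):
    """Project an L x N data matrix (one descriptor per column) to n_components rows."""
    L, N = X.shape
    centered = X - X.mean(axis=1, keepdims=True)     # center each coordinate
    cov = centered @ centered.T / N                  # L x L covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)           # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components] # indices of the largest ones
    components = eigvecs[:, order].T                 # n_components x L
    return components @ centered, eigvals[order]     # n_components x N projection

rng = np.random.default_rng(0)
# Hypothetical data: 500 random descriptors of length 400, normalized to length 1.
X = rng.random((400, 500))
X /= np.linalg.norm(X, axis=0, keepdims=True)

Y, top_eigvals = pca_project(X)
```

The only eigenproblem solved here is for the 400 x 400 covariance matrix, so the cost of that step does not grow with the number of structures; only the two matrix products do, and they scale linearly in N.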

Calculating Alternative Functional Diversity Scores. Each of the domains in the dataset has a list of its GO-MF terms; in each case, the terms are the most specific ones (rather than the term and all its parents). For each term, we calculate its weighted functional diversity in two ways: (i) $1 - 10 \times$ (the fraction of its occurrence), where the fraction of its occurrence is the fraction of domains that are annotated by it; the scaling factor was set to 10 to better space the range of values in the dataset. (ii) The inverse of the depth of the term in the GO-MF annotation graph; the depth is the number of times we can replace the term by a more general one until we reach the root. There are seven cases (out of 9,500) in which a term has two different depths, and these differ by at most three (this is a consequence of GO being a graph rather than a tree). In these cases, we use the average depth. To calculate the coherence measure, we check for each term and vicinity whether the term is enriched in the vicinity, i.e., whether it appears at a rate that is statistically significant. The coherence is the percentage of the terms in a region that are enriched. Thus, the coherence measure is a value between 0 and 100%, and high coherence implies low diversity and vice versa; see ref. 16 for more details. Finally, the GO-slim annotation of a functional term is the most specific parent(s) of the term that is present in the GO-slim annotation graph.
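The two term weightings described above can be illustrated with a small Python sketch. It assumes that the weighted diversity of a vicinity is the sum of the weights of the distinct terms found there; the function names, the term labels (T1, U3, and so on), and the toy depths are all hypothetical.

```python
from collections import Counter

def frequency_weights(annotations, scale=10.0):
    """Weight (i): 1 - scale * (fraction of domains annotated by the term)."""
    n_domains = len(annotations)
    counts = Counter(t for terms in annotations.values() for t in set(terms))
    return {t: 1.0 - scale * c / n_domains for t, c in counts.items()}

def depth_weights(depths):
    """Weight (ii): inverse depth; the average is used when a term has several depths."""
    return {t: 1.0 / (sum(d) / len(d)) for t, d in depths.items()}

def weighted_diversity(vicinity, annotations, weights):
    """Sum the weights of the distinct terms annotating the domains in a vicinity."""
    distinct = {t for domain in vicinity for t in annotations[domain]}
    return sum(weights[t] for t in distinct)

# Toy dataset of 20 domains with made-up term labels.
annotations = {"d0": ["T1"], "d1": ["T2"], "d2": ["T2", "T3"]}
annotations.update({f"d{i}": [f"U{i}"] for i in range(3, 20)})
# Made-up depths; T2 is reachable from the root along two paths.
depths = {"T1": [3], "T2": [2, 4], "T3": [5]}

freq_w = frequency_weights(annotations)
depth_w = depth_weights(depths)
vicinity = ["d0", "d1", "d2"]
score_by_frequency = weighted_diversity(vicinity, annotations, freq_w)
score_by_depth = weighted_diversity(vicinity, annotations, depth_w)
```

In this toy data the term T2 annotates 2 of 20 domains, so its frequency weight is 1 - 10 x 0.1 = 0; a term that annotates more than 10% of the domains would receive a negative weight under weighting (i).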

Measuring the Spatial Spread of the Core in True and Random Associations of Functional Annotations to Structures. We measure the spatial spread of the most diverse domains by their average distance from their center of mass. We consider two definitions of the most diverse proteins: (i) all domains whose diversity scores are greater than $0.8 \times$ max_diversity, where max_diversity is the highest diversity score found in our dataset; (ii) the 20% most functionally diverse proteins. We measure the spatial spread of the most diverse domains in our dataset, and in 300 random assignments of the functional annotations to locations in structure space. The average distance in the true dataset for these two definitions is 0.0860 and 0.1131, respectively. In the random permutations, the average distances are 0.3501 ± 0.0151 and 0.3499 ± 0.0220, respectively, resulting in a p value < 0.0033.
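The comparison against random assignments can be sketched as a permutation test. The sketch below is illustrative only: it implements definition (ii) (the top 20% of scores), uses synthetic coordinates and scores rather than the paper's data, and all function names are hypothetical. When none of the 300 permutations is as compact as the observed set, the empirical p value is below 1/300, roughly 0.0033.

```python
import numpy as np

def spread(coords, mask):
    """Average distance of the selected points from their center of mass."""
    pts = coords[mask]
    return np.linalg.norm(pts - pts.mean(axis=0), axis=1).mean()

def top_mask(scores, top_fraction):
    """Boolean mask selecting the top_fraction highest-scoring points."""
    return scores >= np.quantile(scores, 1.0 - top_fraction)

def permutation_spread_test(coords, scores, top_fraction=0.2, n_perm=300, seed=0):
    """Compare the observed spread with spreads under shuffled score assignments."""
    rng = np.random.default_rng(seed)
    observed = spread(coords, top_mask(scores, top_fraction))
    null = np.array([
        spread(coords, top_mask(rng.permutation(scores), top_fraction))
        for _ in range(n_perm)
    ])
    n_as_compact = int(np.sum(null <= observed))  # permutations at least as compact
    return observed, null, n_as_compact

# Synthetic example: points in a cube, with higher scores nearer the center.
rng = np.random.default_rng(1)
coords = rng.uniform(-0.5, 0.5, size=(2000, 3))
scores = -np.linalg.norm(coords, axis=1)

observed, null, n_as_compact = permutation_spread_test(coords, scores)
```

Shuffling the scores keeps both the map and the score distribution fixed and breaks only the association between location and score, which is the association being tested.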

ACKNOWLEDGMENTS. We thank Yuval Nov, Golan Yona, Chen Keisar, and our anonymous reviewers for their helpful comments. R.K. was supported by the Marie Curie IRG Grant 224774.

1. Redfern OC, Dessailly B, Orengo CA (2008) Exploring the structure and function paradigm. Curr Opin Struct Biol 18:394–402.
2. Friedberg I (2006) Automated protein function prediction - the genomic challenge. Brief Bioinform 7:225–242.
3. Holm L, Sander C (1996) Mapping the protein universe. Science 273:595–603.
4. Hou J, Jun SR, Zhang C, Kim SH (2005) Global mapping of the protein structure space and application in structure-based inference of protein function. Proc Natl Acad Sci USA 102:3651–3656.
5. Hou J, Sims GE, Zhang C, Kim SH (2003) A global representation of the protein fold space. Proc Natl Acad Sci USA 100:2386–2390.
6. Choi IG, Kim SH (2006) Evolution of protein structural classes and protein sequence families. Proc Natl Acad Sci USA 103:14056–14061.
7. Murzin AG, Brenner SE, Hubbard T, Chothia C (1995) SCOP: A structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 247:536–540.
8. Budowski-Tal I, Nov Y, Kolodny R (2010) FragBag, an accurate representation of protein structure, retrieves structural neighbors from the entire PDB quickly and accurately. Proc Natl Acad Sci USA 107:3481–3486.
9. Tenenbaum JB, de Silva V, Langford JC (2000) A global geometric framework for nonlinear dimensionality reduction. Science 290:2319–2323.
10. Winstanley HF, Abeln S, Deane CM (2005) How old is your fold? Bioinformatics 21:i449–i458.
11. Jolliffe IT (2002) Principal Component Analysis (Springer, New York).
12. Harris MA, et al. (2004) The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res 32:D258–261.
13. Lopez D, Pazos F (2009) Gene Ontology functional annotations at the structural domain level. Proteins 76:598–607.
14. Watson JD, et al. (2007) Towards fully automated structure-based function prediction in structural genomics: A case study. J Mol Biol 367:1511–1522.
15. Segal E, et al. (2003) Module networks: Identifying regulatory modules and their condition-specific regulators from gene expression data. Nat Genet 34:166–176.
16. Slonim N, Atwal GS, Tkacik G, Bialek W (2005) Information-based clustering. Proc Natl Acad Sci USA 102:18297–18302.
17. Krissinel E, Henrick K (2004) Secondary-structure matching (SSM), a new tool for fast protein structure alignment in three dimensions. Acta Crystallogr D Biol Crystallogr 60:2256–2268.
18. Kolodny R, Petrey D, Honig B (2006) Protein structure comparison: implications for the nature of fold space, and structure and function prediction. Curr Opin Struct Biol 16:393–398.
19. Petrey D, Honig B (2009) Is protein classification necessary? Toward alternative approaches to function annotation. Curr Opin Struct Biol 19:363–368.
20. Sims GE, Choi IG, Kim SH (2005) Protein conformational space in higher order phi-Psi maps. Proc Natl Acad Sci USA 102:618–621.
21. Kolodny R, Koehl P, Levitt M (2005) Comprehensive evaluation of protein structure alignment methods: Scoring by geometric measures. J Mol Biol 346:1173–1188.
22. Melvin I, Weston J, Noble WS, Leslie C (2011) Detecting remote evolutionary relationships among proteins by large-scale semantic embedding. PLoS Comput Biol 7:e1001047.
23. Levitt M, Gerstein M (1998) A unified statistical framework for sequence comparison and structure comparison. Proc Natl Acad Sci USA 95:5913–5920.
24. Loewenstein Y, et al. (2009) Protein function annotation by homology-based inference. Genome Biol 10:207.
25. Subbiah S, Laurents DV, Levitt M (1993) Structural similarity of DNA-binding domains of bacteriophage repressors and the globin core. Curr Biol 3:141–148.
26. de Silva V, Tenenbaum JB (2004) Sparse multidimensional scaling using landmark points. (Stanford Univ, Stanford, CA).


Supplementary Material

Figure 1S: Scree plot of the 400-dimensional data. The figure shows the 20 largest eigenvalues of the (normalized) correlation matrix sorted in decreasing order; the inset shows the largest 200 eigenvalues of this matrix. The sharp drop up to the third eigenvalue suggests that three dimensions can adequately represent the essential features of protein structure space.



Figure 2S: Density maps of protein structure space for sequence non-redundant subsets. The points on the map are colored according to the number of domains that lie within 0.005 distance from them in the dataset considered. In the left column, the map is of a subset of size 4,238, in which the sequence identity between any two proteins is at most 95%; in the right column, the map is of a subset of size 2,517, in which the sequence identity is at most 40%.



Figure 3S: Functional-diversity maps of protein structure space with vicinity defined as Vfn: The points on the map are colored according to their functional diversity measured by the number of distinct GO-MF terms annotating the domains in the Vfn vicinity of 100 nearest neighbors. In panels (a-



Figure 4S: Functional-diversity maps of protein structure space with vicinity defined as Vfd: The points on the map are colored according to their functional diversity measured by the number of distinct GO-MF terms annotating the domains in the Vfd vicinity of distance 0.005. In panels (a-d) we



Figure 5S: The predicted functional diversity score vs. the functional diversity score calculated using the full dataset, for a test set of 250 randomly chosen structures: We consider the three definitions of local vicinities: Vsamp, Vfn, and Vfd. We calculate the projection to three dimensions based on a set that does not include the 250 test set proteins and their sequence homologues. The predicted functional diversity of a test set protein is the number of unique GO-MF terms in the vicinity of the location calculated for the structure using that projection to a lower dimension. In all three cases, the agreement between the predicted functional diversity scores and the functional diversity scores calculated using the full dataset is very good.



Figure 6S: Functional multiplicity map of protein structure space. Each protein structure is color-coded by the number of its GO-MF functional annotations. The number of annotations is at most 7, and in the vast majority of the cases (99.4%) it is less than three (colored by shades of blue); the top panel shows a bar diagram of the number of annotations. Below, there are three views of structure space. The high functional diversity core seen in Figure 2 in the main document is not due to the small set of proteins that are annotated by unusually many functions (see Figure 7S below for additional evidence).



Figure 7S: Functional diversity map of protein structure space using only proteins annotated by one function: We restrict our attention to domains annotated by one GO term (61.04% of the full dataset). The correlation coefficients between the scores calculated using only this subset and the scores calculated using the full dataset are listed in Table 1S below. We see the high diversity core in this dataset as well, meaning it cannot be explained away as a consequence of multiply-



Figure 8S: Functional diversity maps of protein structure space using the GO-Slim annotation: Here we use a more restricted ontology for the annotation of function, and replace each annotation by its most specific parent in the GO-slim graph (1). The maps in the left column use the definition of vicinity Vsamp, and those in the right column use Vfn. The correlation coefficients between these measures and the straightforward count of distinct GO terms are listed in Table 1S below.



Figure 9S: Functional diversity maps using alternative scoring functions I. The functional-diversity score used to construct these maps is the weighted sum of distinct terms within a vicinity of a domain; the weight of a term depends on how common it is in the dataset, with more common terms contributing less. The maps in the left column use the definition of vicinity Vsamp, and those in the right column use Vfn. The correlation coefficients between these measures and the straightforward count of distinct GO-



Figure 10S: Functional diversity maps using alternative scoring functions II. The functional-diversity score used to construct these maps is the weighted sum of distinct terms within a vicinity of a domain; the weight of a term is its specificity in the GO annotation graph, with more specific terms contributing less (the inverse of its distance from the root). The maps in the left column use the definition of vicinity Vsamp, and those in the right column use Vfn. The correlation coefficients between these measures and the straightforward count of distinct GO terms are listed in Table 1S below. Here too, we see our main finding: a highly diverse core surrounded by less diverse regions.



Figure 11S: Functional diversity maps using alternative scoring function III. The functional-diversity score used to construct these maps is the 'coherence measure' of the functional GO-MF terms in the neighborhood of a protein, as suggested in (4, 5). Thus, this score ranges between 0 and 100%, and higher values imply proteins centered at regions that are less functionally diverse (since all their terms are unique to that region; see Methods for details). The maps in the left column use the definition of vicinity Vsamp, and those in the right column use Vfn. The correlation coefficients between these measures and the straightforward count of distinct GO terms are listed in Table



Figure 12S: Functional density scores when sampling the full dataset based on sequence. The vicinity of a protein is Vfd, and the diversity score is the number of distinct GO-MF terms in the annotations of the proteins in the vicinity. The correlation coefficients between this diversity score and the straightforward one using the full dataset are listed in Table 1S below. Even when using this very sparsely sampled subset, we see the same core of high functional diversity.



Figure 13S: Functional diversity by SCOP class with vicinity Vfn and Vfd: We calculate the separate histograms of functional diversity for each of the SCOP classes, and stack them one on top of the other. Table 2S lists the exact proportions of each of the SCOP classes among the top 10%/20% most dense/functionally diverse domains. This supports Figure 3 in that the most functionally diverse regions are populated by the alpha/beta domains.



Figure 14S: A map of Watson et al. (1) "known-function" dataset. The right panel shows the functional diversity of our complete dataset, and the 90 proteins in Watson et al.'s (14) set are shown as black dots. The left panel shows the same structures, with markers indicating whether their function was correctly predicted using global structural similarity (triangles) or not (circles). We see that function prediction for proteins in the periphery of structure space tends to be more accurate than for proteins in the core. This statement is quantified in the Results section.



Table 1S: Pearson correlation coefficients between the functional diversity score¹ on the full dataset and alternative scores/datasets

| Vicinity definition \ Alternative set | Using only domains annotated with one function | Functional diversity calculated using GO-Slim | Weighted score by 1 − 10 × (fraction of term occurrence) | Weighted score by 1/distance from root |
|---|---|---|---|---|
| Shown in | Figure 7S | Figure 8S | Figure 9S | Figure 10S |
| Vsamp | 0.8779 | 0.8952 | 0.9394 | 0.9264 |
| Vfn | 0.8860 | 0.9299 | 0.9977 | 0.9885 |
| Vfd | 0.9747 | 0.9567 | 0.9997 | 0.9983 |

| Vicinity definition \ Alternative set | Coherence as a functional (non)diversity score | Density function | Sampling using 95% sequence non-redundant | Sampling using 40% sequence non-redundant |
|---|---|---|---|---|
| Shown in | Figure 11S | Figure 1 | Figure 12S | Figure 12S |
| Vsamp | -0.7934 | 0.4306 | 0.6141 | 0.6468 |
| Vfn | -0.8117 | 0.0926 | 0.2969 | 0.3423 |
| Vfd | -0.3030 | 0.7320 | 0.8641 | 0.8881 |

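Table 2S: SCOP class composition of the most functionally diverse domains

| Data set | Vicinity definition | Within top | % all alpha | % all beta | % alpha/beta | % alpha+beta | % others |
|---|---|---|---|---|---|---|---|
| Full | Vsamp | 10% | 1.99 | 0.10 | 78.06 | 15.19 | 4.65 |
| Full | Vsamp | 20% | 2.93 | 0.20 | 76.45 | 15.55 | 4.87 |
| Full | Vfn | 10% | 2.89 | 1.04 | 72.35 | 19.62 | 4.10 |
| Full | Vfn | 20% | 3.56 | 1.35 | 70.52 | 19.84 | 4.73 |
| Full | Vfd | 10% | 1.52 | 0.13 | 81.84 | 12.15 | 4.36 |
| Full | Vfd | 20% | 4.18 | 0.15 | 74.93 | 15.73 | 5.02 |
| NR (95%) | Vfd | 10% | 4.16 | 0 | 79.95 | 10.76 | 5.13 |
| NR (95%) | Vfd | 20% | 19.01 | 0.12 | 63.40 | 11.92 | 5.55 |
| NR (40%) | Vfd | 10% | 8.54 | 0 | 75.61 | 9.76 | 6.10 |
| NR (40%) | Vfd | 20% | 20.98 | 0.20 | 60.49 | 11.81 | 6.52 |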

¹ The functional diversity score of a protein domain is the number of distinct GO-MF terms annotating the set of domains that are in the vicinity of that domain.
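The score defined in this footnote can be sketched for two of the vicinity definitions used in the figure captions: a fixed number of nearest neighbors (Vfn) and a fixed distance (Vfd). The sketch below is a hedged illustration rather than the authors' code. It assumes that a domain belongs to its own vicinity, uses a tiny hand-made dataset with made-up term labels, and the name `diversity_scores` is hypothetical.

```python
import numpy as np

def diversity_scores(coords, annotations, k=None, radius=None):
    """Count distinct terms in each point's vicinity.

    coords: N x 3 array of map coordinates.
    annotations: list of N sets of terms, one set per domain.
    Exactly one of k (nearest-neighbor vicinity) or radius (distance vicinity).
    Assumption: each domain is counted as part of its own vicinity.
    """
    if (k is None) == (radius is None):
        raise ValueError("specify exactly one of k or radius")
    # Dense N x N distance matrix; adequate for a small illustration only.
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    scores = np.empty(len(coords), dtype=int)
    for i in range(len(coords)):
        if k is not None:
            members = np.argsort(dist[i])[:k]          # k closest, including i
        else:
            members = np.flatnonzero(dist[i] <= radius)
        scores[i] = len(set().union(*(annotations[j] for j in members)))
    return scores

# Tiny made-up dataset: three nearby domains and one distant domain.
coords = np.array([[0.0, 0.0, 0.0], [0.001, 0.0, 0.0], [0.002, 0.0, 0.0], [1.0, 0.0, 0.0]])
annotations = [{"a"}, {"b"}, {"a", "c"}, {"d"}]

scores_fd = diversity_scores(coords, annotations, radius=0.005)
scores_fn = diversity_scores(coords, annotations, k=2)
```

For a dataset of tens of thousands of domains, the dense distance matrix would be replaced by a spatial index; that choice is outside the scope of this sketch.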



Table 3S: SCOP folds that lie in the functionally diverse core

| Vfd top 20 means: SCOP fold | Mean score | Vfd top 20 medians: SCOP fold | Median score | Vfn top 20 means: SCOP fold | Mean score | Vfn top 20 medians: SCOP fold | Median score | Vsamp top 20 means: SCOP fold | Mean score | Vsamp top 20 medians: SCOP fold | Median score |
|---|---|---|---|---|---|---|---|---|---|---|---|
| c.117 | 152.5 | c.117 | 155.5 | c.24 | 45.2 | c.88 | 46 | c.117 | 73.7 | c.117 | 73 |
| c.42 | 148.5 | c.74 | 151.5 | c.88 | 44.8 | c.24 | 46 | c.42 | 71.2 | c.74 | 72.5 |
| c.14 | 133.8 | c.42 | 151 | c.6 | 43.9 | c.74 | 44 | c.74 | 69.9 | c.42 | 72 |
| c.74 | 132.8 | c.14 | 150 | c.69 | 43.6 | c.69 | 44 | c.67 | 68.2 | c.67 | 69 |
| c.6 | 128.9 | c.24 | 135 | d.165 | 42.6 | c.6 | 44 | c.24 | 68.0 | c.24 | 69 |
| c.24 | 128.1 | c.6 | 133 | c.56 | 42.3 | a.137 | 43.5 | c.6 | 67.4 | d.95 | 68 |
| c.67 | 121.8 | c.93 | 131 | c.66 | 41.7 | c.56 | 43 | c.14 | 67.1 | c.36 | 68 |
| c.93 | 118.1 | d.95 | 122 | c.74 | 41.6 | d.165 | 42 | c.36 | 66.6 | c.14 | 68 |
| c.36 | 110.6 | c.67 | 121 | c.117 | 41.5 | c.93 | 42 | d.174 | 66.1 | c.6 | 68 |
| d.174 | 107.8 | d.96 | 115.5 | c.41 | 41.5 | c.66 | 42 | e.26 | 65.8 | c.69 | 67 |
| c.1 | 107.4 | c.36 | 114 | c.42 | 41.3 | c.41 | 42 | c.93 | 65.7 | e.26 | 66 |
| d.95 | 107.3 | c.1 | 113 | c.14 | 41.2 | c.23 | 42 | c.69 | 65.3 | d.174 | 66 |
| c.60 | 105.8 | d.174 | 112 | c.23 | 41.1 | c.14 | 42 | c.79 | 64.7 | c.93 | 66 |
| c.79 | 103.8 | c.69 | 111 | c.26 | 40.9 | d.144 | 41 | c.60 | 64.7 | c.56 | 66 |
| c.69 | 103.6 | c.60 | 111 | c.45 | 40.7 | c.117 | 41 | c.1 | 64.6 | c.1 | 66 |
| c.56 | 100.1 | c.56 | 106 | a.137 | 40.7 | c.61 | 41 | c.7 | 64.4 | d.144 | 65 |
| c.7 | 99.2 | c.80 | 102 | c.53 | 40.4 | c.53 | 41 | c.39 | 64.3 | d.96 | 65 |
| d.96 | 97.5 | c.79 | 101 | c.60 | 40.4 | c.45 | 41 | c.56 | 63.7 | c.79 | 65 |
| d.144 | 97.0 | c.7 | 101 | d.144 | 40.3 | c.42 | 41 | d.144 | 63.1 | c.60 | 65 |
| c.39 | 95.8 | d.144 | 100 | c.61 | 40.2 | c.26 | 41 | c.72 | 62.4 | c.7 | 65 |

