Maps of protein structure space reveal a fundamental relationship between protein structure and function

Embed Size (px)

Citation preview

  • 8/3/2019 Maps of protein structure space reveal a fundamental relationship between protein structure and function

    1/23

    Maps of protein structure space reveal a fundamentalrelationship between protein structure and functionMargarita Osadchy and Rachel Kolodny1

    Department of Computer Science, University of Haifa, Mount Carmel, Haifa 31905, Israel

    Edited by Sung-Hou Kim, University of California, Berkeley, CA, and approved June 7, 2011 (received for review February 20, 2011)

    To study the protein structurefunction relationship, we propose a

    method to efficiently create three-dimensional maps of structure

    space using a very large dataset of >30,000 Structural Classification

    of Proteins (SCOP) domains. In our maps, each domain is repre-

    sented by a point, and the distance between any two points

    approximates the structural distance between their corresponding

    domains. We use these maps to study the spatial distributions of

    properties of proteins, and in particular those of local vicinities in

    structure space such as structural density and functional diversity.

    These maps provide a unique broad view of protein space and thus

    reveal previously undescribed fundamental properties thereof. Atthe same time, the maps are consistent with previous knowledge

    (e.g., domains cluster by their SCOP class) and organize in a unified,

    coherent representation previous observation concerning specificprotein folds. To investigate the functionstructure relationship,

    we measure the functional diversity (using the Gene Ontology con-

    trolled vocabulary) in local structural vicinities. Our most striking

    finding is that functional diversity varies considerably across struc-

    ture space: The space has a highly diverse region, and diversity

    abates when moving away from it. Interestingly, the domains in

    this region are mostly alpha/beta structures, which are known to

    be the most ancient proteins. We believe that our unique perspec-

    tive of structure space will open previously undescribed ways of

    studying proteins, their evolution, and the relationship between

    their structure and function.

    global map of protein universe protein function prediction

    protein structure universe

    Investigating protein structure space and its relationship to func-tion space is a fundamental scientific challenge. Characterizingthis relationship may also carry practical implications to proteinfunction prediction, whereby one wishes to infer the biologicalrole of a protein from its structure [as is the case with manyof the structures solved in the high-throughput pipeline of theStructural Genomics projects (1, 2)]. One way to approach thischallenge is to represent protein structure space by three-dimen-sional maps. Maps of structure space were first introduced byHolm and Sander (3) and were later used by Kim and colleagues(46). To calculate their maps, they first calculate the structuralsimilarity between all pairs of protein structures. Then, they usemultidimensional scaling (MDS) to find a collection of points in

    three dimensions, each of which corresponds to a protein, andwhere the distance between any two points depends on the struc-tural similarity of the proteins they represent. Such a representa-tion provides a comprehensive visual view of structure space,

    which is not constrained by a hierarchical system such as theStructural Classification of Proteins (SCOP) (7).

    We propose an efficient way to calculate maps of protein struc-ture space, using the recently introduced FragBag model (8).Using FragBag, we represent each structure as a point in a high-dimensional space and project these points to three dimensions.It was recently shown that the similarity between the FragBag

    vectors, or the points in the high-dimensional space, can identifynear structural neighbors as accurately as the state-of-the-artstructural aligners STRUCTAL and CE, for several definitionsof near structural neighbors (8). Because FragBag models struc-

    tures as fixed-size vectors, we can replace MDS with a more effi-cient procedure, Principal Component Analysis (PCA) (9). Thus,

    we can map a very large set of >30;000 protein structures. Ratherthan studying single structures, we study properties such asstructural density and functional diversity, which are defined ateach point of structure space through a whole collection of struc-tures in the vicinity of that point. By coloring the maps accordingto the values of these properties, we are able to visualize theirdistribution across structure space. This way we discover thatstructure space has a region of high functional diversity and thatthis region consists mainly of alpha/beta structures, which areknown to be the most ancient proteins (10). We believe thatstudying such maps holds great promise to revealing important

    properties of protein structure space, its relation to function, andperhaps even to sequence.

    Results

    Constructing Functional Diversity Maps of Protein Structure Space. Tostudy protein structure space we analyze a set of 31,155 SCOP

    v1.71 (7) domains. We initially represent each such domain bya 400-long FragBag vector, which may be thought of as a pointin 400-dimensional space. In the FragBag model, a protein struc-ture is represented by a count vector of backbone fragments takenfrom a library of 400 commonly occurring 12-residue fragments.For each contiguous (and overlapping) 12-residue segment alongthe protein backbone, we identify the library fragment that fits itbest in terms of RMSD after optimal superposition. The ith entryin the FragBag vector is the number of times the librarys ith

    fragment was found to be the best fit. The FragBag distance be-tween two domains is the distance between their FragBag vectors.We have recently shown that this distance is a good approxima-tion of the structural distance, as quantified by structural align-ment (8). Using principal component analysis (PCA) (11), wethen project the points to three-dimensional space. The eigenva-lues of the resulting data covariance matrix (Fig. S1) drop sharplyand the fourth largest eigenvalue (0.0106) is 8% of the firstlargest eigenvalue (0.1326); this indicates that three dimensionscan adequately represent the essential features of protein struc-ture space. Fig. 1 BD shows a three-dimensional map of proteinstructure space, in which each domain is colored by its SCOPclass (7); we show three views of the map from three angles, toget a better sense of it. As expected, the domains cluster by their

    SCOP class.The density of protein structure space is uneveni.e., certainregions have more domains per unit volume than others. Thiscan be seen in Fig. 1 FH, which shows again the three viewsof the map, now colored according to the density score of eachdomainthe number of domains that are within a 0.005 distance

    Author contributions: M.O. and R.K designed research, R.K performed research; R.K

    analyzed data; and M.O and R.K wrote the paper.

    The authors declare no conflict of interest.

    This article is a PNAS Direct Submission.

    1To whom correspondence may be addressed. E-mail: [email protected] or trachel@cs.

    haifa.ac.il.

    This article contains supporting information online at www.pnas.org/lookup/suppl/

    doi:10.1073/pnas.1102727108/-/DCSupplemental.

    www.pnas.org/cgi/doi/10.1073/pnas.1102727108 PNAS July 26, 2011 vol. 108 no. 30 1230112306

    http://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1102727108/-/DCSupplemental/Appendix.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1102727108/-/DCSupplementalhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1102727108/-/DCSupplementalhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1102727108/-/DCSupplementalhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1102727108/-/DCSupplementalhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1102727108/-/DCSupplementalhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1102727108/-/DCSupplementalhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1102727108/-/DCSupplementalhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1102727108/-/DCSupplementalhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1102727108/-/DCSupplemental/Appendix.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1102727108/-/DCSupplemental/Appendix.pdf
  • 8/3/2019 Maps of protein structure space reveal a fundamental relationship between protein structure and function

    2/23

    from it. Certain proteins are more studied than others, and as aresult, more variants thereof are included in our dataset. To ruleout this bias as the source of the observed uneven density, weprepared similar density maps, based on 40% and 95% sequencenonredundant subsets of the original data (containing 2,517 and4,238 domains, respectively). The results, shown in Fig. S2, arequalitatively identical to the original density map, and the corre-lation between the original density scores and those based on therestricted sets are very high (r 0.945 and r 0.960 for the 40%and the 95% sets, respectively). In the remainder of this study, weuse the full dataset.

    Upon inspecting Fig. 1, one can see that there is a relation

    between the SCOP class and the density score of a domain.Fig. 2A, which is a histogram of the density scores of the domains,color-coded by their SCOP class, shows this more clearly: Thealpha beta (yellow) and the all-beta (red) domains tend to re-side in low-density regions, whereas the all-alpha (blue) domainsconstitute the vast majority in the very high-density regions.

    Next, we investigate how functional diversity varies acrossstructure space; for this, we quantify the functional diversity inthe vicinity of each domain in our dataset. We consider three de-finitions for the vicinity of a domain d: (i) Vfn is a fixed number(100) of the nearest structural neighbors ofd, (ii) Vsamp is a sam-ple of fixed size (100) from the domains that lie within a fixedstructural distance (0.005) from d, and (iii) Vfd is the collectionof all domains that are within some fixed structural distance(0.005) from d. Although Vfd is perhaps the most natural defini-

    tion, it makes the vicinities of domains in denser regions containfar more members, which may bias the results.

    Our measure for the functional diversity in a vicinity of a pro-tein, however vicinity is defined, is the number of distinct func-tions that the domains within this vicinity possess. To determinefunction, we use the functional annotations of the proteins fromthe Gene Ontology molecular function (GO-MF) controlled

    vocabulary (12), and the mapping of terms to SCOP domainscalculated by Lopez and Pazos (13). When a single domain is an-

    notated as having more than one function, we include all its func-tions toward the count.

    Structure Space Has a Core of High Functional Diversity. Fig. 3 BDshows a functional diversity map of protein structure space. Thedomains in the map are color-coded according to the functionaldiversity of their vicinities (red for the most diverse ones; blue forthe least diverse), and vicinity is defined to be Vsamp (when there

    were fewer than 100 domains within this distance, all wereincluded). This map shows a striking pattern: Protein space hasa highly diverse core, and diversity drops gradually toward itsperiphery (we denote the high diversity region core, because ofits location in our maps). Figs. S3 and S4 show the maps con-structed using the two alternative definitions of a vicinity, Vfnand Vfd; the results are very similar.

    As a control for the validity of our finding, we re-created thediversity map (using Vsamp again) after randomly permuting thefunctional annotations across all domains (i.e., the set of func-tional annotations originally associated with each domain wasassociated with a different, randomly chosen domain). If our find-ing were merely an artifact of the projection to three dimensions,or of some feature of protein structure space (say, the unevendensity), the resulting diversity map would show again a highlydiverse core. Fig. 3 FHshows that this is not the case: Under the

    Fig. 1. Maps of protein structure space. Each point represents a SCOP

    domain, and the distance between any two points approximates the struc-

    tural distance between their corresponding domains. BD show the map of

    the SCOP classes: As expected, the points are clustered. FH show the struc-

    tural density map, where the color of each point indicates the number of

    domains that lie in its vicinity of fixed distance (denoted Vfd). We see that

    the highest density is within the regions of the all-alpha domains, followed

    by a region in the alpha/beta domain and in the all-beta domain. Fig. S2

    shows a similar density map when considering sequence nonredundant

    samples of the protein world.

    B

    A C1,600

    Fig. 2. Structural density and functional diversity by SCOP class. We calculate

    the separate histograms of structural density (A) and functional diversity

    (B) of each of the SCOP classes and stack them one on top of the other.

    We see that the densest regions are populated by all-alpha domains, and the

    most functionally diverse regions by the alpha/beta domains. See Table S2

    (listing the exact proportions of each of the SCOP classes, among the top

    10%20% most dense/functionally diverse domains) and Fig. S12 for support-

    ing evidence.

    12302 www.pnas.org/cgi/doi/10.1073/pnas.1102727108 Osadchy and Kolodny

    http://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1102727108/-/DCSupplemental/Appendix.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1102727108/-/DCSupplemental/Appendix.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1102727108/-/DCSupplemental/Appendix.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1102727108/-/DCSupplemental/Appendix.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1102727108/-/DCSupplemental/Appendix.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1102727108/-/DCSupplemental/Appendix.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1102727108/-/DCSupplemental/Appendix.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1102727108/-/DCSupplemental/Appendix.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1102727108/-/DCSupplemental/Appendix.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1102727108/-/DCSupplemental/Appendix.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1102727108/-/DCSupplemental/Appendix.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1102727108/-/DCSupplemental/Appendix.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1102727108/-/DCSupplemental/Appendix.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1102727108/-/DCSupplemental/Appendix.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1102727108/-/DCSupplemental/Appendix.pdf
  • 8/3/2019 Maps of protein structure space reveal a fundamental relationship between protein structure and function

    3/23

    random permutation, the diversity score of almost all domains isvery high (colored in orange and red), and the map has no pro-

    minent diverse core; the relatively few domains with low diversityscores (colored in blue) are mostly isolated domains, having fewerthan 100 neighbors within a 0.005 distance, and thus necessarilyless diverse vicinities (there are 5,356 such domains). The exis-tence of the diverse core is indeed a statistically significant finding(p < 0.005; see Methods for details). When using Vfn, the resultsare very similar (Fig. S3); as expected, when using Vfd, diversity ishighly correlated (r 0.953) with density, because domains indenser regions now have more members in their vicinities, andthus more functional annotations (Fig. S4).

    We can reliably predict the functional diversity of structures ina randomly chosen test set, using the mapping calculated for atraining set. Our test set consists of 250 randomly chosen struc-tures from the sequence nonredundant set (using a 40% sequenceidentity threshold); it has 52, 40, 92, 52, and 14 domains of the

    SCOP classes all-alpha, all-beta, alpha/beta, alpha beta, andothers, respectively. The training set has the 29,014 domains thatshare no sequence similarity with the test set proteins (BLASTE-value threshold of103 and sequence identity of 40%). UsingPCA of the training set FragBag data, we calculate the projectionPtrain to R

    3. For each test set proteins p, we calculate Ptrainp andidentify the structures in ps training-set vicinity. The predictedfunctional diversity score is the number of unique GO-MF terms

    within this vicinity. Fig. S5 plots the predicted functional diversity

    scores vs. the ones calculated using the complete dataset for thethree definitions of vicinity, Vsamp, Vfn, and Vfd, and shows thatthese scores are highly correlated (r > 0.96).

    A potential explanation for the high functional diversity inthe core is that the core contains a high proportion of multiple-function domains, compared to the periphery (recall that multi-ple-function domains contribute all their functions toward thediversity). This is not the case: Fig. S6 shows a functional multi-plicity map of structure space, i.e., a map in which each point iscolored according to the number of GO-MF annotations of thedomain it represents. The high functional diversity core seen inFig. 3 and Fig. S3 does not overlap with a region of high func-tional multiplicity. Further, we see the highly diverse core evenafter reconstructing the functional diversity maps using onlydomains annotated by only one function (61% of the data);

    see Fig. S7.Another potential, yet invalid, explanation for the high func-

    tional diversity in the core is related to the uneven degree ofdetail in the GO-MF vocabulary. The GO is implemented as ahierarchical directed graph, in which the terms are placed atthe nodes and the edges direct from the general to the specific.The level of detail in the GO-MF graph is uneven: Some areasare better studied and correspond to subgraphs of the GO-MFgraph that have more levels and, ultimately, more functionalannotations. In addition, proteins of the same function some-times have annotations at different levels (14). One could arguethat perhaps the proteins that lie in the core happen to have func-tions that are described in finer detail, and the apparent highdiversity of this core is merely an artifact of the uneven level ofdetail in the GO-MF graph. To demonstrate that this, again, is notthe case, we create functional diversity maps based on Watson etal.s GO-slim controlled vocabulary (14). GO slim is a trimmed

    variant of GO-MF in which function is defined more broadly, byonly 190 terms (out of >7;800); in particular, GO slim targets alevel of detail in which neighboring proteins in structure spacehave similar functions. Fig. S8 shows a map that was constructedsimilarly to the one in Fig. 3 and Fig. S3, except that the functionannotations are replaced by their more general terms in theGO-slim graph. Once again we see the same phenomenon: adiverse core and more homogeneous periphery. Indeed, these al-ternative scores are highly correlated with the original diversityscore (r > 0.895); see Table S1.

    We also consider three alternatives to the functional diversityscore used above. Two of these alternatives are based on a

    weighted count of distinct GO-MF terms within a vicinity, ratherthan on a simple count. In the first, commonly occurring termshave a lower weight, and in the second, more specific terms(i.e., ones that are farther from the root in the GO-MF graph)have a lower weight. In the third alternative, the score is basedon the coherence measure proposed in refs. 15 and 16, whichquantifies the contribution of a functional annotation term toa vicinity based on statistical tests. When using vicinity definitionsVsamp, and Vfn, these alternative scores are correlated with theoriginal diversity score (r > 0.79); see Table S1. Indeed, the func-tional diversity maps under each of the three alternative scores,shown in Figs. S9S11, look similar to the one in Fig. 3.

    Characterizing the Cores Structures. A comparison of the func-tional diversity maps (Fig. 3 BD) and the SCOP-class maps

    Fig. 3. Functional-diversity map of protein structure space. The color of a

    point indicates the degree of functional diversity measured by the numberof distinct GO-MF terms annotating the domains in its vicinity. Here, we use

    the Vsamp definition for a vicinity of a protein: a sample of fixed size from all

    domains that fall within a fixed distance from it. AD show the functional

    diversity for the true data; EH show the functional diversity of a random

    world, in which the proteins have the same structures, yet their functions

    are assigned at random. We see that when using the true functional annota-

    tions, there is a core of high functional diversity, and that functional diversity

    drops toward the periphery. Alternatively, when the functions are assigned

    at random, there is no such core, and function diversity is uniformly high. The

    figures in SI Appendix, and Table S1, show that the results are qualitatively

    similar when using alternative datasets, scoring functions, and the more

    uniform (coarser) annotation graph GO slim.

    Osadchy and Kolodny PNAS July 26, 2011 vol. 108 no. 30 12303

    http://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1102727108/-/DCSupplemental/Appendix.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1102727108/-/DCSupplemental/Appendix.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1102727108/-/DCSupplemental/Appendix.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1102727108/-/DCSupplemental/Appendix.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1102727108/-/DCSupplemental/Appendix.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1102727108/-/DCSupplemental/Appendix.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1102727108/-/DCSupplemental/Appendix.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1102727108/-/DCSupplemental/Appendix.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1102727108/-/DCSupplemental/Appendix.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1102727108/-/DCSupplemental/Appendix.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1102727108/-/DCSupplemental/Appendix.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1102727108/-/DCSupplemental/Appendix.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1102727108/-/DCSupplemental/Appendix.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1102727108/-/DCSupplemental/Appendix.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1102727108/-/DCSupplemental/Appendix.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1102727108/-/DCSupplemental/Appendix.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1102727108/-/DCSupplemental/Appendix.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1102727108/-/DCSupplemental/Appendix.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1102727108/-/DCSupplemental/Appendix.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1102727108/-/DCSupplemental/Appendix.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1102727108/-/DCSupplemental/Appendix.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1102727108/-/DCSupplemental/Appendix.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1102727108/-/DCSupplemental/Appendix.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1102727108/-/DCSupplemental/Appendix.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1102727108/-/DCSupplemental/Appendix.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1102727108/-/DCSupplemental/Appendix.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1102727108/-/DCSupplemental/Appendix.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1102727108/-/DCSupplemental/Appendix.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1102727108/-/DCSupplemental/Appendix.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1102727108/-/DCSupplemental/Appendix.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1102727108/-/DCSupplemental/Appendix.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1102727108/-/DCSupplemental/Appendix.pdf
  • 8/3/2019 Maps of protein structure space reveal a fundamental relationship between protein structure and function

    4/23

    (Fig. 1 BD) reveals that the core of high functional diversity con-sists mainly of alpha/beta domains (colored in green). Fig. 2B andFig. S13 A and B show this finding in another way, via histogramsdetailing the contribution of each SCOP class to the functionaldiversity scores. Table S2 lists the exact proportions of the variousSCOP classes among the domains with the top 10% and 20%functional diversity scores; in all cases (including when consider-ing the diversity scores only within the sequence nonredundantsets), the majority of the high functional diversity domains arealpha/beta proteins.

    Fig. 4 highlights several SCOP folds that lie in the diverse coreof structure space. The full dataset is shown in Fig. 4A with ablack outline; Fig. 4 BF show specific SCOP folds within this

    outline, alongside the histograms of their functional diversityscores. The most obvious candidate for the SCOP fold whosestructures lie in the core is the TIM barrels (c.1), which are wellknown to accommodate many functions (2). Indeed, these lie inthe core, and their functional diversity scores are clearly highercompared with the full dataset (Fig. 4B). We see, however, thatthe TIM barrels are only a part of the picture, as the core containsalso many other domains. Fig. 4C shows SCOP fold adenine nu-cleotide alpha hydrolase-like (c.26) that was also noted as accom-modating many functions (1) and also lies within the core.

    To identify more SCOP folds in the core, we search for foldswith (more than 25) domains that lie in functionally diverse vici-nities. We quantify the diversity of a SCOP fold by the averageand the median of the diversity scores of its domains, using thediversity scores based on the three definitions of vicinity. Table S3

    lists the 20 most diverse folds under these measures: Each of theresulting six measures identifies different SCOP folds as the mostdiverse. To identify SCOP folds that are truly diverse, we considerfolds that are among the 20 most diverse folds under all sixmeasures. Nine folds satisfy this condition: 7-stranded beta/alphabarrel (c.6), ClpP/crotonase (c.14), methylglyoxal synthase-like(c.24), arginase/deacetylase (c.42), phosphorylase/hydrolase-like(c.56), alpha/beta-hydrolases (c.69), AraD-like aldolase/epimer-ase (c.74), amidase signature enzymes (c.117), protein kinase-like (PK-like) (d.144). As expected, the domains in these folds areindeed located in the core; Fig. 4 DF shows three examples.

    Better Predicting of Function from Structure in Regions of Low

    Functional Diversity. We use the set of 90 proteins* studied by Wat-son et al. (14) to assess if one can indeed better predict functionfor proteins in regions of structure space having low functionaldiversity. Watson et al. predicted function using global structuralsimilarity [as detected by secondary-structure matching (SSM)(17)] and evaluated the correctness of their predictions. Fig. S13maps the protein structures used in their experiment: on the rightthese structures within our dataset, and on the left, the samestructures with markers indicating if the prediction was correct.We see that Watson et al. better succeed in predicting the func-tion of proteins that lie in regions of low functional diversity.

    Fig. 4. SCOP folds that lie in the functionally diverse core. We highlight the location in structure space of specific SCOP folds and show histograms of the

    diversity of the domains of these folds; for comparison, A shows the full dataset (a copy of Fig. 3 A and B) outlined in black. B and Cshow two SCOP folds that

    are known to be functionally diverse, the TIM barrel fold (c.1) and the adenine nucleotide alpha hydrosase-like fold (c.26). Indeed, the domains of these two

    folds are located in the highly diverse core of structure space. There are, however, many other domains in the core. DF show three more examples of SCOP

    folds that lie in the highly diverse core: phosphorylase/hydrolase-like (c.56), alpha/beta-Hydrolases (c.69), and protein kinase-like (PK-like, d.144), respectively.

    Table S3 lists the mean and average functional diversity scores for several SCOP folds that lie in the core.

    *Denoted the known-function dataset; 1nrh, 1tea were removed because they are

    obsolete.

    12304 www.pnas.org/cgi/doi/10.1073/pnas.1102727108 Osadchy and Kolodny

    http://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1102727108/-/DCSupplemental/Appendix.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1102727108/-/DCSupplemental/Appendix.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1102727108/-/DCSupplemental/Appendix.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1102727108/-/DCSupplemental/Appendix.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1102727108/-/DCSupplemental/Appendix.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1102727108/-/DCSupplemental/Appendix.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1102727108/-/DCSupplemental/Appendix.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1102727108/-/DCSupplemental/Appendix.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1102727108/-/DCSupplemental/Appendix.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1102727108/-/DCSupplemental/Appendix.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1102727108/-/DCSupplemental/Appendix.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1102727108/-/DCSupplemental/Appendix.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1102727108/-/DCSupplemental/Appendix.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1102727108/-/DCSupplemental/Appendix.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1102727108/-/DCSupplemental/Appendix.pdf
  • 8/3/2019 Maps of protein structure space reveal a fundamental relationship between protein structure and function

    5/23

    We quantify this difference by separating the proteins to twosets, according to their functional diversity, and comparing thesuccess rate in these sets. The first set consists of 35 proteins hav-ing high diversity (45) vicinities, and the second consists of 55proteins having low diversity (

  • 8/3/2019 Maps of protein structure space reveal a fundamental relationship between protein structure and function

    6/23

    This is a generalization of a call for caution recently made withrespect to function prediction for TIM barrels (24). Indeed, ouranalysis of Watson et al.s data (14) shows that they were moresuccessful in predicting function from structure for proteins lyingin less diverse regions of structure space. Thus, it seems that onecould use the functional diversity maps to better choose the para-meters of structure-based function prediction, according to thelocation of the target protein in structure space, and perhaps evento assign confidence levels for the prediction.

    Materials and MethodsRepresenting Protein Domains in 400 Dimensional Space. For each domain in

    the dataset, we calculate FragBag (8) description vectors of length L 400

    based on a library of 400 12-mer fragments (http://cs.haifa.ac.il/~ibudowsk/

    libraries/centers400_12.txt); each entry in the vector is the number of times

    thecorresponding library fragment wasthe best approximation of anyof the

    12-mer fragments in the backbone of the represented protein. A list of one

    or more GO-MF annotations is associated with each domain. Our dataset

    includes N 31;155 SCOP v1.71 (7) domains for which Lopez and Pazos

    (13) provide a GO annotation. We have previously shown that the cosine dis-

    tance between two FragBag vectors best approximates the structural align-

    ment score (SAS) (25) between their corresponding structures (8). Notice that

    the Euclidean (norm 2) distance between two FragBag vectors that were nor-

    malized to length 1 is exactly twice their cosine distance. To see this, consider

    p1 and p2 two FragBag vectors, and let ^p1 p1p1

    , ^p2 p2p2

    be the normalized

    vectors. The cosine distance between p1 and p2 is 1 cosp1;p2 1 ^p1T ^p2;

    the Euclidean distance between the normalized vectors is

    ^p1 ^p2T^p1 ^p2 ^p1

    T^p1 ^p2T^p2 2^p1

    T^p2 2 2^p1T^p2

    21 ^p1T^p2:

    Thus, we normalize all FragBag vectors and consider the Euclidean (norm 2)

    distance; because all distances are relative, the uniform factor 2 is of no

    consequence.

    Projecting to Three Dimensions. We store the normalized descriptions of

    length L 400 of the N structures in our dataset in an L N matrix and

    project it to three dimensions using PCA. Namely, after centering the

    L N coordinates about the origin (by subtracting their mean), we calculate

    the L L covariance matrix (normalized by N) and find the eigenvectors cor-

    responding to its three largest eigenvalues. By multiplying theseeigenvectors(a 3 L matrix) by the L N data matrix, we find the 3 N matrix that is the

    projection of our data to three dimensions. There, the Euclidean (norm 2)

    distances between two 3D vectors is an approximation of their Euclidean

    (norm 2) distances in L dimensions. We emphasize that this requires only

    the easy computation of finding the top three eigenvalues and eigenvectors

    of the relatively small L L matrix. This is in contrast to the slightly different

    calculation done in previous studies: Given N structures, they calculate a sym-

    metric matrix D of size N N of all pairwise structural distances and use MDS

    to find the coordinates of the points representing these N structures in three

    (or two) dimensions (3, 5). The technical bottleneck in the MDS calculation is

    finding the top three (or two) eigenvalues and eigenvectors of an N N

    matrix derived from D (26); it is a challenging computation for datasets of

    several tens of thousands proteins. Indeed, the datasets in previous studies

    were smaller (e.g., less than 1,900 structures in ref. 6).

    Calculating Alternative Functional Diversity Scores.Each of the domains in thedataset has a list of its GO-MF terms; in each case, the terms are the most

    specific ones (rather than the term and all its parents). For each term, we

    calculate its weighted functional diversity in two ways: (i) (110* the fraction

    of its occurrence), where the fraction of its occurrence is the fraction of

    domains that are annotated by it; the scaling factor was determined to

    be 10, to better space the range of values in the dataset. (ii) The inverse

    of the depth of the term in the GO-MF annotation graph; the depth is

    the number of times we can replace the terms by more general ones until

    we reach the root. There are seven cases (out of 9,500) in which a term

    has two different depths, and these differ by at most three (this is a conse-

    quence of GO being a graph rather than a tree). In these cases, we use the

    average depth. To calculate the coherence measure, we check for each

    term and vicinity if the term is enriched in the vicinity, i.e., if it appears

    at a rate that is statistically significant. The coherence is the percent of

    the terms in a region that are enriched. Thus, the coherence measure is a

    value between 0100%, and high coherence implies low diversity and viceversa; see ref. 16 for more details. Finally, the GO-slim annotation of a func-

    tional term is the most specific parent(s) of the term that is present in the

    GO-slim annotation graph.

    Measuring the Spatial Spread of the Core in True and Random Associations of

    Functional Annotations to Structures. We measure the spatial spread of the

    most diverse domains by their average distance from their center of mass.

    We consider two definitions of the most diverse proteins: ( i) all domains

    whose diversity scores are greater than 0.8 max_diversity, where max_

    diversity is the highest diversity score found in our dataset; (ii) the 20% most

    functionally diverse proteins. We measure the spatial spread of the most

    diverse domains in our dataset, and in 300 random assignments of the func-

    tional annotations tolocations in structurespace. The average distance in the

    true dataset for these two definitions is 0.0860 and 0.1131, respectively. In

    the random permutations, the average distances are 0.3501 0.0151 and

    0.3499 0.0220, respectively, resulting in a p value < 0.0033.

    ACKNOWLEDGMENTS. We thank Yuval Nov, Golan Yona, Chen Keisar, and ouranonymous reviewers for their helpful comments. R.K was supported by theMarie Curie IRG Grant 224774.

    1. Redfern OC, Dessailly B, Orengo CA (2008) Exploring the structure and function

    paradigm. Curr Opin Struct Biol 18:394402.

    2. Friedberg I (2006) Automated protein function predictionthe genomic challenge.

    Brief Bioinform 7:225242.

    3. Holm L, Sander C (1996) Mapping the protein universe. Science 273:595603.

    4. Hou J, Jun SR, Zhang C, Kim SH (2005) Global mapping of the protein structure space

    and application in structure-based inference of protein function. Proc Natl Acad Sci

    USA 102:36513656.

    5. Hou J, Sims GE, Zhang C, Kim SH (2003) A global representation of the protein fold

    space. Proc Natl Acad Sci USA 100:23862390.

    6. Choi IG, Kim SH (2006) Evolution of protein structural classes and protein sequence

    families. Proc Natl Acad Sci USA 103:1405614061.

    7. MurzinAG, Brenner SE,HubbardT,ChothiaC (1995)SCOP: A structural classificationof

    proteins database for the investigation of sequences and structures. J Mol Biol

    247:536540.

    8. Budowski-Tal I, Nov Y, Kolodny R (2010) FragBag, an accurate representation of

    protein structure, retrieves structural neighbors from the entire PDB quickly and

    accurately. Proc Natl Acad Sci USA 107:34813486.

    9. Tenenbaum JB, Silva Vd, Langford JC (2000) A global geometric framework for

    nonlinear dimensionality reduction. Science 290:23192323.

    10. Winstanley HF, Abeln S, Deane CM (2005) How old is your fold? Bioinformatics

    21:i449i458.

    11. Jolliffe IT (2002) Principal Component Analysis (Springer, New York).

    12. Harris MA, et al. (2004) The Gene Ontology (GO) database and informatics resource.

    Nucleic Acids Res 32:D258261.

    13. Lopez D, Pazos F (2009) Gene Ontology functional annotations at the structural

    domain level. Proteins 76:598607.

    14. Watson JD, et al. (2007) Towards fully automated structure-based function prediction

    in structural genomics: A case study. J Mol Biol 367:15111522.

    15. Segal E, et al. (2003) Module networks: Identifying regulatory modules and their

    condition-specific regulators from gene expression data. Nat Genet 34:166176.

    16. Slonim N, Atwal GS, Tkacik G, Bialek W (2005) Information-based clustering. Proc Natl

    Acad Sci USA, 102 pp:1829718302.

    17. Krissinel E, Henrick K (2004) Secondary-structure matching (SSM), a new tool for fast

    protein structure alignment in three dimensions. Acta Crystallogr D Biol Crystallogr

    60:22562268.

    18. Kolodny R, Petrey D, Honig B (2006) Protein structure comparison: implications for

    the nature of fold space, and structure and function prediction. Curr Opin Struct Biol

    16:393398.

    19. Petrey D, Honig B (2009) Is protein classification necessary? Toward alternative

    approaches to function annotation. Curr Opin Struct Biol 19:363368.

    20. Sims GE, Choi IG, Kim SH (2005) Protein conformational space in higher order phi-Psi

    maps. Proc Natl Acad Sci USA 102:618621.

    21. Kolodny R, Koehl P, Levitt M (2005) Comprehensive evaluation of protein structure

    alignment methods: Scoring by geometric measures. J Mol Biol 346:11731188.

    22. Melvin I, Weston J, Noble WS, Leslie C (2011) Detecting remote evolutionary

    relationships among proteins by large-scale semantic embedding. PLoS Comput Biol

    7:e1001047.

    23. Levitt M, Gerstein M (1998) A unified statistical framework for sequence comparison

    and structure comparison. Proc Natl Acad Sci USA 95:59135920.

    24. Loewenstein Y, et al. (2009) Protein function annotation by homology-based infer-

    ence. Genome Biol 10:207.

    25. Subbiah S, Laurents DV, Levitt M (1993) Structural similarity of DNA-binding domains

    of bacteriophage repressors and the globin core. Current Biol 3:141148.

    26. deSilva V, Tenenbaum JB (2004) Sparse multidimensional scaling using landmark

    points. (Stanford Univ, Stanford, CA).

    12306 www.pnas.org/cgi/doi/10.1073/pnas.1102727108 Osadchy and Kolodny

    http://cs.haifa.ac.il/~ibudowsk/libraries/centers400_12.txthttp://cs.haifa.ac.il/~ibudowsk/libraries/centers400_12.txthttp://cs.haifa.ac.il/~ibudowsk/libraries/centers400_12.txthttp://cs.haifa.ac.il/~ibudowsk/libraries/centers400_12.txthttp://cs.haifa.ac.il/~ibudowsk/libraries/centers400_12.txthttp://cs.haifa.ac.il/~ibudowsk/libraries/centers400_12.txthttp://cs.haifa.ac.il/~ibudowsk/libraries/centers400_12.txthttp://cs.haifa.ac.il/~ibudowsk/libraries/centers400_12.txt
  • 8/3/2019 Maps of protein structure space reveal a fundamental relationship between protein structure and function

    7/23

    Supplementary Material

    Figure 1S: Scree plot of the 400 dimensional data. The Figure shows the 20 largest eigenvalues of the

    (normalized) correlation matrix sorted in decreasing order; the insert shows the largest 200

    eigenvalues of this matrix. The sharp drop up to the third eigenvalues suggests that three dimensions

    can adequately represent the essential features of protein structure space.

  • 8/3/2019 Maps of protein structure space reveal a fundamental relationship between protein structure and function

    8/23

    Figure 2S: Density maps of protein structure space for sequence non-redundant subsets. The points

    on the map are colored according to the number of domains that lie within 0.005 distance from them

    in the dataset considered. In the left column, the map is of a subset of size 4238, in which thesequence identity between any two proteins is at most 95%; in the right column, that map is of a

    b f i 2517 i hi h h id i i 40% Th l i ffi i

  • 8/3/2019 Maps of protein structure space reveal a fundamental relationship between protein structure and function

    9/23

    Figure 3S: Functional-diversity maps of protein structure space with vicinity defined as Vfn: The

    points on the map are colored according to their functional diversity measured by the number ofdistinct GO-MF terms annotating the domains in the V fn vicinity of 100 nearest neighbors. In panels (a-

  • 8/3/2019 Maps of protein structure space reveal a fundamental relationship between protein structure and function

    10/23

    Figure 4S: Functional-diversity maps of protein structure space with vicinity defined as V fd: Thepoints on the map are colored according to their functional diversity measured by the number of

    distinct GO-MF terms annotating the domains in the V fdi vicinity of distance 0 005 In panels (a-d) we

  • 8/3/2019 Maps of protein structure space reveal a fundamental relationship between protein structure and function

    11/23

    Figure 5S: The predicted functional diversity score vs. the functional diversity score calculated using the full dataset, for a

    test set of 250 randomly chosen structures: We consider the three definitions of local vicinities: V samp, Vfn, and Vfd. We

    calculate the projection to three-dimensions based on set that does not include the 250 test set proteins and their sequence

    homologues. The predicted functional diversity of a test set protein is the number of unique GO-MF terms in the vicinity of

    the location calculated for the structure using that projection to a lower dimension. In all three cases, the agreement of the

    predicted functional diversity scores and the functional diversity score calculated using the full dataset is very good

  • 8/3/2019 Maps of protein structure space reveal a fundamental relationship between protein structure and function

    12/23

    Figure 6S:

    Functional multiplicity map of

    protein structure space. Each

    protein structure is color-coded

    by the number of its GO-MF

    functional annotations. The

    number of annotations is at most

    7, and in the vast majority of the

    cases (99.4%) it is less than three

    (colored by shades of blue); the

    top panel shows a bar diagram of

    the number of annotations.Below, there are three views of

    structure space. The high

    functional diversity core seen in

    Figure 2 in the main document is

    not due to the small set of

    proteins that are annotated by

    unusually many functions (see

    Figure 6S below for additional

    evidence).

  • 8/3/2019 Maps of protein structure space reveal a fundamental relationship between protein structure and function

    13/23

    Figure 7S: Functional diversity map of protein structure space using only proteins annotated by

    one function: We restrict our attention to domains annotated by at one GO term (61.04% of the

    full data set). The correlation coefficients between the scores calculated using only this subset, and

    when calculating using the full dataset are listed in Table 1S below. We see the high diversity core

    in this dataset as well, meaning it cannot be explained away as a consequence of multiply-

  • 8/3/2019 Maps of protein structure space reveal a fundamental relationship between protein structure and function

    14/23

    Figure 8S: Functional diversity maps of protein structure space using the GO-Slim annotation:

    Here we use a more restricted ontology for the annotation of function, and replace each

    annotation by its most specific parent in the GO-slim graph (1). The maps on the left column use

    the definition of vicinity Vsamp, and on the right Vfn. The correlation coefficients between thesemeasures and the straight-forward count of distinct GO-terms are listed in Table 1S below. We

    h h i i hi hl di d h d i di i d h

  • 8/3/2019 Maps of protein structure space reveal a fundamental relationship between protein structure and function

    15/23

    Figure 9S: Functional diversity maps using alternative scoring functions I. The functional-diversity

    score used to construct these maps is the weighted sum of distinct terms within a vicinity of a

    domain; the weight of a term depends on how common it is in the dataset, with more common terms

    contributing less. The maps on the left column use the definition of vicinity Vsamp, and on the right Vfn.

    The correlation coefficients between these measures and the straight-forward count of distinct GO-

  • 8/3/2019 Maps of protein structure space reveal a fundamental relationship between protein structure and function

    16/23

    Figure 10S: Functional diversity maps using alternative scoring functions II. The functional-

    diversity score used to construct these maps is the weighted sum of distinct terms within a

    vicinity of a domain; the weight of a term is its specificity in the GO annotation graph, with more

    specific terms contributing less (the inverse of its distance from the root). The maps on the left

    column use the definition of vicinity V samp, and on the right Vfn. The correlation coefficients

    between these measures and the straight-forward count of distinct GO-terms are listed in Table1S below. Here too, we see our main finding: a highly diverse core surrounded by less diverse

    regions

  • 8/3/2019 Maps of protein structure space reveal a fundamental relationship between protein structure and function

    17/23

    Figure 11S: Functional diversity maps alternative scoring function III. The functional-diversity

    score used to construct these maps is the 'coherence measure' of the functional GO-MF terms

    in the neighborhood of a protein, as suggested in (4) (5). Thus, this score ranges between 0-

    100%, and higher values imply proteins centered at regions that are less functionally diverse

    (since all their terms are unique to that region (see Methods for details). The maps on the left

    column use the definition of vicinity V samp, and on the right Vfn. The correlation coefficients

    between these measures and the straight-forward count of distinct GO-terms are listed in Table

  • 8/3/2019 Maps of protein structure space reveal a fundamental relationship between protein structure and function

    18/23

    Figure 12S: Functional density scores when sampling the full dataset based on sequence. The vicinity

    of a protein is V fd, and the diversity score is the number of distinct GO-MF terms in the annotations of

    the proteins in the vicinity. The correlation coefficients of this diversity score and the straightforward

    one using the full dataset is listed in Table 1 below. Even when using this very sparsely sampled

    subset, we see the same core of high functional diversity.

  • 8/3/2019 Maps of protein structure space reveal a fundamental relationship between protein structure and function

    19/23

    Figure 13S: Functional diversity by SCOP class with vicinity Vfn and Vfd: We calculate the

    separate histograms of functional diversity for each of the SCOP classes, and stack them one ontop of the other. Table 2S lists the exact proportions of each of the SCOP classes, among the

    top 10%/20% most dense/functionally diverse domains. This supports Figure 3 in that the most

    functionally diverse regions are populated by the alpha/beta domains.

  • 8/3/2019 Maps of protein structure space reveal a fundamental relationship between protein structure and function

    20/23

    Figure 14S: A map of Watson et al. (1) "known-function" dataset. The right panel shows the

    functional diversity of our complete dataset, and the 90 proteins in Watson et al.'s [14] set are

    shown as black dots. The left panel shows the same structures and their marker depends on if

    their function was correctly predicted using global structural similarity (triangles), or not

    (circles). We see that the function of proteins in the periphery of structure space tends to be

    more accurate than the function of the proteins in the core. This statement is quantified in the

    Results section.

  • 8/3/2019 Maps of protein structure space reveal a fundamental relationship between protein structure and function

    21/23

    Table 1S: Pearson correlation coefficients between the functional diversity score1

    on the

    full dataset and alternative scores/datasets

    Alternative

    set

    Vicinity

    definition

    Using only

    domains

    annotated withone function

    Functional

    diversity

    calculate usingGO-Slim

    Weighted score by

    1-10*(fraction of

    term occurrence)

    Weighted score

    by 1/distance

    from root

    Figure 6S Figure 7S Figure 8S Figure 9S

    Vsamp 0.8779 0.8952 0.9394 0.9264

    Vfn 0.8860 0.9299 0.9977 0.9885

    Vfd 0.9747 0.9567 0.9997 0.9983

    Alternative

    set

    Vicinity

    definition

    Coherence as a

    functional (non)

    diversity score

    Density

    Function

    Sampling using 95%

    sequence non

    redundant

    Sampling using

    40% sequence

    non redundant

    Figure 10S Figure 1 Figure 11S Figure 11S

    Vsamp -0.7934 0.4306 0.6141 0.6468

    Vfn -0.8117 0.0926 0.2969 0.3423

    Vfd -0.3030 0.7320 0.8641 0.8881

    Table 2S: SCOP class composition of the most functionally diverse domains

    Data

    Set

    Vicinity

    definition

    Within

    top

    % all

    alpha

    % all

    beta

    % alpha /

    beta

    %alpha +

    Beta

    %

    others

    Full Vsamp 10% 1.99 0.10 78.06 15.19 4.65

    20% 2.93 0.20 76.45 15.55 4.87

    Full Vfn 10% 2.89 1.04 72.35 19.62 4.10

    20% 3.56 1.35 70.52 19.84 4.73

    Full Vfd 10% 1.52 0.13 81.84 12.15 4.36

    20% 4.18 0.15 74.93 15.73 5.02

    NR

    (95%)

    Vfd 10% 4.16 0 79.95 10.76 5.13

    20% 19.01 0.12 63.40 11.92 5.55

    NR

    (40%)

    Vfd 10% 8.54 0 75.61 9.76 6.10

    20% 20.98 0.20 60.49 11.81 6.52

    1

    The functional diversity score of a protein domain is the number of di stinct GO-MF terms annotatingthe set of domains that are in the vicinity of that domain.

  • 8/3/2019 Maps of protein structure space reveal a fundamental relationship between protein structure and function

    22/23

    Table 3S: SCOP folds that lie in the functionally diverse core

    Vfd Vfn Vsamp

    Top 20 means Top 20 medians Top 20 means Top 20 medians Top 20 means Top 20 medians

    SCOP Fold Mean

    Score

    SCOP Fold Median

    Score

    SCOP Fold Mean

    Score

    SCOP Fold Median

    Score

    SCOP Fold Mean

    Score

    SCOP Fold Median

    Score

    c.117 152.5 c.117 155.5 c.24 45.2 c.88 46 c.117 73.7 c.117 73

    c.42 148.5 c.74 151.5 c.88 44.8 c.24 46 c.42 71.2 c. 74 72.5

    c.14 133.8 c.42 151 c.6 43.9 c.74 44 c.74 69.9 c.42 72c.74 132.8 c.14 150 c.69 43.6 c.69 44 c.67 68.2 c.67 69

    c.6 128.9 c.24 135 d.165 42.6 c.6 44 c.24 68.0 c.24 69

    c.24 128.1 c.6 133 c.56 42.3 a.137 43.5 c.6 67.4 d.95 68

    c.67 121.8 c.93 131 c.66 41.7 c.56 43 c.14 67.1 c.36 68

    c.93 118.1 d.95 122 c.74 41.6 d.165 42 c.36 66.6 c.14 68

    c.36 110.6 c.67 121 c.117 41.5 c.93 42 d.174 66.1 c.6 68

    d.174 107.8 d.96 115.5 c.41 41.5 c.66 42 e.26 65.8 c.69 67

    c.1 107.4 c.36 114 c.42 41.3 c.41 42 c.93 65.7 e.26 66

    d.95 107.3 c.1 113 c.14 41.2 c.23 42 c.69 65.3 d.174 66

    c.60 105.8 d.174 112 c.23 41.1 c.14 42 c.79 64.7 c.93 66

    c.79 103.8 c.69 111 c.26 40.9 d.144 41 c.60 64.7 c.56 66

    c.69 103.6 c.60 111 c.45 40.7 c.117 41 c.1 64.6 c.1 66

    c.56 100.1 c.56 106 a.137 40.7 c.61 41 c.7 64.4 d.144 65

    c.7 99.2 c.80 102 c.53 40.4 c.53 41 c.39 64.3 d.96 65

    d.96 97.5 c.79 101 c.60 40.4 c.45 41 c.56 63.7 c.79 65

    d.144 97.0 c.7 101 d.144 40.3 c.42 41 d.144 63.1 c.60 65

    c.39 95.8 d.144 100 c.61 40.2 c.26 41 c. 72 62.4 c.7 65

  • 8/3/2019 Maps of protein structure space reveal a fundamental relationship between protein structure and function

    23/23