Protein network paper 2008

28 Current Protein and Peptide Science, 2008, 9, 28-38

1389-2037/08 $55.00+.00 © 2008 Bentham Science Publishers Ltd.

Proteins As Networks: Usefulness of Graph Theory in Protein Science

Arun Krishnan1, Joseph P. Zbilut2, Masaru Tomita1 and Alessandro Giuliani3,*

1Institute of Advanced Biosciences, Keio University, Tsuruoka City, Japan; 2Molecular Biophysics and Physiology Dept., Rush University Medical Center, Chicago, USA; 3Environment and Health Dept., Istituto Superiore di Sanita’, Viale Regina Elena 299, 00161, Roma, Italy

Abstract: The network paradigm is based on the derivation of emerging properties of studied systems by their representation as oriented graphs: any system is traced back to a set of nodes (its constituent elements) linked by edges (arcs) corresponding to the relations existing between the nodes. This allows for a straightforward quantitative formalization of systems by means of the computation of mathematical descriptors of such graphs (graph theory). The network paradigm is particularly useful when it is clear which elements of the modelled system must play the role of nodes and arcs respectively, and when topological constraints have a major role with respect to kinetic ones. In this review we demonstrate how nodes and arcs of protein topology are characterized at different levels of definition:

1. Recurrence matrix of hydrophobicity patterns along the sequence

2. Contact matrix of alpha carbons of 3D structures

3. Correlation matrix of motions of different portions of the molecule in molecular dynamics.

These three conditions represent different but potentially correlated reticular systems that can be profitably analysed by means of network analysis tools.

Keywords: Systems biology, protein folding, recurrence quantification analysis, molecular dynamics, computational biology.

INTRODUCTION

In an important paper [1], Hans Frauenfelder and Peter Wolynes focused on the peculiarity of protein science and specifically on the sequence-structure relation puzzle, where thorough and accurate knowledge of “first principles” and potentials (hydrophobic interactions, hydrogen bonding, size constraints etc.) acting at microscopic levels informs empirical (and very inaccurate) predictions of the actual structure assumed by proteins in solution. Yet the next-level “mesoscopic” principles needed to predict the 3D structure of proteins remained essentially unknown.

Protein folding as a topic has intrigued scientists for many years. Different models have been proposed for protein folding arising from a host of theoretical [2–5], simulated [6–9] or experimental [10-12] techniques. Among the many different models that have been proposed are the classical nucleation-propagation model [13, 14], the nucleation-condensation model [15], the sequential and hierarchical model [15] and the modular model [16]. More recently, a unified model of protein folding that is based on the effective energy surface of a polypeptide chain has been introduced by Wolynes et al. [17]. According to this unified model, protein folding consists of a progressive organization of ensembles of partially folded structures that arise through multiple routes. A detailed review of the different models can be found in [12].

*Address correspondence to this author at the Environment and Health Dept., Istituto Superiore di Sanita’, Viale Regina Elena 299, 00161, Roma, Italy; E-mail: [email protected]

With the discovery of natively unfolded proteins and the increasing recognition of the role played by protein flexibility, the original form of the sequence-structure puzzle moved from a relatively clear (at least in principle) one-to-one mapping between a residue position along a linear chain and a vector in a three-dimensional coordinate system to a much fuzzier formulation [18-20]. The perception of proteins as dynamical systems, in which the relative positions of the residues vary in time, changed the nature of the sequence/structure problem from a uniquely defined prediction task to a more fluid description (and prediction) of high-level features such as flexibility, stability, biological activity and folding rate [20].

Proteins occupy a unique position in the hierarchy of natural systems, since they lie in a grey region between chemistry and biology. From the biological side, although no single protein would be considered alive, it does not take many of them (plus a bit of nucleic acid) before life-like behaviour begins to emerge (as in the case of viruses, for example). Still more puzzling is the behaviour of protein systems such as prion-like particles, which are able to “rearrange” the structural configuration of other systems that come into contact with them [21].

From a chemical viewpoint, proteins are linear heteropolymers that, unlike most synthetic polymers, are formed of basically non-periodic sequences of 20 different monomers [18]. While artificial, periodic polymers are generally very large extended molecules forming a matrix, the great majority of proteins fold as self-contained structures determined by the sequence of monomers.


Thus, we can consider the particular linear arrangement of amino acids as a sort of “recipe” for making a water-soluble polymer with a well-defined three-dimensional architecture, keeping in mind that well-defined does not mean “fixed”, because of the above-mentioned role of dynamics [18].

The task of interpreting this “recipe” is naturally solved by the specific physico-chemical environment (in terms of pH, hydrophobicity, ionic strength, molecular crowding, density, etc.) that transforms linear, static information into 3D dynamical information. Our task as modellers is to try to understand some of these transformation rules.

MULTI SCALE MODELLING

A protein, like any complex system, can be described at different scales of definition in both space and time. We can simplify this characterization by the identification of three main organizational levels: sequence, structure and dynamics.

The first step in protein modelling is finding the most convenient formalization for connecting these three different layers. Such a formalization should have the same basic elements at all three levels (so that a mapping can be easily performed) and, possibly, the same dimensionality of the solution space (so as to avoid inconsistencies).

The amino acid residues are the most natural basic elements for protein modelling given that they are the monomers constituting the polymers under study and are common to all three layers. This results in a physico-chemical description of specific residues at the sequence layer, a position in space for structure and a relative mobility for dynamics. The different characterization of the same elements (amino acid residues) relative to the three organization levels poses a problem for the “dimensionality” aspect of modelling: sequence is one-dimensional, and can be equated to a time series with time substituted by the order of the residues along the sequence, whereas structure is obviously a three-dimensional object. A molecular dynamics simulation lives in an M x N space with N being the number of residues and M (= 4) being the number of coordinates (3 spatial coordinates + time coordinate). These dimensionality considerations tacitly imply an absolute space existing independently from the particular system; what we call an extrinsic geometry [22, 23]. In other words, physico-chemical descriptors or space coordinates have an existence of their own, independent of the particular protein we are describing: they are an absolute system of coordinates.

On the contrary, we can assume an intrinsic (relativistic) geometry in which the protein is no longer defined in terms of an absolute system of coordinates external to it, but by means of the relations between its basic elements in the different spaces [22, 23]. Thus we are no longer concerned with the actual values of the space coordinates of residue 13 of protein A but simply with which residues are at a given distance from residue 13. Similar statements can obviously be made for sequence physico-chemical descriptors (e.g. which residues have a hydrophobicity similar to residue 13) and dynamics (which residues move together with residue 13).

This shift of perspective allows us to project all three representations, namely sequence, structure and dynamics, onto N x N matrices whose elements are the relationships between the N residues in the different spaces. Thus in the case of sequence space, the protein will be represented by an N x N matrix whose columns and rows are the residues and whose elements are the similarities of the specific residue pairs with respect to a particular physico-chemical property [18]. In the case of structure, the same N x N matrix will have as elements the Euclidean distances between the pairs of residues [24], while in the case of dynamics the N x N matrix will report the correlation coefficients of residue trajectories in time [25]. In graph theory these matrices are called adjacency matrices to stress the fact that they report the topological relations between the elements: these topological relations can mirror ordinary 3D Euclidean space (contact matrix) or less intuitive (but nevertheless fully physically definable) spaces like the between-residues physico-chemical similarities (sequence) and the between-trajectories correlations (dynamics).
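As an illustration of how the three layers can share one mathematical object, the following minimal Python sketch builds an N x N relational matrix for each layer of a toy three-residue system. All names and numbers are hypothetical. The similarity measure (negative absolute difference of a property value) and the use of one scalar time series per residue are simplifying assumptions made for this sketch, not the procedures used in the cited works.

```python
import math

def distance_matrix(coords):
    """Structure layer: pairwise Euclidean distances between residue positions."""
    return [[math.dist(a, b) for b in coords] for a in coords]

def similarity_matrix(values):
    """Sequence layer: pairwise similarity of one physico-chemical property.

    Assumption: similarity is the negative absolute difference, so identical
    values score 0 and increasingly different values score more negative.
    """
    return [[-abs(a - b) for b in values] for a in values]

def pearson(x, y):
    """Pearson correlation of two equally long, non-constant series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_matrix(trajectories):
    """Dynamics layer: pairwise correlation of per-residue time series."""
    return [[pearson(a, b) for b in trajectories] for a in trajectories]

# Hypothetical three-residue toy system (not real protein data).
coords = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (0.0, 0.0, 5.0)]
hydrophobicity = [1.0, 3.0, 1.5]
trajectories = [
    [0.0, 1.0, 2.0, 3.0],  # residue 0 displacement over four time steps
    [0.0, 2.0, 4.0, 6.0],  # residue 1 moves together with residue 0
    [3.0, 2.0, 1.0, 0.0],  # residue 2 moves in the opposite direction
]

D = distance_matrix(coords)
S = similarity_matrix(hydrophobicity)
R = correlation_matrix(trajectories)
```

All three results are symmetric N x N matrices indexed by residue, which is the property the intrinsic-geometry argument relies on.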

Fig. 1 reports a pictorial example of these different representations: the intrinsic geometry approach, taking into consideration the relationships between the elements of the system instead of looking for an absolute representation, allows us to use the same mathematical object (an N x N symmetrical matrix) for describing all the different aspects of protein architecture and behaviour, thereby offering a consistent basis for modelling. This provides an easy-to-use method for going from one layer to the other.

In panel a) of Fig. 1 the dynamics of the 1-40 β-amyloid peptide is reported [26]: the elements of the matrix are the covariances of the single-residue trajectories in time, coloured by a scale proportional to their magnitude. The texture of the figure clearly shows the portions of the peptide that move together. Panel b) corresponds to the contact matrix of TNF-α [24]. Here a binary formalization of spatial distances between residues is adopted and a dot is inserted in the matrix for every residue pair whose elements are at a distance less than 6 Å and whose separation along the sequence is at least three residues (so as to avoid trivial contacts). On both axes of the figure are reported the patches of the molecule that are known to be arranged into sheets, so that the correspondence between the elements of secondary structure and patterns in the matrix is made evident.
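The binary contact matrix described for panel b) can be sketched in a few lines of Python. The 6 Å cutoff comes from the text. The coordinates are invented, and the sequence-separation filter is read here as excluding pairs fewer than `min_separation` positions apart, since its stated purpose is to discard trivial contacts between chain neighbours; that reading, like the function and parameter names, is an assumption of this sketch.

```python
import math

def contact_matrix(coords, cutoff=6.0, min_separation=3):
    """Binary contact matrix from alpha-carbon coordinates.

    A pair (i, j) is a contact when the two residues are closer than
    `cutoff` (in angstroms) and at least `min_separation` positions apart
    along the chain, so that trivial near-neighbour pairs are skipped.
    """
    n = len(coords)
    contacts = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            far_in_sequence = (j - i) >= min_separation
            close_in_space = math.dist(coords[i], coords[j]) < cutoff
            if far_in_sequence and close_in_space:
                contacts[i][j] = contacts[j][i] = 1
    return contacts

# Hypothetical five-residue hairpin-like chain (coordinates in angstroms).
coords = [
    (0.0, 0.0, 0.0),
    (3.8, 0.0, 0.0),
    (7.6, 0.0, 0.0),
    (3.8, 3.8, 0.0),
    (0.0, 3.8, 0.0),
]
contacts = contact_matrix(coords)
```

In this toy chain only the pairs (0, 3), (0, 4) and (1, 4) pass both filters; residues 0 and 1 are close in space but adjacent in sequence, so they are not counted.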

Panel c) corresponds to a Recurrence Plot (RP) of the human P53 protein sequence coded in terms of the Miyazawa-Jernigan hydrophobicity scale of the constituent amino acids [18]. As in panels a) and b), here too the elements of the matrix correspond to a relation between the corresponding i-th row and j-th column residues. In this case the relationship is the similarity between the hydrophobicity distributions of two patches of 4 residues having as their first elements the i-th and j-th residue respectively. When the distance between the two distributions is lower than a predetermined threshold, a dot is blackened in the matrix, thereby giving rise to the RP corresponding to a binary N x N matrix. It is immediately noticeable that the natively unfolded trans-activation region of P53 is represented as a strongly recurrent (highly repetitive) box between residues 61 and 98, as is the resemblance of this portion to different patches along the sequence in the form of horizontal (vertical) lines in the plot. The N x N symmetric matrix is thus the common mathematical support for sequence, structure and dynamics of a specific protein.

In order to compare different proteins, or even different matrices of the same protein in a quantitative way we need to derive some invariants out of these symmetric matrices: these descriptors will allow us to compare the different N x N matrices on a quantitative basis [27].

DERIVATION OF QUANTITATIVE DESCRIPTORS FROM ADJACENCY MATRICES.

The network paradigm is the prevailing metaphor in biology; thus we read about gene networks [28-30], metabolic networks [31, 32], ecological networks [33] as well as signalling networks [34]. In its most basic definition, a network is a wiring diagram in which some elements (nodes) are connected by some relations (arcs, edges). The network can be analysed both by purely topological approaches, in which all the nodes and edges are considered as equivalent (as in the case of binary matrices of recurrences and contacts), and by dynamical approaches, in which the relations take the form of differential equations (or correlations, conditional probabilities and so forth).

Here we will concentrate on static approaches in which the network (even when representing a dynamical system like molecular dynamics simulations) can be considered as fully represented by a wiring diagram like the one reported in Fig. 2, in which the nodes are the residues and the arcs the relations between them (similarity in hydrophobicity distribution for sequence, covariance of motions for molecular dynamics, contacts for 3D structures).

There are a number of synthetic numerical descriptors of networks capturing the basic topological features of such systems. For the sake of simplicity, let us start with the simplest wiring diagram of all, in which every arc has the same value, i.e. each contact is a contact and each recurrence is a recurrence, without going deeply into the “intensity” of the relation between the two nodes. In the case of proteins this is a very satisfactory simplification that encapsulates the essentials of the studied systems.

The basic mathematical object for the topological study of networks is the so-called “graph”. A graph is defined as a tuple (V,E), with V a set of vertices (or nodes) and E a set of edges (or arrows, arcs). The degree of a node is the number of arcs connected to it [28].
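The definition of a graph as a tuple (V,E) and of node degree translates directly into code. The sketch below, with hypothetical function names and a made-up four-node adjacency matrix, derives V and E from a binary symmetric matrix and counts the arcs attached to each node. It assumes an undirected graph without self-loops.

```python
def graph_from_adjacency(adj):
    """Return (V, E) for a binary symmetric adjacency matrix.

    Each undirected edge is stored once as a pair (i, j) with i < j.
    """
    n = len(adj)
    vertices = set(range(n))
    edges = {(i, j) for i in range(n) for j in range(i + 1, n) if adj[i][j]}
    return vertices, edges

def degrees(vertices, edges):
    """Degree of every node: the number of edges attached to it."""
    degree = {v: 0 for v in vertices}
    for i, j in edges:
        degree[i] += 1
        degree[j] += 1
    return degree

# Hypothetical four-node network with edges 0-1, 0-2, 0-3 and 1-2.
adj = [
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
]
V, E = graph_from_adjacency(adj)
node_degree = degrees(V, E)
```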

Fig. (1). The figure reports the network formalization in terms of an N x N adjacency matrix for the three layers of protein formalization: a) dynamics, covariance matrix of the motions of the Aβ 1-40 peptide; b) 3D structure, TNF-α contact matrix with the indication of secondary structure elements; c) sequence, hydrophobicity recurrence plot of the P53 protein.

The most basic “network” feature is the possibility of reaching a given node i starting from another node j by a path along the graph: this possibility defines an equivalence relation, “to be connected to”, that partitions a given graph G into equivalence classes called “components”, each made up of all the nodes that are connected to one another. The set of nodes mutually reachable inside a given graph is called a “connected component”. A graph G is called connected if it is made up of only one component. The number of distinct paths connecting two nodes can be considered as an “index of connectivity” of the graph nodes: the greater the number of paths connecting the two nodes, the more strongly the two nodes are correlated, and the higher their connectivity index.
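The partition of a graph into connected components can be obtained with a standard breadth-first search. The following sketch is one common way to do it, not a procedure taken from the cited works, and the five-node adjacency matrix is invented for illustration.

```python
from collections import deque

def connected_components(adj):
    """Partition the nodes of an undirected graph into connected components."""
    n = len(adj)
    seen = set()
    components = []
    for start in range(n):
        if start in seen:
            continue
        # Breadth-first search collects every node reachable from `start`.
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            u = queue.popleft()
            component.append(u)
            for v in range(n):
                if adj[u][v] and v not in seen:
                    seen.add(v)
                    queue.append(v)
        components.append(sorted(component))
    return components

def is_connected(adj):
    """A graph is connected when it consists of exactly one component."""
    return len(connected_components(adj)) == 1

# Hypothetical five-node network with edges 0-1, 1-2 and 3-4.
adj = [
    [0, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
]
components = connected_components(adj)
```

Here nodes 0, 1 and 2 form one equivalence class and nodes 3 and 4 another, so the toy graph is not connected.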

This allows for a straightforward application of clustering algorithms able to delineate “supernodes”, i.e., groups of nodes highly connected with each other and forming functional modules. This metric property of topological graph representation has been exploited in many different fields, extending from organic chemistry (where the graphs are the molecules, with atoms as nodes and chemical bonds as edges [35]) to social networks [36]. In the case of protein domains, a very clear application with amino acid residues = nodes and contacts = edges can be found in [37]. In that paper, by means of pure network topology, it is possible to identify conformations in the folding transition state (TS) ensemble, and to provide a basis for understanding the heterogeneity of the TS and denatured state ensemble as well as the existence of multiple folding pathways. Network topology is described by means of intuitive computations. Each node of the graph can be labelled by the number of nodes connected to it (degree): this gives rise to the degree distribution P(k) describing the “general wiring pattern” of the network, having as the abscissa the number k of connections and as the ordinate the number of nodes having k connections. In analogy with statistical mechanics, these distributions are defined as “scaling laws” [38, 39].
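The degree distribution P(k) described above is a simple histogram of node degrees. The sketch below counts, for each k, the number of nodes having k connections, as in the text. The helper names and the star-shaped toy network are hypothetical, and the adjacency matrix is assumed to be binary with a zero diagonal.

```python
from collections import Counter

def adjacency_from_edges(n, edges):
    """Build a symmetric binary adjacency matrix from a list of node pairs."""
    adj = [[0] * n for _ in range(n)]
    for i, j in edges:
        adj[i][j] = adj[j][i] = 1
    return adj

def degree_distribution(adj):
    """Map each degree k to the number of nodes having exactly k connections."""
    node_degrees = [sum(1 for x in row if x) for row in adj]
    return dict(sorted(Counter(node_degrees).items()))

# Hypothetical star network: node 0 acts as a hub linked to four leaves.
star = adjacency_from_edges(5, [(0, 1), (0, 2), (0, 3), (0, 4)])
p_of_k = degree_distribution(star)
```

For the star, four nodes have one connection and a single node has four, a miniature version of the "many poorly connected nodes, few hubs" pattern.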

Fig. 3 shows two typical kinds of scaling laws (node degree distributions): the panel on the left shows a Poisson distribution, in which there is a privileged scale for the number of connections and a decreasing number of nodes having fewer or more links than average. The panel on the right depicts a so-called “scale-free” network [28, 38], in which there is a large majority of nodes with a low number of connections and a very small number of nodes having a large number of links. These highly connected nodes are called “hubs”. The tendency to have subsets of nodes strongly connected among themselves can be measured by the so-called aggregation coefficient. Consider a generic node i of the network having k(i) edges connecting it to k(i) other nodes. For these nodes to possess the maximal connectivity (each node connected to every other) we should have a total number of edges equal to:

$$\frac{k(i)\,\bigl(k(i) - 1\bigr)}{2} \tag{1}$$

Expression (1) corresponds to the maximal number of connections among k(i) nodes when self-connections are avoided. Thus, it is natural to define the aggregation coefficient as the ratio between the number of connections actually observed among these nodes (Ei) and the maximal number of connections expressed by (1). The aggregation coefficient relative to node i, Ci, is therefore expressed as:

$$C_i = \frac{2 E_i}{k(i)\,\bigl(k(i) - 1\bigr)} \tag{2}$$

The aggregation coefficient for the entire network corresponds to the average of Ci over all the nodes. The operative counterpart of clustering tendency is the concept of modularity, that is, the possibility of isolating portions of a more general network that can be considered as partially independent sub-networks (also known as “modules”) and studied as such, without necessarily referring to the whole network. This is the same as the concept of “stable classification” in classical multidimensional statistics, in which “well behaved” clusters are defined as collections of statistical units very near each other (in network language, having many mutual connections) and distant from the elements of other clusters (in network language, having only a few arcs connecting elements of different modules) [28].
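The aggregation coefficient of a node and its network average can be computed directly from the adjacency matrix. The sketch below counts the links actually present among the neighbours of node i and divides by the maximal number k(i)(k(i) - 1)/2. Assigning a coefficient of 0 to nodes with fewer than two neighbours is a convention assumed here, since the ratio is undefined in that case; function names and the toy network are hypothetical.

```python
def local_aggregation(adj, i):
    """Aggregation (clustering) coefficient of node i.

    Ratio between the links observed among the neighbours of i and the
    maximum k(k - 1)/2 possible among k neighbours. Nodes with fewer than
    two neighbours get 0.0 (an assumed convention).
    """
    neighbours = [j for j, x in enumerate(adj[i]) if x and j != i]
    k = len(neighbours)
    if k < 2:
        return 0.0
    observed = sum(
        1
        for a in range(k)
        for b in range(a + 1, k)
        if adj[neighbours[a]][neighbours[b]]
    )
    return 2.0 * observed / (k * (k - 1))

def average_aggregation(adj):
    """Network-level coefficient: the mean of the local values over all nodes."""
    n = len(adj)
    return sum(local_aggregation(adj, i) for i in range(n)) / n

# Hypothetical four-node network with edges 0-1, 0-2, 0-3 and 1-2.
adj = [
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
]
c0 = local_aggregation(adj, 0)   # one link (1-2) out of three possible
c_mean = average_aggregation(adj)
```

Node 0 has three neighbours with one link among them, giving 1/3; nodes 1 and 2 sit in a closed triangle and score 1; the network average is 7/12.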

The above defined measures, when applied to N x N adjacency matrices (protein networks) in sequence, structure and dynamics spaces, allow us to describe protein systems quantitatively at different levels of specification by means of directly comparable measures. In the following we sketch a path from protein sequence through protein structure and protein dynamics network-based representations (by means of the respective N x N adjacency matrices), showing how the use of a common mathematical support can help to derive some interesting conjectures about the protein folding process.

PROTEINS DISPLAY SIX RESIDUE LONG HYDROPHOBICITY PATTERNS ALONG THEIR SEQUENCE

Protein sequences are, with rare exceptions (e.g. fibrous polymerising proteins such as collagen or silk), quasi-random strings of symbols with scant evidence of order or periodicity: a reliable estimate of the entropy reduction due to the autocorrelation of residues in an average protein sequence is only about 1% [27].

Fig. (2). Isomorphism between network graph formalization and adjacency matrix. The thickness of the edges of the graph and of the squares of the matrix is proportional to the intensity of the relation between the corresponding elements.

Nevertheless, such quasi-random strings are the basic recipes producing refined three-dimensional structures that sustain specific physiological roles. Thus, the observed quasi-randomness may be a specious image obscuring the underlying meaning [27]. It is interesting to note that a similar situation occurs in the case of human languages, where it is almost impossible to generate meaningful texts using just periodic repetitions of symbols. Nevertheless, even if very weak, the presence of regularities in both human texts and protein sequences is of utmost importance for deriving important hints about the underlying message. This concept was set forth in a very clear manner by Rackovsky [40], who was able to demonstrate the presence of specific “sequence signals” in terms of autocorrelation structures of different physical properties when used to code diverse sets of protein sequences. In [27], a set of 1141 eukaryotic protein sequences (ftp://ftp.ebi.ac.uk/pub/contrib/swissprot/testsets) was analysed in search of such “syntactic” sequence regularities and of their possible biochemical role. The 1141 sequences were encoded by means of different physico-chemical properties of residues, thereby obtaining different sets of numerical series, each series corresponding to a specific protein sequence coded by a specific property of its constituent amino acids. Each sequence was submitted to Recurrence Quantification Analysis (RQA), a statistical method widely applied in many diverse fields [18]. As introduced before (see Fig. 1), the Recurrence Plot (RP) of a protein sequence depicts the N x N square matrix of similarities (for the specific physico-chemical property taken into consideration) between patches of amino acids along the sequence, putting a dot for each similarity above a given threshold. The RP can be safely viewed as an adjacency matrix (and consequently a network) whose nodes are the residues and whose links correspond to the scoring of a strong similarity in hydrophobicity (or any other property, but hydrophobicity is by far the most explored one) for the corresponding residue pairs. Such pairs are called recurrent [18] and, in network language, correspond to a pair of nodes linked by an edge.

Fig. 4 reports two such RPs together with the corresponding sequences encoded by means of Miyazawa-Jernigan hydrophobicity, which was demonstrated to be the property code showing the richest autocorrelation structure. RQA generates a numerical set of descriptors for RPs resembling the network invariants described above. The two most basic descriptors are:

1. Percent of Recurrence (REC): Percent of recurrent pairs (corresponding to aggregation coefficient)

2. Percent of Determinism (DET): Percent of recurrent pairs forming diagonal lines in the RP relative to the total of recurrent pairs. (each diagonal line is a path in the network, and DET is the number of paths; so DET corresponds to connectivity).

In order to measure the similarities of patches of residues along the chain we must create a moving “window” scanning the sequence, whose length corresponds to the length of the patches we want to compare. In other words, we must set the number of neighbours, m, of a residue i so as to compare the distribution of the property with the one relative to the m neighbours of residue j. This window of length m is moved one residue at a time so as to consider all the possible pairs of residues along the sequence. The window length is called the “embedding dimension”, using the typical terminology of non-linear dynamics (from which RQA derives). Another choice to be made is the radius r, corresponding to the maximum allowed distance for two windows to be considered as recurrent (the details of the method are fully explained in [18]).
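The windowing procedure and the two descriptors REC and DET can be sketched as follows. This is a simplified illustration, not the RQA implementation of [18]: windows are compared by plain Euclidean distance without any rescaling, REC is computed over the pairs above the main diagonal, and a diagonal line is counted when it contains at least `min_line` consecutive recurrent pairs (set to three here). The function names and the numeric series are hypothetical stand-ins for a hydrophobicity-coded sequence.

```python
import math

def embed(series, m):
    """Slide a window of length m (the embedding dimension) along the series."""
    return [series[i:i + m] for i in range(len(series) - m + 1)]

def recurrence_matrix(series, m, radius):
    """Binary recurrence plot: 1 where two windows are within `radius`."""
    windows = embed(series, m)
    n = len(windows)
    return [
        [1 if math.dist(windows[i], windows[j]) <= radius else 0 for j in range(n)]
        for i in range(n)
    ]

def rec_and_det(rp, min_line=3):
    """Percent recurrence and percent determinism of a recurrence plot.

    Only the upper triangle is scanned, one diagonal at a time, so that
    each pair is counted once and the main diagonal is ignored.
    """
    n = len(rp)
    total_pairs = n * (n - 1) // 2
    recurrent = 0
    in_lines = 0
    for offset in range(1, n):
        run = 0
        for i in range(n - offset):
            if rp[i][i + offset]:
                recurrent += 1
                run += 1
            else:
                if run >= min_line:
                    in_lines += run
                run = 0
        if run >= min_line:  # a line may end at the border of the plot
            in_lines += run
    rec = 100.0 * recurrent / total_pairs if total_pairs else 0.0
    det = 100.0 * in_lines / recurrent if recurrent else 0.0
    return rec, det

# Hypothetical perfectly periodic series with period three.
series = [1.0, 2.0, 3.0] * 4
rp = recurrence_matrix(series, m=2, radius=0.0)
rec, det = rec_and_det(rp)
```

With this toy series, windows recur exactly every three positions, so the plot contains three diagonals of lengths 8, 5 and 2; the shortest falls below `min_line` and does not contribute to DET.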

It can be observed from Fig. 5 that determinism (DET) reaches a maximum for the entire set of proteins at an embedding dimension of four residues; this corresponds to a “typical word length” of six if we consider that scoring a deterministic line requires at least three consecutive recurrent pairs along a diagonal (a window of four residues shifted over three consecutive positions spans 4 + 3 - 1 = 6 residues). This points to a possible characteristic length of six for protein quasi-repeats (patches of amino acids with similar hydrophobicity distribution). It is worth noting that the proteins having the highest DET were those having both the highest percentage of natively unfolded (or very flexible) portions and the highest number of interactions with other proteins [27].

Fig. (3). Two possible scalings of the degree of nodes (k) and their relative frequency (p(k)); left: Poisson scaling, right: scale-free distribution.



Fig. (4). Two Recurrence Plots together with the hydrophobicity-coded sequences that generated them: a) an enzyme (transferase, Swiss-Prot code Q07357) with a relatively low recurrence rate; b) a protein engaged in many protein-protein interactions (GATA binding factor, Swiss-Prot code P52618).

The above evidence was reinforced by the analysis [41] of a different data set of 1977 single-chain protein structures solved by X-ray diffraction and obtained from CATH v2.6.0 (April 2005) (http://cathwww.biochem.ucl.ac.uk/latest/lists/index.html) in Cath List Format (CLF).

In this set too (completely independent of the other one) a characteristic length of six for the “hydrophobic word” was evident. Moreover, the size of the “hydrophobic word” was completely independent of the total length of the protein in which the word is embedded [41], exactly as we would expect for real words, whose length is clearly independent of the total length of the texts that they are part of.

Fig. (5). Determinism scaling at varying embedding dimension for two different choices of recurrence thresholds (radius).

The relevance of a characteristic length of six residues is supported by other avenues of research. Schwartz and King [42] demonstrated a strong bias against blocks of hydrophobic strings deviating from expected frequencies at about six residues of block length. In another study evaluating the potential for protein “knotting”, Lua and Grossberg [43] point out that knots are relatively rare, and that chains beyond six residues quickly increase their chances of interpenetration, thus promoting aggregation. Finally, an analysis of nucleation cores carried out by Compiani et al. [44] on the basis of a structural entropy criterion shows an average length of 6.12 for these cores [27, 44]. This is a crucial point: Compiani et al. looked for sequence elements that maintained their local folding irrespective of the sequence they are embedded in [44]. The characteristic length of 6.12 they found tells us that six-residue patches maintain their “individual features”, thus possibly giving rise to a mutual recognition of “identical words” both inside the same protein (folding cores) and between different proteins (aggregation cores).

The discovery of a characteristic length for the repetition of hydrophobically homogeneous residue patches along the sequence was obtained by purely topological considerations on the basis of an N x N matrix reporting the between-residue similarities in hydrophobicity. This topological characteristic at the sequence level was demonstrated to have both structural implications (proteins with the highest number of repetitive patches tend to be natively unfolded) and functional implications (proteins with many recurrent patches tend to have more protein-protein interactions). It is important to try to correlate this sequence-level feature with analogous features at the level of the network-based representation of 3D structure constituted by the so-called contact matrix (Fig. 1b). In the next section we explore whether the characteristic length of six for hydrophobicity patches, highlighted by graph-theoretical approaches [27, 41] and confirmed by several independent lines of evidence [42, 43, 44], has a direct counterpart in the 3D structure network topology.



PROTEIN STRUCTURAL MODULES (6+6=12)

According to the conjecture of a six-residue characteristic length acting as the basic folding unit through the mutual recognition of two similar “words” along the sequence, we should find some evidence of a basic structural unit of characteristic length 12 when shifting from the sequence (RP) to the 3D structure (contact matrix) representation of the proteins.

In network language this corresponds to the demonstration of a characteristic size of 12 for network modules, i.e. for portions of the network whose nodes have a much larger number of contacts among themselves than with other portions of the network [45]. To accomplish this task we used a subset of structures that share < 20% identity with each other and have been determined with a resolution < 2 Å, resulting in a total of 1420 proteins that were culled from the PDB by means of the protein-culling server PISCES [46]. All entries contained a single chain. The identification of network structural modules starting from the inter-residue contact matrix of the different proteins was carried out by the algorithm developed by Guimera and Amaral [45]. This algorithm allows for an unbiased definition of what a module is, in terms analogous to the definition of a “well formed cluster” in multidimensional statistics: a module (cluster) is a set of nodes (in our case amino acid residues) maximizing the ratio: Within Cluster links (contacts) / Between Cluster links.

The algorithm converges towards the maximization of the modularity of the analysed network and allows for a representation of the single nodes (residues) in terms of their “intra-module degree”, z, and their “participation coefficient”, P, corresponding to the relative position of a residue well inside the module (high z) or on an inter-module frontier (high P). The algorithm was applied separately to the different proteins of the data set by means of an optimization method based on a Genetic Algorithm. Having subdivided all the 1420 proteins into modules, we performed some statistics on module length.
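Once a partition into modules is available, the two role coordinates can be computed per residue. The sketch below assumes the definitions commonly associated with [45]: z is the within-module degree of a node expressed as a z-score over the nodes of its own module, and P equals one minus the sum over modules of the squared fraction of the node's links falling in each module. The module assignment is simply given as input (the modularity optimization itself is not shown), and the use of the population standard deviation, as well as z = 0 when all within-module degrees coincide, are conventions assumed for this sketch.

```python
import math

def adjacency_from_edges(n, edges):
    """Build a symmetric binary adjacency matrix from a list of node pairs."""
    adj = [[0] * n for _ in range(n)]
    for i, j in edges:
        adj[i][j] = adj[j][i] = 1
    return adj

def role_coordinates(adj, module_of):
    """Return (z, P) lists for a network with a given module assignment."""
    n = len(adj)
    modules = sorted(set(module_of))
    # links[i][s] = number of links from node i to nodes of module s
    links = [{s: 0 for s in modules} for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and adj[i][j]:
                links[i][module_of[j]] += 1
    z_scores, participation = [], []
    for i in range(n):
        own = module_of[i]
        within = [links[j][own] for j in range(n) if module_of[j] == own]
        mean = sum(within) / len(within)
        sd = math.sqrt(sum((w - mean) ** 2 for w in within) / len(within))
        if sd > 0:
            z_scores.append((links[i][own] - mean) / sd)
        else:
            z_scores.append(0.0)  # assumed convention for uniform modules
        k = sum(links[i].values())
        if k > 0:
            participation.append(1.0 - sum((c / k) ** 2 for c in links[i].values()))
        else:
            participation.append(0.0)
    return z_scores, participation

# Hypothetical network: a four-node module with a hub (node 0), a triangle
# module (nodes 4-6), and a single bridge between node 3 and node 4.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (4, 5), (4, 6), (5, 6), (3, 4)]
adj = adjacency_from_edges(7, edges)
module_of = [0, 0, 0, 0, 1, 1, 1]
z, p = role_coordinates(adj, module_of)
```

In this toy case node 0 has the highest z (it is central inside its module and has no outside links, so P = 0), while node 3 has a low z and the highest P because half of its links cross the module boundary.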

Fig. 6, reporting the module length distribution for the entire data set, gives clear-cut proof of our conjecture. The figure shows the distribution of the modules by size across the 1420 different proteins. It can be observed that the peak of this distribution lies around a module size of 12 amino acid residues. For structural modules, too, we observed a basic invariance of size with respect to the length of the proteins they are embedded in.

This network-based representation allows us to characterize the single residues in terms of their topological role in the network by means of the P and z coefficients, i.e., in terms of the relative role of different residues in connecting different modules (P) or their central position inside the module (z).

The representation of the single proteins in the P vs. z space produces graphs with an invariant general shape, as is evident from Fig. 7 and Fig. 8 reporting the P vs. z space for a single protein and for the superposition of the 1420 graphs relative to the entire set respectively.

The invariance of the P vs. z plot for very different protein structures forces us to look for a characterization of the different portions of these graphs in terms of a possible link between the physiological role of residues and their representation in terms of network invariants.

Fig. (6). Distribution of structural modules frequency vs. length according to [45]. The maximal frequency of modules is achieved for a size of 12.

Fig. (7). Single protein (ubiquitin) P vs. z graph (see text for explanation).

Fig. (8). P vs. z graph superposition for the whole set of proteins (see text for explanation).

Looking at Fig. 7, it is evident that P and z tend to be negatively correlated. This is not surprising if we consider the meaning of the two descriptors: a high P points to a role as inter-module connector (and thus, in many cases, to a peripheral position within the residue's own module), while a high z points to a central position of the residue inside its module. For this reason, residues characterized by an extremely high absolute value of the P/z ratio are of specific interest. These residues attain a much stronger “inter-module connecting role” than would be expected from the obvious P vs. z negative correlation. They are the best candidates for the role of the so-called non-hub connectors, indicated by Guimera and Amaral [45] as the most critical nodes in a network. This was actually the case for three analysed model systems, ubiquitin (PDB: 1UBQ), hen lysozyme (PDB: 1E8L) and RNAse A (PDB: 7RSA), where residues with high absolute P/z values were shown to correspond to residues that are protected during the transition state (unpublished results). The relevance of a network description of protein residues for the prediction of different folding features has been recognized by different groups [37, 47, 48, 49]. Of particular interest is the work by Karplus and coworkers [50], which makes explicit use of the aggregation coefficient averaged over all the residues of the protein (clustering coefficient):

$$C = \frac{1}{N} \sum_{k} \frac{2 n_k}{n_k (n_k - 1)} \qquad (3)$$

where N is the number of residues and n_k is the number of links of the k-th residue, together with the graph path length, i.e. the average length of the shortest paths connecting pairs of nodes in the graph.
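A minimal Python sketch of these two invariants is given below for an unweighted contact graph. It relies on an assumption about the intended meaning of the clustering coefficient: the standard local definition is used, in which the numerator counts the links actually present among the neighbours of residue k and the denominator counts the links that could exist among them. Residues with fewer than two neighbours contribute zero, and the path length is averaged over connected pairs only. The function names are illustrative.

```python
from collections import deque
import numpy as np

def clustering_coefficient(adj):
    """Average local clustering coefficient of a 0/1 symmetric adjacency
    matrix with a zero diagonal (no self-contacts)."""
    adj = np.asarray(adj, dtype=int)
    n = adj.shape[0]
    total = 0.0
    for k in range(n):
        nbrs = np.flatnonzero(adj[k])
        n_k = len(nbrs)
        if n_k < 2:
            continue  # fewer than two neighbours: contributes 0
        # links among the neighbours of k (each appears twice in the submatrix)
        e_k = adj[np.ix_(nbrs, nbrs)].sum() / 2
        total += 2.0 * e_k / (n_k * (n_k - 1))
    return total / n

def characteristic_path_length(adj):
    """Mean shortest-path length over all connected ordered pairs of nodes,
    obtained by a breadth-first search started from every node."""
    adj = np.asarray(adj, dtype=int)
    n = adj.shape[0]
    neighbours = [np.flatnonzero(adj[i]) for i in range(n)]
    total, pairs = 0, 0
    for source in range(n):
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in neighbours[u]:
                v = int(v)
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else float("nan")
```

A “small world” graph is one that combines a high clustering coefficient with a short characteristic path length, so the two functions would normally be evaluated together on the same contact matrix.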

These two network invariants were applied to 978 representative proteins from the PDB, revealing a “small world” architecture of proteins (the same depicted in panel B of Fig. 3), with a few vertices working as hubs and in many cases corresponding to the key residues for folding. This is a crucial result, complementing on a much larger scale the previously described finding by Rao and Caflisch [37].

We can summarize the “take-home” message of this partial description of the network-based analysis of the 3D structure of proteins in the following points:

The topology of residue contacts in the 3D structure allows for the detection of modules. These structural modules are reminiscent of the hydrophobic modules detected at the sequence level.

The residues at the frontier of two distinct modules play privileged roles in protein folding.

DYNAMICAL NETWORKS

As remarked in the Introduction, the peculiarity of proteins is their position in between simple and complex systems [1]: this makes proteins one of the most fruitful and intriguing territories for many diverse explorations carried out by scientists with very different backgrounds. Physically inclined scientists are particularly attracted by the field of molecular dynamics simulation, i.e., by the possibility of investigating the motions of proteins by means of computationally intensive approaches. The basic recipe is more or less as follows: start with a known 3D structure defined in terms of the mutual positions of the residues, add some solvent molecules (this is not mandatory but adds realism), include all the known potentials that act on the above elements, define some general boundary conditions (e.g. temperature, ionic strength, pH, in the correct physical formulation) and start the simulation by means of an algorithm that progressively adjusts the three-dimensional coordinates of the protein residues by applying to them, at each discrete step of the simulation, all the known potentials [25, 26, 51, 52].

The basic recipe can obviously be changed in myriads of possible directions to account for many different situations: insertion of a mutation, change of pH, change of solvent, etc.

In any case, the output of the process is a huge amount of numbers: the positions of all the atoms of the system at all time steps. This situation calls for some efficient formalization so as to derive an understandable message from this deluge of information. The most common way to accomplish this task is to collapse all the information into the covariance of the motions of different residues in time, ending in correlation maps similar to the one reproduced in panel a) of Fig. 1. Again we have an N x N symmetric matrix, with rows and columns corresponding to the N residues of the studied protein and elements corresponding to the covariance of the time series of the positions of the residues [25, 26]. The N x N covariance matrix can be considered as a network (as can any symmetric matrix reporting a measure of the relationship between its row and column elements), but in this case the relationships (covariances) are generally considered in terms of their actual quantitative values instead of being dichotomised (presence or absence of an edge), as we observed for sequence and structure networks. In network terms the covariance matrix is a “labelled” graph in which each arc has a value (label) corresponding to the strength of the corresponding relation [28].
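The construction of such an N x N matrix can be sketched in a few lines of Python. The sketch assumes that the trajectory has been reduced to one position per residue (for instance the alpha carbon) and that all frames have already been superimposed on a common reference; the element (i, j) is taken here as the time average of the scalar product of the displacements of residues i and j from their mean positions. This is one common convention, not necessarily the one adopted in [25, 26], and the function names are illustrative.

```python
import numpy as np

def residue_covariance(traj):
    """N x N covariance of residue motions.

    traj : array of shape (T, N, 3) holding the positions of N residues
           in T frames, assumed already superimposed on a reference.
    """
    traj = np.asarray(traj, dtype=float)
    disp = traj - traj.mean(axis=0)      # displacement from the mean position
    n_frames = traj.shape[0]
    # sum over frames (t) and Cartesian components (x), then average over frames
    return np.einsum("tix,tjx->ij", disp, disp) / n_frames

def residue_correlation(cov):
    """Normalize a covariance matrix to correlations in [-1, 1].
    Assumes that every residue moves (non-zero diagonal)."""
    sd = np.sqrt(np.diag(cov))
    return cov / np.outer(sd, sd)
```

The resulting matrix can be read directly as the labelled graph described above: residues are the nodes and each off-diagonal element is the label of the arc joining two residues.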

The most time-honoured method to extract the relevant information from such covariance matrices is Principal Component Analysis (PCA). This is perhaps the most versatile statistical method ever developed, allowing for an appreciation of the studied phenomenon halfway between the “hard sciences” style based on differential equations and the purely post-hoc statistical data analysis typical of biomedical studies [53].

The same algorithm (with minor modifications) takes the names of SVD (Singular Value Decomposition), SSA (Singular Spectrum Analysis), or Karhunen-Loève decomposition, depending upon the scientific discipline that employs it (e.g., physics, climate science, engineering) [54]. Basically the method consists of the extraction of the eigenvalues and corresponding eigenvectors of the N x N covariance matrix in order of explained variance. Thus the first components collect the most important “motions” of the protein system [25]. Given that we are dealing with a covariance matrix, “most important” means “explaining the major portion of coherent dynamics.” That is, the major components describe the coordinated displacements of protein domains. This is the analogue of looking for modules of the network system, where a module is a portion of the molecule whose constituent elements undergo a coherent displacement. Again we are working around the concept of “structure” in its very basic meaning of “optimal dissection of a whole into its parts and the connections between the parts”.
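The extraction of eigenvalues and eigenvectors in order of explained variance can be illustrated with the short Python sketch below. It assumes only that the input is a symmetric covariance matrix; `principal_motions` is an illustrative name, and the NumPy routine `numpy.linalg.eigh` is used because it is designed for symmetric matrices.

```python
import numpy as np

def principal_motions(cov):
    """Eigen-decomposition of a symmetric covariance matrix.

    Returns the eigenvalues in decreasing order, the matching eigenvectors
    (as columns) and the fraction of total variance explained by each.
    """
    cov = np.asarray(cov, dtype=float)
    eigvals, eigvecs = np.linalg.eigh(cov)   # ascending order
    order = np.argsort(eigvals)[::-1]        # re-sort to descending order
    eigvals = eigvals[order]
    eigvecs = eigvecs[:, order]
    explained = eigvals / eigvals.sum()
    return eigvals, eigvecs, explained
```

Residues carrying large loadings of the same sign on a leading eigenvector move coherently along that component, which is the sense in which a principal component identifies a dynamical “module”.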

There is a significant literature on the application of eigenvalue/eigenvector methods to molecular dynamics simulations, to which the reader is directed for further reading [25, 55-57].

Here it is worth noting that the extraction of the eigenvalue/eigenvector spectrum is a classical way to analyse different network systems in order to derive crucial information such as network modularity or stability [28, 58]. Particularly interesting from our point of view is the analysis by means of SVD of the N x N distance matrix between side chains of protein residues in crystal structures, so as to derive protein domains in a mathematically objective way [59].

How can we connect the dynamical perspective of molecules with the sequence and 3D structure views? In [60] the authors analyse the differences in the dynamical behaviour of Aβ40, a peptide of 40 residues involved in the pathogenesis of Alzheimer's disease through the mutual aggregation of different monomers into supra-molecular complexes endowed with neurotoxic activity.

The aggregation process was demonstrated [60] to be mediated by the presence of highly flexible regions along the molecule acting as “aggregation hot spots”. The flexibility of these regions (measured as the RMSD of the residues) is deeply influenced by pH; thus the molecular dynamics simulation was performed at three pH values: low (range 2-4), medium (range 5-6) and neutral (denominated L, M and N respectively), of which M corresponded to the highest flexibility of the molecule. The highest flexibility condition (M) had, as its experimental counterpart, a much higher aggregation rate of the peptide [26].

Figure 9 allows for an immediate appreciation of the link between the more recurrent and deterministic patches along the molecule (as detected by the application of RQA to the Miyazawa-Jernigan coded peptide sequence) and the most mobile patches, expressed by the histogram of the RMSD of the single residues for the three pH conditions (Fig. 9, left panel). This correspondence is made more cogent by the smoothing of both REC and RMSD values along the sequence, as reported in the right panel of Figure 9.
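The two numerical steps involved, a per-residue mobility profile and its smoothing along the sequence, can be sketched as follows. The sketch assumes that the RMSD of each residue is measured with respect to its mean position over the trajectory and that smoothing is a plain unweighted moving average; the reference structure and the window length actually used for Fig. 9 are not stated here, so the default `window=5` is purely hypothetical, as are the function names.

```python
import numpy as np

def per_residue_rmsd(traj):
    """Root-mean-square deviation of each residue from its mean position.

    traj : array of shape (T, N, 3), frames assumed already superimposed.
    Returns an array of N values, one per residue.
    """
    traj = np.asarray(traj, dtype=float)
    disp = traj - traj.mean(axis=0)
    # squared displacement per frame and residue, averaged over frames
    return np.sqrt((disp ** 2).sum(axis=2).mean(axis=0))

def moving_average(values, window=5):
    """Unweighted moving average along the sequence.

    Only fully covered windows are returned, so the output has
    len(values) - window + 1 points.
    """
    values = np.asarray(values, dtype=float)
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode="valid")
```

Applying the same `moving_average` to the recurrence profile and to the RMSD profile puts the two curves on the same sequence scale, which is what makes their visual comparison meaningful.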

Fig. (9). On the left, the RP of the Aβ1-40 peptide is reported together with the RMSD of each residue in the three experimental conditions; on the right, the same situation is depicted after applying a moving average (smoothing) procedure to the same data.

This case story closes the circle initiated by the discovery, in the data set of 1141 eukaryotic proteins, of a statistical correlation between the presence of repetitive patches along the sequence and both the features of being natively unfolded (another name for high flexibility) and of having many protein-protein interactions (the general case of protein aggregation). On the other hand, the folding process is not qualitatively different from protein aggregation, the only difference being the intra-molecular (folding) as opposed to the inter-molecular (aggregation) character of the interactions [61, 62]. The “folding” side of the coin is represented by the resemblance, discussed above, between 3D structural modules (putative folding units) and hydrophobic “words” along the sequence.

The consideration of “dynamical networks” in the form of links (covariances) between the trajectories of different residues (nodes) of a protein system completes the sequence-structure-dynamics path we proposed as the main route for exploring the use of graph theory based approaches in protein science. The study of the molecular dynamics simulation of Aβ40 highlights the dynamical counterpart of protein repetitive patches (modules of the sequence-based network representation) as the most mobile and aggregation-prone portions of the sequence, thus indicating a strong link between the different sequence-structure-dynamics layers of protein description, a link that was discovered by considering proteins as network systems.

CONCLUSIONS

The above results indicate the feasibility and usefulness of a network formalization of proteins at different levels of definition. This formalization allows subtle sequence/structure/dynamics relationships to be clearly highlighted by the use of a common mathematical formalism derived from the consideration of a protein molecule as a network.

The network formalization allows for the projection of the sequence, structure and dynamical features of protein molecules onto three spaces of identical dimensionality, having as basic common elements the single residues that are the natural components of proteins. This allows us to approach the studied systems in an unbiased way, avoiding unjustified or arbitrary assumptions.

The correspondence between network structures and adjacency (or covariance) matrices allows for an immediate translation of classical mathematical tools used for dealing with covariance and correlation matrices (cluster techniques, spectral decomposition methods) into graph-based formalisms (topological invariants, connectivity descriptors). This allows both for a cross-fertilization of different fields of investigation, ranging from systems biology to metabolic network analysis to structural biology, and for a direct translation of potentially useful results from statistical mechanics (based on network structures) into the biological and chemical sciences. Vishveshwara and coworkers have utilized a graph theoretical representation of protein structures along with spectral analysis techniques to study such diverse aspects as the identification of backbone and side-chain clusters [63, 64], the determination of quaternary association [65, 66], the identification of domains [67] and the analysis of the stability properties of proteins [68].

Moreover, a very recent finding by the Nussinov group [69], which appeared in the literature while this review was in progress, demonstrated, by means of a decomposition of protein contact matrices into modules equivalent to the one described in this paper, that the inter-modular boundaries contain not only the most conserved residues of proteins but also the ones most crucial for allosteric communication. This “signalling” role of inter-module residues is an important result of general network theory that had already been demonstrated in other biological networks, such as gene expression networks [70]. The confirmation of this “organizational principle” of network architecture in the case of protein topology could be one of the very few “general theoretical principles” holding for biological matter.

ACKNOWLEDGEMENTS

This work was supported by a joint DMS/DGMS initiative to support mathematical biology, from the NSF and NIH (NSF DMS #0240230), to JPZ.

REFERENCES

[1] Frauenfelder, H. and Wolynes, P. (1994) Phys. Today, 47, 58.
[2] Karplus, M. and Weaver, D.L. (1994) Protein Sci., 3, 650.
[3] Karplus, M. and Sali, A. (1995) Curr. Op. Struct. Biol., 5, 58.
[4] Dill, K.A., Bromberg, S., Yue, K., Fiebig, K.M. and Yee, D.P. (1995) Protein Sci., 4, 561.
[5] Bryngelson, J.D. and Wolynes, P.G. (1987) Proc. Natl. Acad. Sci. USA, 84, 7524.
[6] Levitt, M. and Warshel, A. (1975) Nature, 253, 693.
[7] Skolnick, J. and Kolinski, A. (1992) Science, 250, 1121.
[8] Godzik, A., Skolnick, J. and Kolinski, A. (1992) Proc. Natl. Acad. Sci. USA, 89, 2629.
[9] Kolinski, A. and Skolnick, J. (1994) Proteins Struct. Funct. Genet., 18, 338.
[10] Baldwin, R.L. (1975) Ann. Rev. Biochem., 44, 453.
[11] Kim, P.S. and Baldwin, R.L. (1982) Ann. Rev. Biochem., 51, 459.
[12] Yon, J.M. (2002) J. Cell. Mol. Med., 6, 307.
[13] Zimm, B.H. and Bragg, J.K. (1959) J. Chem. Phys., 31, 526.
[14] Lifson, S. and Roig, A. (1961) J. Chem. Phys., 34, 1963.
[15] Jaenicke, R. (1987) Prog. Biophys. Mol. Biol., 49, 117.
[16] Wetlaufer, D.B. (1973) Proc. Natl. Acad. Sci. USA, 70, 697.
[17] Wolynes, P.G., Onuchic, J.N. and Thirumalai, D. (1995) Science, 267, 1618.
[18] Giuliani, A., Benigni, R., Zbilut, J.P., Webber, C.L., Jr., Sirabella, P. and Colosimo, A. (2002) Chem. Rev., 102, 1471.
[19] Romero, P., Obradovic, Z., Li, X., Garner, E.C., Brown, C.J. and Dunker, A.K. (2001) Proteins Struct. Funct. Genet., 42, 38.
[20] Uversky, V.N. (2002) Protein Sci., 11, 739.
[21] Abkevich, V.I., Gutin, A.M. and Shakhnovich, E.I. (1998) Proteins Struct. Funct. Genet., 31, 335.
[22] Yates, F.E. and Kugler, P.N. (1986) J. Pharm. Sci., 75, 1019.
[23] Tenenbaum, J.B., De Silva, V.D. and Langford, J.C. (2000) Science, 290, 2319.
[24] Webber, C.L., Giuliani, A., Zbilut, J.P. and Colosimo, A. (2001) Proteins Struct. Funct. Genet., 44, 292.
[25] Amadei, A., Linssen, A.B.M. and Berendsen, H.J.C. (1993) Proteins Struct. Funct. Genet., 4, 412.
[26] Valerio, M.C., Colosimo, A., Conti, F., Giuliani, A., Grottesi, A., Manetti, C. and Zbilut, J.P. (2005) Proteins Struct. Funct. Bioinf., 58, 110.
[27] Colafranceschi, M., Colosimo, A., Zbilut, J.P., Uversky, V.N. and Giuliani, A. (2005) J. Chem. Inf. Model., 45, 183.
[28] Palumbo, M.C., Farina, L., Colosimo, A., Tun, K., Dhar, P.K. and Giuliani, A. (2006) Curr. Bioinf., 2, 219.
[29] Smolen, P., Baxter, D.A. and Byrne, G. (2000) Bull. Math. Biol., 62, 247.
[30] Gardner, T.S. and Faith, J.J. (2005) Phys. Life Rev., 2, 65.
[31] Fiehn, O. and Weckwerth, W. (2003) Eur. J. Biochem., 270, 579.
[32] Stelling, J., Klamt, S., Bettenbrock, K., Schuster, S. and Gilles, E.D. (2002) Nature, 420, 190.
[33] Lassig, M., Bastolla, U., Manrubia, S.C. and Valleriani, A. (2001) Phys. Rev. Lett., 86, 4418.
[34] Sauro, M. and Kholodenko, B.N. (2004) Prog. Biophys. Mol. Biol., 86, 5.
[35] Lukovits, I. (2000) J. Chem. Inf. Comput. Sci., 40, 1147.
[36] McMahon, S.M., Miller, K.H. and Drake, J. (2001) Science, 293, 1604.
[37] Rao, F. and Caflisch, A. (2004) J. Mol. Biol., 342, 299.
[38] Amaral, L.A.N., Scala, A., Barthelemy, M. and Stanley, H.E. (2000) Proc. Natl. Acad. Sci. USA, 97, 1149.
[39] Barabasi, A.L. and Albert, R. (1999) Science, 286, 509.
[40] Rackovsky, S. (1998) Proc. Natl. Acad. Sci. USA, 95, 8580.
[41] Zbilut, J.P., Chua, G.H., Krishnan, A., Bossa, C., Colafranceschi, M. and Giuliani, A. (2006) FEBS Lett., 580, 4861.
[42] Schwartz, R. and King, J. (2006) Protein Sci., 15, 102.
[43] Lua, R.C. and Grosberg, A.Y. (2006) PLoS Comput. Biol., 2, 350.
[44] Compiani, M., Fariselli, P., Martelli, P.L. and Casadio, R. (1998) Proc. Natl. Acad. Sci. USA, 95, 9290.
[45] Guimera, R. and Amaral, L.A.N. (2005) Nature, 433, 895.
[46] Wang, G. and Dunbrack, R.L.J. (2003) Bioinf., 19, 1589.
[47] Bagler, G. and Sinha, S. (2005) Physica A, 346, 27.
[48] Kundu, S. (2005) Physica A, 346, 104.
[49] Higman, V.A. and Greene, L.H. (2006) Physica A, 368, 595.
[50] Vendruscolo, M., Dokholyan, N.V., Paci, E. and Karplus, M. (2002) Phys. Rev. E, 65, 061910.
[51] Car, R. and Parrinello, M. (1985) Phys. Rev. Lett., 55, 2471.
[52] Berendsen, H.J.C., Postma, J.P.M., Van Gunsteren, W.F. and Di Nola, A. (1984) J. Chem. Phys., 81, 3684.
[53] Benigni, R. and Giuliani, A. (1994) Am. J. Physiol., 266, R1697.
[54] Preisendorfer, R.W. and Mobley, C.D. (1988) Principal Component Analysis in Meteorology and Oceanography, Elsevier, Amsterdam.
[55] Arcangeli, C., Bizzarri, A.R. and Cannistraro, S. (2001) Biophys. Chem., 90, 45.
[56] Chillemi, G., Falconi, M., Amadei, A., Zimatore, G., Desideri, A. and Di Nola, A. (1997) Biophys. J., 73, 1007.
[57] Peters, G.H., van Aalten, D.M., Edholm, O., Toxvaerd, S. and Bywater, R. (1996) Biophys. J., 71, 2245.
[58] Goh, K.I., Kahng, B. and Kim, D. (2001) Phys. Rev. E, 64, 051903.
[59] Kannan, N. and Vishveshwara, S. (1999) J. Mol. Biol., 292, 441.
[60] Zbilut, J.P., Colosimo, A., Conti, F., Colafranceschi, M., Manetti, C., Valerio, M.C., Webber, C.L. and Giuliani, A. (2003) Biophys. J., 85, 3544.
[61] Zbilut, J.P., Giuliani, A., Colosimo, A., Mitchell, J.C., Colafranceschi, M., Marwan, N., Webber, C.L. and Uversky, V. (2004) J. Proteome Res., 3, 1243.
[62] Chiti, F., Taddei, N., Baroni, F., Capanni, C., Stefani, M., Ramponi, G. and Dobson, C.M. (2002) Nature Struct. Biol., 9, 137.
[63] Kannan, N. and Vishveshwara, S. (1999) J. Mol. Biol., 292, 441-464.
[64] Patra, S.M. and Vishveshwara, S. (2000) J. Theor. Comp. Chem., 84, 13-25.
[65] Brinda, K.V., Mitra, N., Surolia, A. and Vishveshwara, S. (2004) Protein Sci., 13, 1735-49.
[66] Brinda, K.V., Surolia, A. and Vishveshwara, S. (2005) Biochem. J., 391, 1-15.
[67] Sistla, R.K., Brinda, K.V. and Vishveshwara, S. (2005) Proteins Struct. Funct. Bioinf., 59, 616-626.
[68] Brinda, K.V. and Vishveshwara, S. (2005) Biophys. J., 89, 4159-4170.
[69] Del Sol, A., Arauzo-Bravo, M.J., Amoros Moya, D. and Nussinov, R. (2007) Genome Biol., 8, R92.
[70] Yu, H., Kim, P.M., Sprecher, E., Trifonov, V. and Gerstein, M. (2007) PLoS Comput. Biol., 3, e59.

Received: December 04, 2006 Revised: September 19, 2007 Accepted: September 19, 2007

