10
Computational Biology and Chemistry 59 (2015) 32–41 Contents lists available at ScienceDirect Computational Biology and Chemistry journal homepage: www.elsevier.com/locate/compbiolchem Research Article Core and peripheral connectivity based cluster analysis over PPI network Hasin A. Ahmed a , Dhruba K. Bhattacharyya a,, Jugal K. Kalita b a Tezpur University, Assam, India b University of Colorado, Colorado Springs, USA article info Article history: Received 7 April 2015 Received in revised form 31 July 2015 Accepted 18 August 2015 Available online 24 August 2015 Keywords: Protein–protein interaction network Biological network Clustering Protein complex abstract A number of methods have been proposed in the literature of protein–protein interaction (PPI) network analysis for detection of clusters in the network. Clusters are identified by these methods using various graph theoretic criteria. Most of these methods have been found time consuming due to involvement of preprocessing and post processing tasks. In addition, they do not achieve high precision and recall con- sistently and simultaneously. Moreover, the existing methods do not employ the idea of core-periphery structural pattern of protein complexes effectively to extract clusters. In this paper, we introduce a clus- tering method named CPCA based on a recent observation by researchers that a protein complex in a PPI network is arranged as a relatively dense core region and additional proteins weakly connected to the core. CPCA uses two connectivity criterion functions to identify core and peripheral regions of the cluster. To locate initial node of a cluster we introduce a measure called DNQ (Degree based Neighborhood Qualification) index that evaluates tendency of the node to be part of a cluster. CPCA performs well when compared with well-known counterparts. Along with protein complex gold standards, a co-localization dataset has also been used for validation of the results. © 2015 Elsevier Ltd. All rights reserved. 1. Introduction A protein, which is a very important biomolecule with one or multiple chains of amino acid residues, performs a large number of biological functions in a living organism (Nelson et al., 2008). A protein complex, also called a multi protein complex, is a group of protein molecules, i.e., polypeptide chains attached through non-covalent protein–protein interactions. A protein complex is responsible for a single or multiple biological functions in biological processes (Hartwell et al., 1999). Proteins bind with each other as a result of biochemical phenomena. There are a number of methods to detect the amount of interaction that exists between a pair of proteins including yeast two-hybrid screening and affinity purification coupled with mass spectrometry (Phizicky and Fields, 1995). Yeast two-hybrid screening also known as the Y2H method, detects interaction between a pair of proteins by attaching a protein to the DNA binding domain of a yeast transcription factor such as Gal4 and attaching the other protein to the activation domain of the transcription factor (Von Mering et al., 2002). The protein attached to the DNA bind- ing domain is called the bait protein and the protein attached to Corresponding author. the activation domain is called prey protein. Interaction between the proteins brings the binding domain and the activation domain closer, leading to expression of the associated gene. The presence of the gene product implies interaction between the pair of proteins. A second method extracts protein complexes with a target protein from a sample using affinity purification (Von Mering et al., 2002). Finally, mass spectrometry is used to identify the proteins under study and measure the amount of interaction. A protein–protein interaction network (PPIN) is a network or graph of nodes corre- sponding to proteins, with edges among the nodes corresponding to interaction among the proteins (Rual et al., 2005). A formal def- inition of PPIN is presented next. Definition 1. Protein–protein interaction network: A PPI net- work is defined as a graph G =(V, E) where V is the set of vertices corresponding to proteins and E is the set of edges. An edge is an unordered pair of vertices , (v i , v j )where v i , v j V , representing an interaction between the proteins corresponding to the vertices v i and v j . There are a number of public databases that collect pub- lished information about protein–protein interactions and prepare curated datasets (Lehne and Schlitt, 2009). Some such widely used databases are Biomolecular Interaction Network Database (BIND) (Bader et al., 2003), Munich Information Center for http://dx.doi.org/10.1016/j.compbiolchem.2015.08.008 1476-9271/© 2015 Elsevier Ltd. All rights reserved.

Core and peripheral connectivity based cluster analysis ...jkalita/papers/2015/AhmedHasinCBC2015.pdf · Core and peripheral connectivity based cluster analysis over PPI network Hasin

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Core and peripheral connectivity based cluster analysis ...jkalita/papers/2015/AhmedHasinCBC2015.pdf · Core and peripheral connectivity based cluster analysis over PPI network Hasin

Computational Biology and Chemistry 59 (2015) 32–41

Contents lists available at ScienceDirect

Computational Biology and Chemistry

journa l homepage: www.e lsev ier .com/ locate /compbio lchem

Research Article

Core and peripheral connectivity based cluster analysis over PPInetwork

Hasin A. Ahmeda, Dhruba K. Bhattacharyyaa,∗, Jugal K. Kalitab

a Tezpur University, Assam, Indiab University of Colorado, Colorado Springs, USA

a r t i c l e i n f o

Article history:Received 7 April 2015Received in revised form 31 July 2015Accepted 18 August 2015Available online 24 August 2015

Keywords:Protein–protein interaction networkBiological networkClusteringProtein complex

a b s t r a c t

A number of methods have been proposed in the literature of protein–protein interaction (PPI) networkanalysis for detection of clusters in the network. Clusters are identified by these methods using variousgraph theoretic criteria. Most of these methods have been found time consuming due to involvement ofpreprocessing and post processing tasks. In addition, they do not achieve high precision and recall con-sistently and simultaneously. Moreover, the existing methods do not employ the idea of core-peripherystructural pattern of protein complexes effectively to extract clusters. In this paper, we introduce a clus-tering method named CPCA based on a recent observation by researchers that a protein complex in aPPI network is arranged as a relatively dense core region and additional proteins weakly connected tothe core. CPCA uses two connectivity criterion functions to identify core and peripheral regions of thecluster. To locate initial node of a cluster we introduce a measure called DNQ (Degree based NeighborhoodQualification) index that evaluates tendency of the node to be part of a cluster. CPCA performs well whencompared with well-known counterparts. Along with protein complex gold standards, a co-localizationdataset has also been used for validation of the results.

© 2015 Elsevier Ltd. All rights reserved.

1. Introduction

A protein, which is a very important biomolecule with one ormultiple chains of amino acid residues, performs a large numberof biological functions in a living organism (Nelson et al., 2008).A protein complex, also called a multi protein complex, is a groupof protein molecules, i.e., polypeptide chains attached throughnon-covalent protein–protein interactions. A protein complex isresponsible for a single or multiple biological functions in biologicalprocesses (Hartwell et al., 1999).

Proteins bind with each other as a result of biochemicalphenomena. There are a number of methods to detect the amountof interaction that exists between a pair of proteins includingyeast two-hybrid screening and affinity purification coupled withmass spectrometry (Phizicky and Fields, 1995). Yeast two-hybridscreening also known as the Y2H method, detects interactionbetween a pair of proteins by attaching a protein to the DNA bindingdomain of a yeast transcription factor such as Gal4 and attaching theother protein to the activation domain of the transcription factor(Von Mering et al., 2002). The protein attached to the DNA bind-ing domain is called the bait protein and the protein attached to

∗ Corresponding author.

the activation domain is called prey protein. Interaction betweenthe proteins brings the binding domain and the activation domaincloser, leading to expression of the associated gene. The presence ofthe gene product implies interaction between the pair of proteins.A second method extracts protein complexes with a target proteinfrom a sample using affinity purification (Von Mering et al., 2002).Finally, mass spectrometry is used to identify the proteins understudy and measure the amount of interaction. A protein–proteininteraction network (PPIN) is a network or graph of nodes corre-sponding to proteins, with edges among the nodes correspondingto interaction among the proteins (Rual et al., 2005). A formal def-inition of PPIN is presented next.

Definition 1. Protein–protein interaction network: A PPI net-work is defined as a graph G = (V, E) where V is the set of verticescorresponding to proteins and E is the set of edges. An edge is anunordered pair of vertices , (vi, vj)where vi, vj ∈ V , representing aninteraction between the proteins corresponding to the vertices viand vj .

There are a number of public databases that collect pub-lished information about protein–protein interactions and preparecurated datasets (Lehne and Schlitt, 2009). Some such widelyused databases are Biomolecular Interaction Network Database(BIND) (Bader et al., 2003), Munich Information Center for

http://dx.doi.org/10.1016/j.compbiolchem.2015.08.0081476-9271/© 2015 Elsevier Ltd. All rights reserved.

Page 2: Core and peripheral connectivity based cluster analysis ...jkalita/papers/2015/AhmedHasinCBC2015.pdf · Core and peripheral connectivity based cluster analysis over PPI network Hasin

H.A. Ahmed et al. / Computational Biology and Chemistry 59 (2015) 32–41 33

Protein Sequences (MIPS) database (Mewes et al., 2004), BioGRID(Stark et al., 2006), Database of Interacting Proteins (DIP) (Xenarioset al., 2000), IntAct (Hermjakob et al., 2004), Molecular INTeractiondatabase (MINT) (Zanzoni et al., 2002), Human Protein ReferenceDatabase (HPRD) (Peri et al., 2003), YPD (Costanzo et al., 2001) andFuncoup (Alexeyenko et al., 2012). Some of the widely used sourcesof protein–protein interaction data for clustering are datasets ofKrogan et al. (2006), Gavin et al. (2002), Collins et al. (2007), Hoet al. (2002), Ito et al. (2001), Uetz et al. (2000), Giot et al. (2003)and Li et al. (2004).

Along with numerous PPI network repositories, there are a fewdatabases which store records of protein complexes such as MIPS(Mewes et al., 2004), SGD (Hong et al., 2008) and CYC2008 (Puet al., 2009). Computer scientists have always been assisting biolo-gists by developing mathematical models that act behind biologicalphenomena. The current state of PPI and protein complex researchin terms of experimental findings and well-managed repositories ofsuch findings present the computer scientist with a problem thatseeks to determine mathematical relationships that exist amongproteins of the same complex in a PPI network. Formally, this prob-lem can be defined as follows.

Problem Definition: Suppose a set of protein complexes is rep-resented by P, where each complex Pi ∈ P is a set of proteins. Theproblem involves designing an algorithm that accepts a PPI net-work G = (V, E) and produces a set of clusters C where each clusterCi ∈ C is a subset of V i.e., Ci ⊆ V using some graph theoretic criteriaapplied on G such that similarity between P and C is maximized.

Existing methods of clustering PPI network are reported in Sec-tion 2. Section 3 presents motivation behind the work. Section4 presents the proposed method. Experimental evaluation of themethod is presented in Section 5 and finally, Section 6 concludesthe paper.

2. Related work

A number of techniques have been proposed to extract clus-ters in a PPI network. COACH (Wu et al., 2009) exploits thecore-periphery structure of protein complexes to build clusters.It prepares protein complex cores from neighborhood graphs ofthe vertices. These cores are further extended to form completeclusters. COACH analyses the neighborhood of each vertex to formcluster cores using the concepts of neighborhood graph and corevertices. Along with the process of core creation, it ensures thatdensity of the core is more than a user defined threshold and thecores are not highly overlapped. In the second phase, more ver-tices are included in the cores if connectivity of a node is foundto be more than 0.5. This connectivity is measured as the ratio ofnumber of edges between the node and the core complex to thetotal number of nodes in the core. RFC (Wu et al., 2014) preparesa weighted network using information about neighborhood of thenodes. This network is transformed into an unweighted form usingfuzzy analysis. Then rough set theory is used to prepare lower andupper approximations of the clusters with the property that pro-teins in the boundary region may be non-exclusive and these maybelong to more than one clusters. Finally, clusters with high overlapare merged.

CORE (Leung et al., 2009) uses the core-periphery structure ofprotein complexes to extract clusters from a PPI network. It usesprobability based P-values to determine whether a pair of verticesbelongs to core region of a cluster based on direct connectivityand their common neighborhood. They progressively merge pro-tein pairs with the lowest P-value into core regions of clusters.Once core clusters are formed, the peripheral region is detectedby including proteins that are common neighbors of at least half ofthe core proteins. MCODE (Bader and Hogue, 2003) detects clusters

in a PPI network in three stages: vertex weighting, complex detec-tion and post processing of clusters. In the first stage, vertices areassigned weights using the density of a highest k-core subgraph intheir neighborhood. In the second stage, to form a cluster, it startswith the node with the highest weight and extends it by includ-ing vertices into the cluster with weight more than a user giventhreshold. In the post processing stage, clusters without two coresare removed.

CFinder (Adamcsek et al., 2006) is a tool to cluster nodes of a bio-logical network. It uses the concept of k-clique to locate clusters.Two k-cliques are considered adjacent if they share k − 1 nodes.A pair of nodes are put in the same cluster if they are reachablefrom each other through adjacent k-cliques. CMC (Liu et al., 2009)uses an iterative scoring method to assign weights to edges. Theweight between a pair of nodes is calculated based on their com-mon neighbors. The method then finds maximal cliques, whichare finally removed or merged based on their connectivity. MCL(Dongen, 2000) is a flow analysis technique used to cluster PPInetworks. The technique uses random walks to find regions withstrong flow in the graph. The technique converges towards a set ofregions with high flow that correspond to the clusters.

RNSC (King et al., 2004) partitions vertices in the PPI net-work using a cost function. It moves the vertices randomly amongthe clusters to optimize the cost function. Post processing of thepartitions is carried out using three criteria: cluster size, densityand functional homogeneity. MGClus (Frings et al., 2013) uses anagglomerative approach to combine clusters starting from single-ton clusters consisting of individual proteins. To merge clustersthey use a measure named Merge Gain which takes the numberof interior and exterior edges into account. MGClus merges twoclusters if Merge Gain for the pair of clusters is more than a userdefined threshold. GMFTP (Zhang et al., 2014) uses an affinity basedapproach that supports overlap among clusters. Affinity score ofa protein against clusters is computed using both functional andtopological similarity. Gene Ontology is used to compute functionalsimilarity while protein interaction network is used to computetopological similarity. DPC (Li et al., 2014) detects dynamic proteincomplexes using PPI data and gene expression data. Time seriesgene expression data is used to find proteins which are active overtime points to add dynamic characteristic of generated clusters.First, DPC detects highly connected core regions assisted by timeseries gene expression data. These cores are then extended to com-plete clusters.

A detailed review of techniques for detecting complexes in PPInetworks can be found in a recent survey by Ji et al. (2014). Table 1displays some widely known techniques for complex detectionalong with their distinguishing features. An issue that arises at thetime of validation of these techniques is how many common pro-teins a cluster must have to match a protein complex. This matchinginformation is then used to compute evaluation measures such asprecision, recall and sensitivity. Most techniques are found to adaptan overlap score used by Bader and Hogue (2003). If ˛ is the setof proteins in a cluster clus and ˇ is the set of proteins in a com-plex comp, the overlap score between clus and comp is computedas follows:

Overlap score(clus, comp) = |˛ ∩ ˇ|2

|˛| × |ˇ|(1)

If this overlap score is more than a threshold, the cluster isassumed to match the complex. Some work in the literature usesother schemes for comparison such as the Jaccard index. The Jac-card index between a cluster clus and a complex comp is computedas follows:

Jaccard index(clus, comp) = |˛ ∩ ˇ||˛ ∪ ˇ|

(2)

Page 3: Core and peripheral connectivity based cluster analysis ...jkalita/papers/2015/AhmedHasinCBC2015.pdf · Core and peripheral connectivity based cluster analysis over PPI network Hasin

34H

.A.Ahmed

etal./ComputationalBiology

andChem

istry59

(2015)32–41

Table 1Some well known PPI clustering techniques.

Technique, year Basic working #Inputparameters

Pre processdata

Post processresult

Datasets used Gold standards Matchingscheme

Threshold Performancemeasures

Executablegiven

COACH (Wu et al.,2009), 2009

Detects core regionand then extendsthese to completeclusters

2√ × (post

processing ofcores isembedded)

DIP, Krogan Friedel Bader 0.2 p value,Coannotation sore,Colocalizationscore, F measure,Coverage rate

RFC (Wu et al., 2014),2014

Uses a fuzzy-roughapproach forclustering proteins

2√ √

Gavin, Collins,Krogan,BioGRID

MIPS, SGD Bader 0.25 Precision,F-measure,Sensitivity,Accuracy,Separation

×

CORE (Leung et al.,2009), 2009

Uses probabilityapproach to detectcore and peripheralregions of clusters

0 × √DIP, Krogan,Gavin

MIPS Bader 0.6, 0.7, 0.8 No. of complexesdetected

×

MCODE (Bader andHogue, 2003), 2003

Combinesneighboringweighted verticesbased on athreshold

1 × (embedded) × (embedded) Gavin, MIPS,YPD

Gavin, MIPS Bader 0.2 No. of complexesdetected,Sensitivity,Specificity

CFinder (Adamcseket al., 2006), 2006

Tool to clusterbiological networksusing k-clique

1 × × N/A N/A N/A N/A N/A√

CMC (Liu et al., 2009),2009

Finds maximalcliques as whenfinding clusters in aweighted network

2√ × Ho, Gavin,

Krogan, Uetz,Ito

MIPS, Aloy Jaccard index 0.5 Precision, Recall,Co-localizationscore

×

MCL* (Dongen, 2000),2000

Uses random walkto find regions inthe graph withhigh flow

1 × × Uetz, Ito, Gavin,Ho, Krogan

MIPS N/A N/A Sensitivity, PPV,Accuracy (Differentfrom above)

N/A

RNSC (King et al.,2004), 2004

Randomly transfersproteins amongclusters tooptimize costfunction

0 (3 in postprocessing)

× √Przulj, Giot, Li MIPS King N/A Matching rate

√(upon

request)

MGClus (Frings et al.,2013), 2013

Clusters aremerged in anagglomerativemanner to formfinal clusters

1√ × BioGRID,

FunCoupCYC2008 N/A N/A Average Jackard

Index

* Original work was not meant to be applied on PPI clustering. So data presented are based on the review by Brohee and van Helden (2006).

Page 4: Core and peripheral connectivity based cluster analysis ...jkalita/papers/2015/AhmedHasinCBC2015.pdf · Core and peripheral connectivity based cluster analysis over PPI network Hasin

H.A. Ahmed et al. / Computational Biology and Chemistry 59 (2015) 32–41 35

Various matching schemes along with the thresholds used arereported in the columns named Matching scheme and Threshold,respectively, in Table 1.

3. Motivation

Many methods have been proposed in the literature addressingthe problem of clustering PPI network. These techniques mainly dif-fer in the criteria used to extract clusters from the PPI network. Thefollowing are some structural characteristics of overlapped proteincomplexes in PPI networks as observed by researchers.

• Protein complexes in a PPI network correspond to relativelydense regions (Tong et al., 2002; Bader and Hogue, 2003; Spirinand Mirny, 2003; Vanunu et al., 2010).

• A protein complex typically has two regions, viz., core and periph-eral (Gavin et al., 2006; Wu et al., 2009; Wang et al., 2010; Luoet al., 2009). The core part is a highly dense central region whereproteins are strongly connected with each other and the periph-eral region is that part of the complex where proteins are weaklyconnected with the core.

Most existing methods for PPI clustering do not use the sec-ond fact in the detection of clusters. Wu et al. (2009) proposeda method named COACH that leverages this structural propertyof protein complexes in clustering PPI nodes. The core vertex (asdefined in Wu et al. (2009)) dominates the density value of the coreregion of the cluster being detected and as a result, other elementsmay be left weakly connected when density of the core subgraphis compared with the density threshold. Another issue in COACH isway in which the core is extended to complete cluster by addingthe peripheral entry. In this phase, the expectation that peripheralproteins have 50% neighborhood overlap with the core seems quiteoptimistic. Here, we propose a technique for extracting clusters ina PPI network that takes into account the core-periphery structureof the corresponding protein complexes in a more comprehensivemanner.

4. Proposed method

Though a number of clustering methods have been proposedfor PPI network analysis, most are not consistent in generating highprecision and recall when tested with existing PPI datasets. Anotherissue is the computational expense imposed by these methods. Wepropose a method for detection of clusters in PPI network using agraph density approach. Our approach called CPCA1 is influencedby the recent observation that a complex consists of a core denseregion with some proteins weakly connected to the dense region,often called periphery (Della Rossa et al., 2013; Wu et al., 2009;Wang et al., 2010). While extracting a cluster, CPCA detects a coredense region, followed by detection of a weakly connected periph-eral region.

CPCA uses two connectivity criterion functions to identify coreand peripheral regions of the clusters viz., core connectivity andperipheral connectivity. To locate a dense vertex in the PPI network,we introduce a measure named DNQ index that computes the ten-dency of the vertex to reside at a dense region. Formal definitionsof DNQ index and the connectivity measures are presented next.

1 CPCA stands for core and peripheral connectivity based Cluster Analysis.

Definition 2. DNQ (Degree based Neighborhood Qualification)index: DNQ index of a vertex vi in a PPI network G is defined asfollows.

DNQInd(vi) = |V ′||neighbors(vi)|

(3)

Here, neighbors(vi) is the set of neighbors of vertex vi in G, and V ′ ={v : v ∈ neighbors(vi), degree(v) ≥ degree(vi)}, where degree(v) is thedegree of vertexv in G.

Definition 3. Core connectivity: In a PPI network G = (V, E), coreconnectivity between a vertex vi and a partial cluster ! where ! ⊆V − vi is defined as follows.

CoreCon(vi, !) = |!′||!|

(4)

Here, !′ = {v : v ∈ !, (v, vi) ∈ E}.

Definition 4. Peripheral connectivity: In a PPI network G = (V, E),peripheral connectivity between a vertex vi and a partial cluster !where ! ⊆ V − vi is defined as follows.

PerCon(vi, !) = |!′||neighbors(vi)|

(5)

Here, !′ = {v : v ∈ !, (v, vi) ∈ E}, and neighbors(vi) is the set of neigh-boring vertices of vi in G.

The intuition behind using core connectivity and peripheralconnectivity is guided by the structural difference between thetwo regions of a protein complex, namely core and periphery orattachment. The core region of a protein complex is denser and thenodes are highly connected among themselves, while the periph-eral region of a complex is weakly connected to the core. To reflectthis notion in our technique, we use core connectivity to includeproteins which are highly connected to the rest of the nodes. Thisfact is supported by Eq. (4) where core connectivity is computed asthe ratio of the number of edges between the node and the partialcluster to the total number of nodes in the partial cluster. To enablethe technique to include weakly connected nodes in the cluster, wetrace the proportion of neighbors of the node that are connected tothe partial cluster as presented in Eq. (5) instead of checking theproportion of nodes in the partial cluster the node is connected to.

Considering the entire set of vertices as the set of remainingnodes, CPCA starts finding a node with the highest DNQ index.This node is put in a partial cluster and additional nodes withhighest core connectivity are added to the partial cluster later.Inclusion of the nodes is driven by a user defined threshold calledNC (neighborhood and connectivity) threshold ("). This phase ofpartial cluster extension is exclusive, i.e., no node which has alreadybeen included in a cluster is chosen again for inclusion. We call thisphase of cluster detection exclusive core formation. Once this exclu-sive core is formed, additional nodes are inserted into the clusterin a non-exclusive manner. This phase of cluster detection is callednon-exclusive core formation. The only difference between exclu-sive core formation and non-exclusive core formation is that the laterphase can pick up nodes even from clusters which have alreadybeen formed. Finally, the peripheral connectivity measure is usedto include additional nodes into the partial cluster. This phase ofcluster detection is named non-exclusive periphery formation. Non-exclusive core formation and non-exclusive periphery formation arealso driven by the threshold ". This cluster detection process itera-tively takes place until there is no remaining node with DNQ indexgreater than or equal to ". A logical view of the CPCA algorithmis given in Fig. 1. If no node is included into the partial clusterduring non-exclusive core formation phase, CPCA does not moveto other two phases and drops the partial cluster. The complexityof the algorithm is O(n2). The exclusive core formation phase ofthe algorithm does not explore next candidate from all the nodes,

Page 5: Core and peripheral connectivity based cluster analysis ...jkalita/papers/2015/AhmedHasinCBC2015.pdf · Core and peripheral connectivity based cluster analysis over PPI network Hasin

36 H.A. Ahmed et al. / Computational Biology and Chemistry 59 (2015) 32–41

Fig. 1. Flowchart of CPCA algorithm.

rather it searches through only the nodes which are not included inany cluster. So amortized complexity of the algorithm is less thanO(n2). Following are two theorems on the CPCA method that ensureproper formation of core and peripheral regions in a cluster.

Theorem 1. Core extension phase of CPCA involves addition of nodeswhich are strongly connected to the partial cluster formed so far withrespect to user defined threshold ".

Proof: In the core extension phase, CPCA includes nodes which areconnected to the partial module formed so far with core connectiv-ity value more that CC threshold ("). If we look at Eq. (4), the coreconnectivity of a node with a partial cluster is defined as the ratioof number of edges between the node and the partial cluster to thetotal number nodes in the partial cluster. Driven by the threshold ",a node without sufficient edges to the partial cluster formed so faris not included in the partial cluster. Hence, a node must be strongly

connected subject to the threshold " to get included in the partialcluster during the phase of core extension. !

Theorem 2. Peripheral extension phase in CPCA involves inclusionof nodes which are weakly connected to the core region and detachedfrom rest of the nodes.

Proof: In the peripheral extension phase, CPCA decides whether anode can be included in the partial cluster based on its peripheralconnectivity with the core. Peripheral connectivity between a nodeand the core is defined as the ratio of edges between the node andthe core to the total number of edges leaving the node. As presentedin Eq. (5), the denominator in peripheral connectivity computa-tion is the total number of edges leaving the node. The value ofperipheral connectivity does not correspond to how much connec-tivity the node has with the core, instead it determines how muchthe node is detached from nodes outside the core, again based on

Page 6: Core and peripheral connectivity based cluster analysis ...jkalita/papers/2015/AhmedHasinCBC2015.pdf · Core and peripheral connectivity based cluster analysis over PPI network Hasin

H.A. Ahmed et al. / Computational Biology and Chemistry 59 (2015) 32–41 37

the threshold ". Nodes included in the peripheral extension phasecannot be strongly connected as the peripheral extension phaseis carried out after the core extension phase. Termination of thecore extension phase ensures that no node is connected to the corestrongly in terms of core connectivity with respect to threshold ".!

5. Experimental results

We implement CPCA in Matlab and executed it on an HP Z800workstation with two Intel Xeon 2.40 processors and 12 GB RAM.The executable of CPCA can be found at http://agnigarh.tezu.ernet.in/∼dkb/CPCA/. We use two different sources of knowledge to eval-uate quality of the clusters, viz., protein complex gold standarddatasets (Mewes et al., 2004; Hong et al., 2008) and the proteinco-localization dataset (Kumar et al., 2002). We compare results ofthe proposed technique with five network clustering techniques,namely MCODE (Bader and Hogue, 2003), COACH (Wu et al., 2009),ClusterONE (Nepusz et al., 2012), Affinity propagation clustering(Frey and Dueck, 2007) and SCPS (Nepusz et al., 2010). Implementa-tions given by authors and the ClusterMaker package (Morris et al.,2011) are used for these techniques. We use PPI datasets of Gavinet al. (2002), Ho et al. (2002), Ito et al. (2001), Uetz et al. (2000) andVon Mering et al. (2002) to test the quality of clusters.

5.1. Validation using protein complex gold standard datasets

We use MIPS (Mewes et al., 2004), SGD (Hong et al., 2008) andCYC2008 as (Pu et al., 2009) gold standard for evaluation of clusters.Different matching schemes have been used in protein–proteincluster analysis as reported in Section 1. We first adopt the schemeof Brohee and van Helden (2006) where they use sensitivity, PPV(Positive Predictive Value) and accuracy to evaluate clusteringresults. Sensitivity of a clustering solution C with clusters c1, c2,. . ., cn is computed as follows:

Sensitivity =

!ci∈C |ci| × Sensci!

ci∈C |ci|(6)

Here Sensciis the highest overlap score of cluster ci with any

complex in the gold standard, where overlap score between a clus-ter and a complex is the ratio of the number of common proteinsto the total number of proteins in the complex. Similarly, PPV forclustering solution C can be computed as follows:

PPV =

!ci∈C |ci| × PPVci!

ci∈C |ci|(7)

Here PPVciis the highest overlap score of cluster ci with any com-

plex in the gold standard, where the overlap score between a clusterand a protein complex, unlike that of sensitivity computation, isthe ratio of the number of common proteins to the total num-ber of proteins in the cluster. Accuracy of a clustering solution isdefined as the geometric mean of sensitivity and PPV. The PRO-COPE (Krumsiek et al., 2008) tool is used to evaluate effectivenessof produced clusters in terms of accuracy.

We have carried out an analysis of the results produced by CPCAto decide on the value of input parameter Neighborhood and connec-tivity threshold " on all datasets in terms of accuracy. From Figs. 2–8,we observe that CPCA exhibits good performance for " = 0.7 acrossall datasets. All evaluations reported next are carried out with thevalue of " set to 0.7.

Fig. 5 shows a comparison of CPCA with its counterparts in termsof accuracy. On Gavin et al.’s dataset CPCA stands very close to clus-terONE and SCPS, the best performers. On Ho et al.’s dataset, CPCAbeats ClusterONE with a very close margin. CPCA outperforms its

Fig. 2. Tuning " using MIPS gold standard.

Fig. 3. Tuning " using SGD gold standard.

counterparts on Uetz et al. and Ito et al.’s datasets. On Mering et al’sdataset, CPCA lies close to ClusterONE, the best performer. Clus-terONE seems to be quite close to CPCA and performs well overdatasets of Gavin et al., Ho et al. and Mering et al. But it lies behindCPCA on the datasets of Uetz et al. and Ito et al. SCPS is found to

Fig. 4. Tuning " using CYC2008 gold standard.

Page 7: Core and peripheral connectivity based cluster analysis ...jkalita/papers/2015/AhmedHasinCBC2015.pdf · Core and peripheral connectivity based cluster analysis over PPI network Hasin

38 H.A. Ahmed et al. / Computational Biology and Chemistry 59 (2015) 32–41

Fig. 5. Comparison using MIPS gold standard. Program crashed while running SCPSon Ito et al.’s dataset.

Fig. 6. Comparison using SGD gold standard. Program crashed while running SCPSon Ito et al.’s dataset.

Fig. 7. Comparison using CYC2008 gold standard. Program crashed while runningSCPS on Ito et al.’s dataset.

Fig. 8. Average accuracy of CPCA and its counterparts on all datasets.

perform well on the datasets of Gavin et al., Uetz et al. and Mer-ing et al. But it fails to achieve relatively good accuracy with theremaining two datasets. COACH is found to achieve good accuracyonly on Gavin et al.’s dataset.

A similar scenario is observed when comparing CPCA with itscounterparts using SGD as gold standard as presented in Fig. 6.While CPCA beats all on Uetz et al. and Ito et al.’s datasets, it standsclose to clusterONE, the winner on rest of the datasets. ClusterONEis again found to fail to achieve relatively good accuracy on thedatasets of Uetz et al. and Ito et al. SCPS is also found to achievegood accuracy on datasets of Gavin et al., Uetz et al. and Meringet al. Thus, CPCA achieves consistently good accuracy on all thedatasets.

On CYC2008 dataset, as reported in Fig. 8, CPCA generates bestaccuracy over all the datasets except Gavin et al.’s dataset. Fig. 8presents average accuracy of CPCA and its counterparts on alldatasets. In terms of average accuracy, CPCA beats all its contendersover all datasets.

We also use overlap score (as reported in Eq. (1)) based matchingscheme that requires a threshold value (Bader and Hogue, 2003).Based on whether the overlap score is more than the threshold wesay whether a cluster matches a protein complex. Using overlapthreshold value of 0.2 (as used by most of the existing methodsas reported in Table 1), we compute F-measure for set of clustersgenerated by CPCA and it’s counterparts. For a clustering solution Cwith clusters c1, c2, . . ., cn and gold standard P with complexes p1,p2, . . ., pm, precision can be computed as follows (Wu et al., 2014).

Precision =|{ci|ci ∈ C, ∃pj ∈ P, Overlap score(ci, pj) ≥ ˛}|

|C|(8)

Here, Overlap score(ci, pj) is the overlap score between clusterci and complex pj as reported in Eq. 1. ˛ is the overlap thresholdvalue. Similarly, recall can be computed as follows.

Recall =|{pi|pi ∈ P, ∃cj ∈ C, Overlap score(pi, cj) ≥ ˛}|

|P|(9)

F-measure is harmonic mean of precision and recall and can becomputed as follows.

F − measure = 2 × Precision × RecallPrecision + Recall

(10)

Fig. 9 reports F-measure generated by CPCA and it’s counterpartsusing MIPS gold standard. Except Ho et al.’s dataset, CPCA is foundto lie close to the best performer. CPCA generates best F-measurevalues on Uetz et al.’s, Ito et al.’s and Mering et al.’s datasets while

Page 8: Core and peripheral connectivity based cluster analysis ...jkalita/papers/2015/AhmedHasinCBC2015.pdf · Core and peripheral connectivity based cluster analysis over PPI network Hasin

H.A. Ahmed et al. / Computational Biology and Chemistry 59 (2015) 32–41 39

Fig. 9. Comparison using MIPS gold standard. Program crashed while running SCPSon Ito et al.’s dataset.

Fig. 10. Comparison using SGD gold standard. Program crashed while running SCPSon Ito et al.’s dataset.

using SGD as the gold standard as reported in Fig. 10. A similarbehavior was observed by CPCA while CYC2008 dataset is used asgold standard as presented in Fig. 11. Fig. 12 presents average F-measure value of CPCA and its counterparts on all datasets. In termsof average F-measure value, CPCA beats all its contenders whenCYC2008 and SGD dataset is used, but loses to ClusterONE whenMIPS dataset is used as gold standard.

5.2. Validation using co-localization datasets

The protein localization dataset of Kumar et al. (2002) is usedto evaluate the quality of clusters produced by CPCA and its coun-terparts. We use co-localization PPV used by Pu et al. (2007) forevaluating how proteins belonging to the same clusters are co-localized. Co-localization PPV of a cluster ci in clustering solutionC with respect to localization categories l1,l2,. . .,ln is computed asfollows:

co localization PPVci=

maxjCommonci,lj!nk=1Commonci,lk

(11)

Here common(ci, lj) is the number of common proteins betweencluster ci and localization category lj. Co-localization PPV of the

Fig. 11. Comparison using CYC2008 gold standard. Program crashed while runningSCPS on Ito et al.’s dataset.

Fig. 12. Average F-measure value of CPCA and its counterparts on all datasets.

Fig. 13. Comparison using co-localization PPV score.

entire clustering solution C is computed as the weighted average ofthese cluster values.

Fig. 13 presents co-localization PPV scores of the clusters pro-duced by CPCA and its counterparts on all datasets. CPCA is found

Page 9: Core and peripheral connectivity based cluster analysis ...jkalita/papers/2015/AhmedHasinCBC2015.pdf · Core and peripheral connectivity based cluster analysis over PPI network Hasin

40 H.A. Ahmed et al. / Computational Biology and Chemistry 59 (2015) 32–41

to achieve the highest co-localization PPV on Gavin et al. and Hoet al.’s. datasets. On Uetz et al.’s dataset, CPCA is very close to Affin-ity Propagation clustering, the best performer. Though CPCA fails toachieve a relatively high co-localization PPV on Ito et al.’s dataset,it stands close to COACH and ClusterONE, the winners. COACH isfound to consistently achieve good accuracy over all the datasets.CPCA along with ClusterONE achieves good accuracy over all thedatasets except the dataset of Ito et al. Affinity Propagation cluster-ing is also found to achieve good accuracy over datasets of Ho et al.,Uetz et al. and Ito et al.

6. Conclusion

We propose a technique for detection of clusters that cor-respond to complexes in a PPI network. We have formulatedtwo criteria to form densely connected core and weakly con-nected peripheral regions of clusters. The correspondence betweenclusters produced from five benchmark PPI datsets and proteincomplexes is evaluated in terms of accuracy and F-measure usingMIPS and SGD protein complex datasets. CPCA has also been eval-uated using a co-localization dataset. CPCA performs consistentlywell across all the datasets being the best or very close to the bestin most of the cases.

Acknowledgements

This work is supported by Department of Science and Technol-ogy, Government of India through INSPIRE program under AORC(Assured Opportunity for Research Careers) scheme.

Appendix A. Supplementary Data

Supplementary data associated with this article can be found,in the online version, at http://dx.doi.org/10.1016/j.compbiolchem.2015.08.008.

References

Adamcsek, B., Palla, G., Farkas, I.J., Derényi, I., Vicsek, T., 2006. Cfinder: locatingcliques and overlapping modules in biological networks. Bioinformatics 22 (8),1021–1023.

Alexeyenko, A., Schmitt, T., Tjärnberg, A., Guala, D., Frings, O., Sonnhammer, E.L.,2012. Comparative interactomics with funcoup 2.0. Nucl. Acids Res. 40 (D1),D821–D828.

Bader, G.D., Hogue, C.W., 2003. An automated method for finding molecular com-plexes in large protein interaction networks. BMC Bioinformatics 4 (1), 2.

Bader, G.D., Betel, D., Hogue, C.W., 2003. Bind: the biomolecular interaction networkdatabase. Nucl. Acids Res. 31 (1), 248–250.

Brohee, S., van Helden, J., 2006. Evaluation of clustering algorithms forprotein–protein interaction networks. BMC Bioinformatics 7 (1), 488.

Collins, S.R., Kemmeren, P., Zhao, X.-C., Greenblatt, J.F., Spencer, F., Holstege, F.C.,Weissman, J.S., Krogan, N.J., 2007. Toward a comprehensive atlas of the physicalinteractome of Saccharomyces cerevisiae. Mol. Cell. Proteomics 6 (3), 439–450.

Costanzo, M.C., Crawford, M.E., Hirschman, J.E., Kranz, J.E., Olsen, P., Robertson, L.S.,Skrzypek, M.S., Braun, B.R., Hopkins, K.L., Kondu, P., et al., 2001. YPD, PombePD,and WormPD: model organism volumes of the bioknowledge library an inte-grated resource for protein information. Nucl. Acids Res. 29 (1), 75–79.

Della Rossa, F., Dercole, F., Piccardi, C., 2013. Profiling core-periphery network struc-ture by random walkers. Scientific Reports 3.

Dongen, S.M., 2000. Graph Clustering by Flow Simulation.Frey, B.J., Dueck, D., 2007. Clustering by passing messages between data points.

Science 315 (5814), 972–976.Frings, O., Alexeyenko, A., Sonnhammer, E.L., 2013. MGclus: network clustering

employing shared neighbors. Mol. Biosyst. 9 (7), 1670–1675.Gavin, A.-C., Bösche, M., Krause, R., Grandi, P., Marzioch, M., Bauer, A., Schultz, J.,

Rick, J.M., Michon, A.-M., Cruciat, C.-M., et al., 2002. Functional organization ofthe yeast proteome by systematic analysis of protein complexes. Nature 415(6868), 141–147.

Gavin, A.-C., Aloy, P., Grandi, P., Krause, R., Boesche, M., Marzioch, M., Rau, C., Jensen,L.J., Bastuck, S., Dümpelfeld, B., et al., 2006. Proteome survey reveals modularityof the yeast cell machinery. Nature 440 (7084), 631–636.

Giot, L., Bader, J.S., Brouwer, C., Chaudhuri, A., Kuang, B., Li, Y., Hao, Y., Ooi, C., Godwin,B., Vitols, E., et al., 2003. A protein interaction map of Drosophila melanogaster.Science 302 (5651), 1727–1736.

Hartwell, L.H., Hopfield, J.J., Leibler, S., Murray, A.W., 1999. From molecular to mod-ular cell biology. Nature 402, C47–C52.

Hermjakob, H., Montecchi-Palazzi, L., Lewington, C., Mudali, S., Kerrien, S., Orchard,S., Vingron, M., Roechert, B., Roepstorff, P., Valencia, A., et al., 2004. Intact:an open source molecular interaction database. Nucl. Acids Res. 32 (Suppl. 1),D452–D455.

Ho, Y., Gruhler, A., Heilbut, A., Bader, G.D., Moore, L., Adams, S.-L., Millar, A., Taylor, P.,Bennett, K., Boutilier, K., et al., 2002. Systematic identification of protein com-plexes in Saccharomyces cerevisiae by mass spectrometry. Nature 415 (6868),180–183.

Hong, E.L., Balakrishnan, R., Dong, Q., Christie, K.R., Park, J., Binkley, G., Costanzo,M.C., Dwight, S.S., Engel, S.R., Fisk, D.G., et al., 2008. Gene ontology annotationsat sgd: new data sources and annotation methods. Nucl. Acids Res. 36 (Suppl.1), D577–D581.

Ito, T., Chiba, T., Ozawa, R., Yoshida, M., Hattori, M., Sakaki, Y., 2001. A comprehensivetwo-hybrid analysis to explore the yeast protein interactome. Proc. Natl. Acad.Sci. U. S. A. 98 (8), 4569–4574.

Ji, J., Zhang, A., Liu, C., Quan, X., Liu, Z., 2014. Survey: Functional module detectionfrom protein–protein interaction networks. IEEE Trans. Knowl. Data Eng. 26 (2),261–277.

King, A.D., Przulj, N., Jurisica, I., 2004. Protein complex prediction via cost-basedclustering. Bioinformatics 20 (17), 3013–3020.

Krogan, N.J., Cagney, G., Yu, H., Zhong, G., Guo, X., Ignatchenko, A., Li, J., Pu, S., Datta,N., Tikuisis, A.P., et al., 2006. Global landscape of protein complexes in the yeastSaccharomyces cerevisiae. Nature 440 (7084), 637–643.

Krumsiek, J., Friedel, C.C., Zimmer, R., 2008. Procopeprotein complex prediction andevaluation. Bioinformatics 24 (18), 2115–2116.

Kumar, A., Agarwal, S., Heyman, J.A., Matson, S., Heidtman, M., Piccirillo, S., Umansky,L., Drawid, A., Jansen, R., Liu, Y., et al., 2002. Subcellular localization of the yeastproteome. Genes Dev. 16 (6), 707–719.

Lehne, B., Schlitt, T., 2009. Protein–protein interaction databases: keeping up withgrowing interactomes. Hum. Genomics 3 (3), 291.

Leung, H.C., Xiang, Q., Yiu, S.-M., Chin, F.Y., 2009. Predicting protein com-plexes from PPI data: a core-attachment approach. J. Comput. Biol. 16 (2),133–144.

Li, S., Armstrong, C.M., Bertin, N., Ge, H., Milstein, S., Boxem, M., Vidalain, P.-O., Han,J.-D.J., Chesneau, A., Hao, T., et al., 2004. A map of the interactome network ofthe metazoan C. elegans. Science 303 (5657), 540–543.

Li, M., Chen, W., Wang, J., Wu, F.-X., Pan, Y., 2014. Identifying dynamic proteincomplexes based on gene expression profiles and ppi networks. BioMed Res.Int.

Liu, G., Wong, L., Chua, H.N., 2009. Complex discovery from weighted ppi networks.Bioinformatics 25 (15), 1891–1897.

Luo, F., Li, B., Wan, X.-F., Scheuermann, R.H., 2009. Core and periphery structures inprotein interaction networks. BMC Bioinformatics 10 (Suppl. 4), S8.

Mewes, H.-W., Amid, C., Arnold, R., Frishman, D., Güldener, U., Mannhaupt, G., Mün-sterkötter, M., Pagel, P., Strack, N., Stümpflen, V., et al., 2004. Mips: analysisand annotation of proteins from whole genomes. Nucl. Acids Res. 32 (Suppl. 1),D41–D44.

Morris, J.H., Apeltsin, L., Newman, A.M., Baumbach, J., Wittkop, T., Su, G., Bader,G.D., Ferrin, T.E., 2011. clusterMaker: a multi-algorithm clustering plugin forcytoscape. BMC Bioinformatics 12 (1), 436.

Nelson, D.L., Lehninger, A.L., Cox, M.M., 2008. Lehninger Principles of Biochemistry.Macmillan.

Nepusz, T., Sasidharan, R., Paccanaro, A., 2010. SCPS: a fast implementation of aspectral method for detecting protein families on a genome-wide scale. BMCBioinformatics 11 (1), 120.

Nepusz, T., Yu, H., Paccanaro, A., 2012. Detecting overlapping protein complexes inprotein–protein interaction networks. Nat. Methods 9 (5), 471–472.

Peri, S., Navarro, J.D., Amanchy, R., Kristiansen, T.Z., Jonnalagadda, C.K., Suren-dranath, V., Niranjan, V., Muthusamy, B., Gandhi, T., Gronborg, M., et al.,2003. Development of human protein reference database as an initial plat-form for approaching systems biology in humans. Genome Res. 13 (10),2363–2371.

Phizicky, E.M., Fields, S., 1995. Protein–protein interactions: methods for detectionand analysis. Microbiol. Rev. 59 (1), 94–123.

Pu, S., Vlasblom, J., Emili, A., Greenblatt, J., Wodak, S.J., 2007. Identifying functionalmodules in the physical interactome of Saccharomyces cerevisiae. Proteomics 7(6), 944–960.

Pu, S., Wong, J., Turner, B., Cho, E., Wodak, S.J., 2009. Up-to-date catalogues of yeastprotein complexes. Nucl. Acids Res. 37 (3), 825–831.

Rual, J.-F., Venkatesan, K., Hao, T., Hirozane-Kishikawa, T., Dricot, A., Li, N., Berriz,G.F., Gibbons, F.D., Dreze, M., Ayivi-Guedehoussou, N., et al., 2005. Towards aproteome-scale map of the human protein–protein interaction network. Nature437 (7062), 1173–1178.

Spirin, V., Mirny, L.A., 2003. Protein complexes and functional modules in molecularnetworks. Proc. Natl. Acad. Sci. U. S. A. 100 (21), 12123–12128.

Stark, C., Breitkreutz, B.-J., Reguly, T., Boucher, L., Breitkreutz, A., Tyers, M., 2006.Biogrid: a general repository for interaction datasets. Nucl. Acids Res. 34 (Suppl.1), D535–D539.

Tong, A.H.Y., Drees, B., Nardelli, G., Bader, G.D., Brannetti, B., Castagnoli, L., Evange-lista, M., Ferracuti, S., Nelson, B., Paoluzi, S., et al., 2002. A combined experimentaland computational strategy to define protein interaction networks for peptiderecognition modules. Science 295 (5553), 321–324.

Uetz, P., Giot, L., Cagney, G., Mansfield, T.A., Judson, R.S., Knight, J.R., Lockshon, D.,Narayan, V., Srinivasan, M., Pochart, P., et al., 2000. A comprehensive analysis

Page 10: Core and peripheral connectivity based cluster analysis ...jkalita/papers/2015/AhmedHasinCBC2015.pdf · Core and peripheral connectivity based cluster analysis over PPI network Hasin

H.A. Ahmed et al. / Computational Biology and Chemistry 59 (2015) 32–41 41

of protein–protein interactions in Saccharomyces cerevisiae. Nature 403 (6770),623–627.

Vanunu, O., Magger, O., Ruppin, E., Shlomi, T., Sharan, R., 2010. Associating genes andprotein complexes with disease via network propagation. PLoS Comput. Biol. 6(1), e1000641.

Von Mering, C., Krause, R., Snel, B., Cornell, M., Oliver, S.G., Fields, S., Bork, P., 2002.Comparative assessment of large-scale data sets of protein–protein interactions.Nature 417 (6887), 399–403.

Wang, J., Li, M., Deng, Y., Pan, Y., 2010. Recent advances in clustering methods forprotein interaction networks. BMC Genomics 11 (Suppl. 3), S10.

Wu, M., Li, X., Kwoh, C.-K., Ng, S.-K., 2009. A core-attachment based method to detectprotein complexes in PPI networks. BMC Bioinformatics 10 (1), 169.

Wu, H., Gao, L., Dong, J., Yang, X., 2014. Detecting overlapping protein complexesby rough-fuzzy clustering in protein–protein interaction networks. PLOS ONE 9(3), e91856.

Xenarios, I., Rice, D.W., Salwinski, L., Baron, M.K., Marcotte, E.M., Eisenberg, D., 2000.DIP: the database of interacting proteins. Nucl. Acids Res. 28 (1), 289–291.

Zanzoni, A., Montecchi-Palazzi, L., Quondam, M., Ausiello, G., Helmer-Citterich, M.,Cesareni, G., 2002. MINT: a molecular interaction database. FEBS Lett. 513 (1),135–140.

Zhang, X.-F., Dai, D.-Q., Ou-Yang, L., Yan, H., 2014. Detecting overlapping proteincomplexes based on a generative model with functional and topological prop-erties. BMC Bioinformatics 15 (1), 186.