10
260 Genome Informatics 16(2): 260–269 (2005) Interaction Graph Mining for Protein Complexes Using Local Clique Merging Xiao-Li Li 1 Soon-Heng Tan 1, 2 [email protected] [email protected] Chuan-Sheng Foo 1, 3 See-Kiong Ng 1 [email protected] [email protected] 1 Institute for Infocomm Research, 21 Heng Mui Keng Terrace, Singapore 119613 2 School of Computing, National University of Singapore, Singapore 119260 3 Computer Science Department, Stanford University, Stanford CA 94305-9025 USA Abstract While recent technological advances have made available large datasets of experimentally- detected pairwise protein-protein interactions, there is still a lack of experimentally-determined protein complex data. To make up for this lack of protein complex data, we explore the mining of existing protein interaction graphs for protein complexes. This paper proposes a novel graph mining algorithm to detect the dense neighborhoods (highly connected regions) in an interaction graph which may correspond to protein complexes. Our algorithm first locates local cliques for each graph vertex (protein) and then merge the detected local cliques according to their affinity to form maximal dense regions. We present experimental results with yeast protein interaction data to demonstrate the effectiveness of our proposed method. Compared with other existing tech- niques, our predicted complexes can match or overlap significantly better with the known protein complexes in the MIPS benchmark database. Novel protein complexes were also predicted to help biologists in their search for new protein complexes. Keywords: protein complex, protein interaction graph, local clique, merging 1 Introduction Biological processes in the cell are mostly undertaken by complex interactions between protein molecules. Many involved bio-molecular entities called protein complexes—these are molecular aggregations of proteins assembled from multiple (typically non-pairwise) stable protein-protein interactions. Protein complexes can vary widely in size, and they play crucial roles in many cellular processes. Prominent examples include the ribosomes for protein biosynthesis, the proteasomes for breaking down proteins, and the nuclear pore complexes for regulating proteins passing through the nuclear membrane. With individual proteins as molecular units, protein complexes correspond to higher-order functional units in biological processes. In fact, many proteins are functional only after they are assembled into a complex. Elucidating such protein complexes is therefore an important research focus in cell and molecular biology. However, despite recent advances in protein interaction detection technologies, only a very small subset of the many possible protein complexes has been experimentally determined [14]. On the other hand, many high throughput experimental techniques (e.g. yeast-two-hybrid) have enabled the detection of pairwise protein-protein interactions en masse. Large sets of interaction data are now readily available in public databases for data mining and knowledge discovery such as mining potential domain-domain interactions [12], interacting motifs [16], as well as protein complexes [1, 10]. Typically, protein interaction data are modeled as an undirected graph where the vertices rep- resent unique proteins and edges denote interactions between two proteins. Previous work by Tong

Interaction Graph Mining for Protein Complexes Using Local Clique

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Interaction Graph Mining for Protein Complexes Using Local Clique

260 Genome Informatics 16(2): 260–269 (2005)

Interaction Graph Mining for Protein Complexes Using

Local Clique Merging

Xiao-Li Li1 Soon-Heng Tan1,2

[email protected] [email protected]

Chuan-Sheng Foo1,3 See-Kiong Ng1

[email protected] [email protected] Institute for Infocomm Research, 21 Heng Mui Keng Terrace, Singapore 1196132 School of Computing, National University of Singapore, Singapore 1192603 Computer Science Department, Stanford University, Stanford CA 94305-9025 USA

Abstract

While recent technological advances have made available large datasets of experimentally-detected pairwise protein-protein interactions, there is still a lack of experimentally-determinedprotein complex data. To make up for this lack of protein complex data, we explore the miningof existing protein interaction graphs for protein complexes. This paper proposes a novel graphmining algorithm to detect the dense neighborhoods (highly connected regions) in an interactiongraph which may correspond to protein complexes. Our algorithm first locates local cliques foreach graph vertex (protein) and then merge the detected local cliques according to their affinityto form maximal dense regions. We present experimental results with yeast protein interactiondata to demonstrate the effectiveness of our proposed method. Compared with other existing tech-niques, our predicted complexes can match or overlap significantly better with the known proteincomplexes in the MIPS benchmark database. Novel protein complexes were also predicted to helpbiologists in their search for new protein complexes.

Keywords: protein complex, protein interaction graph, local clique, merging

1 Introduction

Biological processes in the cell are mostly undertaken by complex interactions between protein molecules.Many involved bio-molecular entities called protein complexes—these are molecular aggregations ofproteins assembled from multiple (typically non-pairwise) stable protein-protein interactions. Proteincomplexes can vary widely in size, and they play crucial roles in many cellular processes. Prominentexamples include the ribosomes for protein biosynthesis, the proteasomes for breaking down proteins,and the nuclear pore complexes for regulating proteins passing through the nuclear membrane. Withindividual proteins as molecular units, protein complexes correspond to higher-order functional unitsin biological processes. In fact, many proteins are functional only after they are assembled into acomplex. Elucidating such protein complexes is therefore an important research focus in cell andmolecular biology.

However, despite recent advances in protein interaction detection technologies, only a very smallsubset of the many possible protein complexes has been experimentally determined [14]. On theother hand, many high throughput experimental techniques (e.g. yeast-two-hybrid) have enabled thedetection of pairwise protein-protein interactions en masse. Large sets of interaction data are nowreadily available in public databases for data mining and knowledge discovery such as mining potentialdomain-domain interactions [12], interacting motifs [16], as well as protein complexes [1, 10].

Typically, protein interaction data are modeled as an undirected graph where the vertices rep-resent unique proteins and edges denote interactions between two proteins. Previous work by Tong

Page 2: Interaction Graph Mining for Protein Complexes Using Local Clique

Interaction Graph Mining for Protein Complexes 261

and Drees [17] has revealed that protein complexes generally correspond to dense regions (densesub-graphs) or even cliques (fully connected subgraphs) in proteininteraction graphs. In fact, Spirin andMirny [15] attempted to detect protein complexes and functional modules by finding cliques in proteininteraction networks. However, their approach is limited by the use of cliques which can be problem-atic given that current available interaction data is incomplete. Other than clique-mining approaches,several clustering algorithms have also been proposed to identify the dense regions in a given graph bypartitioning graphs into disjoint clusters [6, 19]. However, as they require that each vertex (protein)belongs to one specific cluster, these algorithms are not suitable for finding complexes in interactiongraphs because a protein may be involved in multiple complexes.

Recently, Bader and Hogue [1] proposed the MCODE algorithm that utilizes connectivity valuesin protein interaction graphs to mine for protein complexes. The algorithm first computes the vertexweighting (vertex weighting step) from its neighbor density and then traverses outward from a seedprotein with a high weighting value (complex prediction step) to recursively include neighboring ver-tices whose weights are above a given threshold. However, since the highly weighted vertices may notbe highly connected to each other, the algorithm does not guarantee that the discovered regions aredense. In fact, in the post preprocessing step of the MCODE algorithm, there was a need to filter for2-core (graph of minimum degree 2), which means that the vertices in the discovered regions were notalways dense or even at least 2-degrees.

More recently, King et al. [10] proposed to use a restricted neighborhoods search clustering algo-rithm (RNSC) to predict protein complexes by partitioning the protein-protein interaction networkusing a cost function. However, like many clustering algorithms, their results depended heavily on thequality of the initial clustering which is random or user-defined. Also, relatively fewer complexes werepredicted by this algorithm.

In this paper, we propose an effective and efficient algorithm to predict protein complexes fromprotein-protein interaction graphs using local clique merging. First, we locate local cliques in aninteraction graph using our proposed polynomial time algorithm. Our algorithm then merges the localcliques into bigger dense graphs for protein complex identification. Note that we predict complexesbased on dense graphs rather than cliques. As there are no requirements for the dense graphs to befully connected, our algorithm is less sensitive to incomplete protein interaction data than conventionalclique detection methods. We evaluated our method using yeast protein interaction data and foundthat the F-measures of the protein complexes predicted by our approach were significantly higher thanthose detected by other computational techniques.

2 Method

In this paper, we propose to mine an interaction graph Gppi for protein complexes using a Local CliqueMerging Algorithm (LCMA) to identify maximal dense subgraphs in Gppi for protein complexes. Letus define the concept of a dense graph based on the clustering coefficient of the graph. Suppose wehave a subgraph G′ = (V ′, E′), where V ′ is a subset of vertices (proteins) and E′ a subset of edges(interactions) from Gppi. Theoretically, the maximum number of edges for the undirected graph is|V ′| ∗ (|V ′| − 1)/2. We define the density (cluster coefficient) of a graph as follows.

Definition 1 The density of the graph G′ = (V ′, E′) is defined as:

cc(G′) =|E′|

|V ′| ∗ |V ′ − 1|/2=

2 ∗ |E′||V ′| ∗ |V ′ − 1|

. (1)

Basically, 0 ≤ cc(G′) ≤ 1. If cc(G′) = 1, then G′ is the fully connected graph or a clique, whichhas the maximum number of edges.

Our LCMA algorithm consists of two basic steps to find maximal dense subgraphs in Gppi. Thefirst step computes the local cliques for all the vertices in Gppi. This is based on the observation that

Page 3: Interaction Graph Mining for Protein Complexes Using Local Clique

262 Li et al.

a maximal dense region covering vertices {v1, . . . vk} in Gppi must necessarily contain the local cliques(if any) of the vertices from {v1, . . . vk}. Then, in the second step, we merge these local cliques toform maximal dense graphs.

2.1 Mining for Local Cliques

The first step of our LCMA algorithm is to find the local cliques in the graph Gppi. For each vertexfrom graph Gppi, we first get its initial local neighborhood graph - namely, vi, all its neighbors andthe edges between the neighbors in graph Gppi.

Definition 2 Let a graph G = (V,E). For each vertex vi ∈ V , its local neighborhood graph Gvi =(Vvi , Evi), where Vvi = {vi} ∪ {v|v ∈ V, (v, vi) ∈ E}, Evi = {(vj , vk)|(vj , vk) ∈ E, vj , vk ∈ Vvi}.

Figure 1: An exampleof mining a local cliquefrom a local neighbor-hood graph.

Our proposed LCMA algorithm then graphically uncovers the localcliques for each vertex by iteratively removing loosely connected neighbor-ing vertices. This is an iterative process that stops when the density (clustercoefficient) of the local neighborhoods cannot be increased further. Figure 1shows an example of mining the local cliques for a vertex 1. In the graph, thevertex 1’s neighbors are: 6, 7, 2, 4, 5, and 3. Note we sort all the neighborsaccording to their degrees in the vertex’s local neighborhood graph. Thenwe iteratively remove a neighbor vertex if it increases the density—here, theneighbors 6, 7, 2 are removed correspondingly. This results in the final localclique shown in the circled area, which is a clique that consists of vertex setD = {1, 3, 4, 5} and density cc(D) = 1 (|V | = 4, and |E| = 6).

The details of our LCMA algorithm to mine local cliques (LC) fromlocal neighborhood graphs are in Algorithm 1. In the algorithm, steps 7 to20 comprise the main loop. The algorithm first computes the densities λ forAL(v)

⋃{v}. Then, it iteratively removes the loosely connected vertices until

the density of the local neighborhood graph does not increase. In particular,step 11 finds the vertex with the least degree in v’s neighborhood and then checks if it should beremoved. If the density of the graph is increased after its removal (step 12), only then it is removedin step 13, followed by steps 14 and 15 to update the vertex degrees in the local neighborhood graphand assign the new density to the refined graph. Otherwise, there are no loosely connected vertices toremove and we stop the removing process. Finally, steps 21 to 25 remove all cliques consisting of only2 vertices. This filter ensures that the our discovered graph is dense (fully connected) and contains atleast three proteins.

The time complexity of this local clique finding step is O(m ∗ log(m)), where m is the number ofedges in graph Gppi. The main time cost is mainly for building the sorted adjacency lists (steps 3 to6). However, the subsequent iterative removal of the loosely connected vertices can be performed inlinear time after we have sorted the adjacency lists.

Next, we show that each local dense neighborhood in the output set LC is a clique.

Theorem 1 Every graph in LC is a clique.

Proof: For each vertex v ∈ Gppi, our algorithm first finds the vertex v′ with the least degree inv’s local neighborhood graph Gv = (Vv, Ev), i.e.

v′ = arg minvi

deg(vi), vi ∈ Vv. (2)

The inner loop (steps 10 to 19) of the algorithm terminates when the removal of the vertex v′ doesnot result in an increase in the density of the resultant graph. This condition is stated in equation 3.

2|Ev||Vv|(|Vv| − 1)

≥ 2(|Ev| − deg(v′))(|Vv| − 1)(|Vv| − 2|)

. (3)

Page 4: Interaction Graph Mining for Protein Complexes Using Local Clique

Interaction Graph Mining for Protein Complexes 263

Algorithm 1 LCMA algorithm(step 1):Mining local cliques LC from local neighborhood graphs1: BEGIN2: Set LC = ∅;3: for each vertex v ∈ Gppi do4: Construct v’s Adjacency List AL(v);5: Sort AL(v) according to their degree in v’s neighborhood;6: end for7: for each vertex v ∈ Gppi do8: Compute the density (clustering coefficient) λ = cc(AL(v) ∪ {v});9: Stop = False;

10: repeat11: Find the vertex v′ with minimum degree in AL(v), v′ = arg min

vi

d(vi);

12: if (cc(AL(v) ∪ {v} − {v′}) > λ) then13: remove v′ from AL(v);14: update the degree for vertices that connected with v′ in AL(v);15: λ = cc(AL(v) ∪ {v} − {v′});16: else17: Stop = True;18: end if19: until (Stop = True)20: end for21: for each graph g ∈ LC do22: if |g| ≤ 2 then23: LC = LC − {g};24: end if25: end for26: END

Solving equation 3 for |Ev| yields

|Ev| ≤deg(v′)|Vv|

2. (4)

Since each edge connects two vertices,

|Ev| =

|Vv |∑

i=1

deg(vi)

2. (5)

From (2), deg(v′) ≤ deg(vi), ∀vi ∈ Vv, and deg(v) = |Vv| − 1. Therefore,

|Ev| ≥(|Vv| − 1)deg(v′) + (|Vv| − 1)

2, (6)

|Ev| ≥(|Vv| − 1)(deg(v′) + 1)

2. (7)

Combining (4) and (7)deg(v′)|Vv|

2≥ (|Vv| − 1)(deg(v′) + 1)

2, (8)

deg(v′) ≥ |Vv| − 1. (9)

Equation 9 shows that the algorithm terminates only when the degree of v′ is |Vv|−1, the maximumpossible. This implies that all other vertices will also have degree |Vv|−1. In other words, the resultinggraph is a fully connected one, namely a clique. ¤

Page 5: Interaction Graph Mining for Protein Complexes Using Local Clique

264 Li et al.

2.2 Merging for Maximal Dense Neighborhoods

In an interaction graph with potentially incomplete interaction data, it is more likely for a large proteincomplex to be presented as a maximal dense neighborhood consisting of various local cliques than asa single large clique within the interaction graph. To detect such dense graphs which can match thelarger complexes better, the LCMA algorithm performs a merging step after the local cliques havebeen identified, adopting the heuristic of merging overlapping neighborhoods with comparable sizes.

Definition 3 Neighborhood Affinity. Given two neighborhoods (subgraphs) A and B, we define theNeighborhood Affinity NA between them as

NA(A,B) =|A ∩ B|2

|A| ∗ |B|. (10)

Equation 10 quantifies the degree of similarity between neighborhoods. If two neighborhoods havelarger intersection sets and similar sizes, then they are more similar and have bigger neighborhoodaffinity. Note that if one neighborhood’s size, e.g. |B|, is much bigger than |A|, then NA(A,B) willbe small since |A ∩ B|/|A| < 1 and |A ∩ B| ¿ |B|.

Given a set LC of local cliques, our LCMA algorithm tries to merge the local cliques that havea similarity greater than a threshold ω. The merging process is performed iteratively to update LC.The details of the local clique merging step of our LCMA algorithm are presented in algorithm 2. Theinput of the algorithm are the local cliques in LC, while the output—the predicted complexes—will bestored in the set C. Steps 4-24 comprise the main loop of the algorithm. In steps 6-14, we compute,for each local clique ni, its neighborhood similarity with all other local cliques in LC. Neighborhoodswith similarities greater than ω are then merged with ni. Both merged complexes(step 10) and thelocal cliques that did not undergo merging (steps 15-17) are added into C. In steps 18-23, we usethe average density (AD) in each iteration as a stop criteria — if the new average density of thecurrent iteration does not decrease much (i.e. less than 95% of last average density) and there arenew complexes produced from merging, then we update the average density value and continue themerging process. Otherwise the algorithm will stop.

The time complexity of this local clique merging step is O(l ∗ r2 ∗ v), where l is the number ofiterations(a constant and can thus be ignored), r the size of LC (which is smaller than the vertexnumber |V |), and v the average number of proteins in the local cliques. Usually v has small values,e,g, the average protein number of complexes in MIPS is around 6.38 [11].

3 Experimental Results

In this section, we evaluate our proposed LCMA algorithm on an actual interaction graph generatedfrom the experimental protein-protein interaction data of yeast. We have chosen to use yeast inter-action data to infer protein complexes as it is currently the organism with the most comprehensiveexperimental datasets available publicly. In order to show the effectiveness of our algorithm, we willcompare our results with the results from existing algorithms using the same datasets.

3.1 Dataset for Protein Interaction Graph

We use the same dataset collected by Bader and Hogue[1] for their MCODE algorithm to constructour protein interaction graph for mining complexes. The dataset was assembled from all machine-readable resources: Uetz [18], Ito [8], Drees [3], Fromont-Racine [4], Ho [7], Gavin [5], Tong [17],Mewes(MIPS) [11], Costanzo(YPD) [2]. In total, it consists of 15,143 experimentally determinedprotein-protein interactions among 4,825 yeast proteins.

Page 6: Interaction Graph Mining for Protein Complexes Using Local Clique

Interaction Graph Mining for Protein Complexes 265

Algorithm 2 LCMA algorithm(step 2):Mining maximal dense neighborhoods by merging local cliques1: BEGIN2: Stop = False;3: Average Density AD=

∑|LC|i=1 cc(ni)/|LC|, ni ∈ LC;

4: repeat5: C = ∅;6: for each ni ∈ LC do7: S = {ni};8: for each nj ∈ LC, j 6= i do9: if NA(nj , ni) > ω then

10: S = S ∪ {nj};11: end if12: end for13: C = C ∪ S;14: end for15: for each ni ∈ LC not used in a merge do16: C = C ∪ {ni};17: end for18: if

∑Ci=1 cc(ni)/|C| > 0.95 ∗ AD, ni ∈ C then

19: AD =∑|C|

i=1 cc(ni)/|C|, ni ∈ C;20: LC = C;21: else22: Stop = True;23: end if24: until (Stop = True) or (no new complexes are produced from merging)25: END

3.2 Protein Complex Gold Standard and Evaluation Metric

The benchmark that we use to evaluate our method against is a dataset of known yeast proteincomplexes retrieved from the MIPS: Comprehensive Yeast Genome Database (ftp://ftpmips.gsf.de/yeast/). Entries in this dataset have been curated from the biomedical literature. Note thatwhile it is currently one of the most comprehensive public datasets of yeast complexes available, itis still by no means a complete dataset — there are still many yeast complexes that remained tobe detected. After filtering the predicted protein complexes from the dataset, we obtained a finalset of 215 yeast complexes as our gold standard for evaluation. In it, the biggest protein complex,cytoplasmic ribosomes, contains 81 proteins while the average protein number of complexes is 6.38.

In addition to using the same MIPS dataset as the gold standard to evaluate the results, we alsoassess our proposed algorithm using the same evaluation metric adopted in the MCODE paper [1],where the definition of NA (definition 3) was used to determine matching between predicted complexesin P and MIPS complexes. For a predicted complex p ∈ P and a complex m ∈ MIPS, we consider thetwo complexes to be matching if NA(p,m) ≥ 0.2 which is the same threshold used in MCODE.

The set of true positives (TP ) is defined as TP = {p|NA(p,m) ≥ 0.2, p ∈ P, m ∈ MIPS }, namelythe predicted complexes with affinity NA (with MIPS complexes) at least 0.2. The set of false positives(FP ) is defined as FP = P −TP , namely the predicted complexes that are not in TP . The set of falsenegatives (FN) is defined as FN = {m|∀p(NA(p,m) < 0.2), p ∈ P, m ∈MIPS}, namely the knownMIPS complexes not matched by predicted complexes. Then the recall and precision can be definedas: Recall = |TP |/(|TP | + |FN |) and Precision = |TP |/(|TP | + |FP |).

In this paper, we use the F-measure to evaluate the performance of different techniques. F-measure

Page 7: Interaction Graph Mining for Protein Complexes Using Local Clique

266 Li et al.

Figure 2: An example of mining a localclique.

Figure 3: Merging multiple local cliques to match MIPScomplex.

takes into account of both precision and recall and is defined as F − measure = 2 ∗ Precision ∗Recall/(Precision+Recall). Note that because our reference set MIPS is incomplete, some predictedcomplexes which probably are true complexes will be regarded as false positives (FP ) if they do notmatch well with the current MIPS complexes. As such, the F-measure of the algorithms should notbe taken at their absolute values but only as comparative measures.

3.3 Experimental Results

Our LCMA algorithm aims to achieve high recall by trying to discover most of the dense regions inthe protein interaction graph by using the local cliques for each vertex as seeds. Our algorithm alsofilters away the loosely connected vertexes which could correspond to experimental false positives [13],to enhance its robustness (hence precision) against the high error rates in current protein interactiondata.

Figure 2 shows an example to illustrate the biological significance of mining the local cliques fora vertex YAL007C (a yeast protein) in our data. The vertices shown as rectangles form local cliquesfor vertex YAL007C. It is a clique containing 4 proteins that are fully connected with each other.All the other neighbors (shown as circle vertices) connect with YAL007C with one edge, which arepossibly false positive interactions [13]. In fact, on further checking on the functional informationof the proteins in the neighborhoods in MIPS (http://mips.gsf.de/desc/yeast), we found thatall the proteins in the cliques have the same function label of “vesicular transport (Golgi network,etc.)”(MIPS code “20.09.07”), while all the other proteins have different functions (not listed). Thisexample shows that the local cliques can lead to functionally consistent protein groups (hence morelikely to be members of a complex for a specific biological function), and our algorithm was effectivein removing the functionally inconsistent nodes from the neighborhoods.

The second step of our LCMA algorithm aims to further improve the precision by merging overlap-ping local cliques into maximal dense neighborhoods. Figure 3 shows an example in which an actual

Page 8: Interaction Graph Mining for Protein Complexes Using Local Clique

Interaction Graph Mining for Protein Complexes 267

MIPS complex cannot be matched by any individual local cliques but can only be discovered fromthe underlying interaction graph using the merging step to uncover the corresponding maximal denseneighborhood. The complex predicted by our method LCMA (shown in the figure within the largercircle) actually corresponds to a large MIPS complex consisting of 31 proteins. Let us look at how ouralgorithm arrives at this complex in closer details. First, in the algorithm’s local clique mining step,11 local cliques were found: other than YNL137C and YGR084C, each protein shown between the twocircles in the figure forms a clique of size 4 with the three proteins in the inner circle, while YNL137Cand YGR084C form a clique of size 5 with the inner proteins. The LCMA algorithm’s second stepthen merges all the proteins within bigger circle into a bigger dense neighborhood of 15 proteins, ofwhich 14 are parts of the MIPS complex. The remaining protein (YDR036C) in the bigger denseneighborhood have unknown function and could be an unknown component of the MIPS complex —its true membership, however, will have to be ascertained by the biologists experimentally. Obviously,we still have not discovered the entire 31-protein complex even with local clique merging to improvecoverage. However, this is could be due to the incompleteness of the protein interaction dataset usedto construct the underlying interaction graph.

Figure 4 shows the comparative results based on F-measures of different techniques, namely, theMCODE algorithm [1], and our LCMA algorithm with different values for the parameter ω for merginglocal cliques. For comparison, we also include the full clique results obtained by enumerating all non-redundant cliques(starting with an empty set, vertices were recursively added to the set as long asthe clique condition was satisfied). Note that unlike our scalable LCMA algorithm, full clique-findingalgorithms are not scalable to bigger protein interaction graphs.

Figure 4: Comparison of different techniques to match MIPScomplexes.

In terms of F-measure, our LCMAalgorithm achieves the best results.Compared with MCODE algorithm,LCMA algorithm with ω = 0 was ableto achieve 15.99% and 6.6% higher F-measure than MCODE algorithm andfull clique results respectively. Ourmethod can achieve 44.1% higher re-call and 4.4% higher precision thanMCODE. In addition, both LCMA andfull clique results can achieve higherrecall, but our LCMA algorithm canachieve 7.4% higher precision thanclique results.

We also tested our algorithm withdifferent ω values, namely ω = 0.5, ω =0.25, and ω = 0. We observed thatgenerally the recall does not change alot (71.3%-73.9%) while the precision increases with decreasing ω values (23.1%-30.2%). When ω = 0,our LCMA algorithm predicted 873 complexes, of which 264 matched 108 MIPS benchmark complexes(|TP |=264, |FP |=609, |FN |=107). The average Neighborhood Affinity between predicted complexesthat matched benchmark complexes and their best matched benchmark complex is 0.40.

Note that when ω is larger than 0.60, local cliques can seldom be merged because 0.60 or aboveis a strict threshold. For example, consider two local cliques A and B, both consisting of 12 proteins.Even if they have 9 overlapping proteins, the NA(A,B) is only 9 ∗ 9/(12 ∗ 12) = 0.56 which is belowthe threshold to qualify for merging. On the contrary, if ω is a smaller value, it is easier for twoneighborhood local cliques to merge. When ω is set to 0, all pairs of the local cliques will be mergedas long as there is a single common protein.

A more recent system [10] used a so-called “restricted neighborhoods search clustering” algorithm

Page 9: Interaction Graph Mining for Protein Complexes Using Local Clique

268 Li et al.

(RNSC) to predict protein complexes from protein interaction graph. RNSC predicted only 45 com-plexes which matched 30 MIPS complexes. In comparison, our LCMA algorithm was able to identifymany more of the MIPS complexes from the protein interaction graph.

Our LCMA algorithm also predicted complexes that do not match current MIPS complexes. Sincethe current MIPS complex set is largely incomplete, these unmatched complexes could potentially bereal complexes for biologists to explore further. In fact, we tried to match the predicted complexes(LCMA with ω = 0) that do not match MIPS complexes with complexes in the BIND database—out of 609 predicted complexes that did not match with MIPS complexes, 335 matched 339 BINDcomplexes (http://www.blueprint.org). As such, many of our current false positives are probablytrue positives, and the actual precision of our method should be higher than reported here.

4 Conclusions

Protein interaction graph mining can be used to identify graphical subcomponents for predictingprotein complexes. In this paper, we have described an efficient and effective technique to mineprotein complexes from protein interaction graphs by local clique merging. Evaluations of our proposedalgorithm showed that it has the following advantages over other existing methods:

• Higher F-measure (recall and precision). By taking a bottom-up approach that considers the lo-cal cliques for each vertex in the interaction graph and then merging the overlapping local cliquesfor maximal dense neighborhoods, our proposed algorithm can discover much more protein com-plexes in protein interaction graph than other algorithms. Our algorithm also guarantees thatthe discovered complexes are dense graphs in the underlying interaction graph. This enhancesthe accuracy because highly connected graphs are more likely correspond to the complexes.

• More efficient runtime. Our algorithm is also more efficient than current approaches. The timecomplexity of our overall LCMA algorithm (two steps) is O(r2v) (r < |V | is the number oflocal cliques and v usually has a small value) while the complexity of MCODE algorithm isO(nmh3) [1] (n, m are the number of vertices and edges respectively, h is the vertex size of theaverage vertex neighborhood) and finding maximal cliques from graph is a NP-hard problem [9].The running time of our LCMA algorithm is around 1 minute (when ω = 0) on an Intel PentiumM 1.86GHz with 512MB RAM.

As mentioned earlier, identifying protein complexes is critical for biological knowledge discovery sincemany important biological processes in the cell are carried out through the formation of proteincomplexes. However, there is currently a wide data gap between protein complexes and protein-protein interactions—technologies for detecting pairwise protein-protein interactions en masse havealready become routine in the laboratories for generating large datasets of protein interaction data,while the technologies for detecting protein complexes still remained highly painstaking and costly. Inthis work, we have shown how we can exploit the abundant protein interaction data to bridge the datagap for protein complexes—as our predicted complexes were shown to match or overlap well with theknown protein complexes in MIPS benchmark database, the unmatched complexes could potentiallybe real complexes. Our method can thus be used to identify novel protein complexes from proteininteraction graphs to help biologists in their continuing search for new protein complexes.

References

[1] Bader, G. D. and Hogue, C. W. V., An automated method for finding molecular complexes inlarge protein interaction networks, BMC Bioinformatics, 4(1):2, 2003.

Page 10: Interaction Graph Mining for Protein Complexes Using Local Clique

Interaction Graph Mining for Protein Complexes 269

[2] Costanzo, M. C., Crawford, M. E., Hirschman, J. E., Kranz, J. E., et al., YPD, PombePDand WormPD: Model organism volumes of the BioKnowledge library, an integrated resource forprotein information, Nucleic Acids Res., 29(1):75–79, 2001.

[3] Drees, B. L., Sundin, B., Brazeau, E., Caviston, J. P., et al., A protein interaction map for cellpolarity development, J. Cell Biol., 154(3):549–571, 2001.

[4] Fromont-Racine, M., Mayes, A. E., Brunet-Simon, A., Rain, J. C., et al., Genome-wide proteininteraction screens reveal functional networks involving Sm-like proteins, Yeast, 17(2):95–110,2000.

[5] Gavin, A.-C., Bosche, M., Krause, R., Grandi, P., et al., Functional organization of the yeastproteome by systematic analysis of protein complexes, Nature, 415(6868):141–147, 2002.

[6] Hartuv, E. and Shamir, R., A clustering algorithm based on graph connectivity, InformationProcessing Letters, 76:175–181, 2000.

[7] Ho, Y., Gruhler, A., Heilbut, A., Bader, G. D., et al., Systematic identification of protein com-plexes in Saccharomyces cerevisiae by mass spectrometry, Nature, 415(6868):180–183, 2002.

[8] Ito, T., Chiba, T., Ozawa, R., Yoshida, M., et al., A comprehensive two-hybrid analysis to explorethe yeast protein interactome, Proc. Natl. Acad. Sci. USA, 98(8):4569–4574, 2001.

[9] Karp, R., Reducibility among combinatorial problems, Miller, R. and Thatcher, J. (eds.), Com-plexity of Computer Computations, 85–103. Plenum, 1972.

[10] King, A. D., Przulj, N., and Jurisica, I., Protein complex prediction via cost-based clustering,Bioinformatics, 20(17):3013–3020, 2004.

[11] Mewes, H. W., Frishman, D., Gruber, C., Geier, B., et al., MIPS: A database for genomes andprotein sequences, Nucleic Acids Res., 28(1):37–40, 2000.

[12] Ng, S.-K., Zhang, Z., and Tan, S.-H., Integrative approach for computationally inferring proteindomain interactions, Bioinformatics, 19(8):923–929, 2003.

[13] Saito, R., Suzuki, H., and Hayashizaki, Y., Interaction generality, a measurement to assess thereliability of a protein-protein interaction, Nucleic Acids Res., 30(5):1163–1168, 2002.

[14] Sear, R. P., Specific protein-protein binding in many-component mixtures of proteins, Phys. Biol.,1(2):53–60, 2004.

[15] Spirin, V. and Mirny, L. A., Protein complexes and functional modules in molecular networks,Proc. Natl. Acad. Sci. USA, 100(21):12123–12128, 2003.

[16] Tan, S.-H., Sung, W.-K., and Ng, S.-K., Discovering novel interacting motif pairs from largeprotein-protein interaction datasets, Proc. BIBE, IEEE Computer Society, 568–575, 2004.

[17] Tong, A. H. Y., and Drees, B., A combined experimental and computational strategy to defineprotein interaction networks for peptide recognition modules, Science, 295(5553):321–324, 2002.

[18] Uetz, P., Giot, L., Cagney, G., Mansfield, T. A., et al., A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae, Nature, 403(6770):623–627, 2000.

[19] van Dongen, S., Graph clustering by flow simulation, Ph.D. thesis, University of Utrecht, 2000.