JOURNAL OF LA Reachability analysis in probabilistic ...atodor/reachability_journal.pdf · JOURNAL OF LATEX CLASS FILES, VOL. 6, NO. 1, JANUARY 2007 1 Reachability analysis in probabilistic

JOURNAL OF LATEX CLASS FILES, VOL. 6, NO. 1, JANUARY 2007 1

Reachability analysis in probabilistic biologicalnetworks

Haitham Gabr, Andrei Todor, Alin Dobra and Tamer Kahveci

Abstract—Extra-cellular molecules trigger a response inside the cell by initiating a signal at special membrane receptors (i.e., sources),which is then transmitted to reporters (i.e., targets) through various chains of interactions among proteins. Understanding whether sucha signal can reach from membrane receptors to reporters is essential in studying the cell response to extra-cellular events. This problemis drastically complicated due to the unreliability of the interaction data. In this paper, we develop a novel method, called PReach(Probabilistic Reachability), that precisely computes the probability that a signal can reach from a given collection of receptors to agiven collection of reporters when the underlying signaling network is uncertain. This is a very difficult computational problem with noknown polynomial-time solution. PReach represents each uncertain interaction as a bi-variate polynomial. It transforms the reachabilityproblem to a polynomial multiplication problem. We introduce novel polynomial collapsing operators that associate polynomial termswith possible paths between sources and targets as well as the cuts that separate sources from targets. These operators significantlyshrink the number of polynomial terms and thus the running time. PReach has much better time complexity than the recent solutions forthis problem. Our experimental results on real data sets demonstrate that this improvement leads to orders of magnitude of reductionin the running time over the most recent methods.Availability: All the data sets used, the software implemented and the alignments found in this paper are available at http://bioinformatics.cise.ufl.edu/PReach/.

Index Terms—Signaling Networks, Probabilistic Networks, Reachability, network topology, degree distribution, random graphs

F

1 INTRODUCTION

Major discoveries in biology have revealed the complex-ity of living organism at the cell level. A multitude ofmolecules such as proteins or genes interact to sustainthe various functions served within cells. These interac-tions are often modeled as biological networks, wheremolecules are represented as nodes and interactions asedges. Among a diversity of biological networks, sig-naling networks describe how extra-cellular moleculestrigger a response inside the cell. This often takes placeby initiating a signal at special membrane proteins, (i.e.,receptors). This signal travels through chains of inter-actions among proteins, until it reaches a transcriptionfactor (i.e., reporter). The transcription factor then mate-rializes the cell response as increase or reduction of theproduction of other proteins. Understanding this mech-anism is of utmost importance since its malfunctioningis the cause of many disorders such as cancer, type 2diabetes, neurodegeneration, Alzheimer, congenital mal-formations, obesity and osteoporosis [33], [36], [41].

One of the first steps in studying signaling net-works is to establish their topologies. A common wayto discover the physical interaction data is to uselarge scale experiments like yeast-two-hybrid and co-immunoprecipitation assays [9], [14]. These techniquesproduce high throughput results but they also introducemany false positives. The data produced are therefore

• The authors are with the Department of Computer and Information Scienceand Engineering, University of Florida, Gainesville, FL, 32611.E-mail: hgabr,atodor,adobra,[email protected] work was supported partially by NSF under grant IIS-0845439.

uncertain. In other words an interaction may or may nottake place with some probability.

Another known form of uncertainty in signaling net-works is whether the underlying interactions hold underextreme conditions like severe trauma and burn injury.Research shows that stresses from such conditions af-fect more than 80% of cellular functions and pathways,resulting in a genomic storm [66]. Therefore even forcurated signaling networks, an uncertain model encom-passing such conditions is of utmost importance.

In the rest of the paper, we will use the term probabilis-tic network to denote a network with at least one uncer-tain reaction. To quantify the level of uncertainty, somerecent interaction databases like MINT and STRINGprovide confidence values for each interaction [57], [7].Various other methods have been proposed for rigorousassessment of the quality of the interaction data [60], [3].

To define external factors that lead to various kindsof alterations in cell functions, it is important to knowwhether a signal can reach from a receptor to a reporter,which is the problem considered in this paper. Forexample, the MAPK signaling network is responsible foroncogenesis; the tumor necrosis factor (TNF) is linkedto jun proto-oncogene (c-Jun) transcription factor andthe knowledge about this connection is important incancer and neurodegenerative diseases research [31]. Inwhat follows we define the computational problem weaddress here.

Problem statement. In this paper, we address the problemof finding whether a signal travels from a source (a given setof receptors) to the intended target (a given set of reporters)when each edge is subject to failure with a different probability,


independent of the state of other edges.We formally define a probabilistic signaling network as a

directed graph G = (V,E, P ), where V is the set of nodes(i. e. proteins), E is the set of edges (i.e. interactions) andP is a function that associates to each edge the prob-ability of the existence of the interaction it represents.We assume that each edge exists independently of allother edges. This assumption is commonly used in theliterature for similar problems [7], [57].

The problem tackled in this paper is #P-complete. Toprove this let us call the problem considered in thispaper, namely, s − t reachability in a network whereeach edge is subject to failure with a possibly differentprobability, problem A. Let us call the s− t reachabilityin a network where the edges are subject to failurewith the same probability p problem B. To show thatproblem A is #P-hard we reduce problem B to problemA. B is a well known #P-complete problem [5], [48].Clearly, if there were a polynomial time algorithm tosolve problem A we can use it to solve problem B: justset all the edge failure probabilities to the same givenvalue. Also, we can solve problem A in exponential timeby enumerating all deterministic instances and decidingfor each deterministic instance whether t is reachablefrom s in polynomial time. Therefore the problem is in#P and thus it is #P-complete.

One might be tempted to solve the problem by com-puting the probability of reaching from a source to atarget through a series of edges – a path – for each ofthe simple paths in the network, and then sum thoseprobabilities. For example, in Figure 1, this would leadto the solution p1p2 + p3p4 + p3p5p2. The problem withthis approach is that the events represented by eachof the terms of the sum are not mutually exclusive,so their probabilities cannot be simply added. A moresophisticated formula would be p1p2q3q4q5+p3p4q1q2q5+p3p5p2q1q4, making the three events corresponding toeach path mutually exclusive. The problem now is thatthe formula does not cover multiple paths being presenttogether. Adding all possible arrangements of the threepaths leads to the inclusion-exclusion formula, a methodagainst which we compare our method.

One way to appreciate the difficulty of the problemis to realize that a probabilistic network represents alarge number of possible configurations of deterministicnetworks, depending on the presence or absence of eachprobabilistic edge. More precisely, a network with nprobabilistic edges yield 2n possible network configu-rations, as each one of the n edges may be present orabsent. For instance, the probabilistic network in Figure 1yields 32 alternative topologies (two of them shownat the bottom of Figure 1). An exact solution for thedescribed problem would be to naively consider all ofthe 2n possible cases. Also attesting to the difficulty ofthe problem is the fact that many approximate solutionshave been proposed. However, as we discuss later inmore detail, all these methods are either inaccurateor they fail to scale beyond very small networks. We

elaborate on the existing literature in Section 2.Contributions. In this paper, we develop a novel

method, called PReach (Probabilistic Reachability), thatsolves the problem described above exactly. PReach isinspired by the probability generating functions. In arecent study, we showed that these functions can beemployed for aligning probabilistic networks [58]. Weassociate a polynomial function to each edge in the givennetwork, defined by the edge probability. We call thesepolynomials the edge polynomials. We prove that thesolution to the problem is given by a subset of theterms of the polynomial obtained as the multiplicationof all the edge polynomials. We avoid excessive growthof the number of terms of the resulting polynomialby collapsing combinations of terms that we alreadyknow lead to feasible configurations, and those thatcannot lead to feasible configurations, into a single termeach. We also show that the order in which the edgepolynomials are multiplied dictates how soon we cancollapse the polynomial terms. Thus, the order affects thespeed at which the final polynomial grows. We developa strategy to order the edge polynomials to maximizethe number of collapsed terms (thus to minimize thepolynomial size).

Our experimental results on real data sets demonstrateup to 10,000 times improvement in running time over thecompeting method when it could run till completion.PReach scales up to larger networks easily. On suchnetworks, it returned the results in minutes when thecompeting method failed to complete. We also performbiological validation experiments for our method. Ourvalidation experiments show that the highly reachableproteins in the examined networks have important rolesin biological functions like cytoneme morphogenesis andmitotic nuclear pore complex reassembly. Our methodcan easily be adapted to undirected networks by replac-ing each undirected edge with two edges in oppositedirections.

The rest of the paper is organized as follows. Section 2discusses relevant background. Section 3 describes ourmethod in its mathematical and algorithmic formulation.Section 4 presents the experimental evaluation of PReachand compares it with a recent existing method. Section 5concludes the paper with a brief discussion.

2 BACKGROUND AND RELATED WORK

In the literature, the reachability problem in probabilis-tic networks has been studied mostly in the contextof network reliability [24], [1], [63], [35]. The commondrawback of all the available methods is that they donot scale well with growing network size. Attempts havebeen made in the literature to compute the reachabil-ity probability based on formulas applied to series orparallel configuration of edges [40], [39], [42]. For otherconfigurations a general formula based on conditioningproposed by Mine [38] can be used. By this formula, the


c

d

c

d

c

d

a

b

a

b

a

b

e3

e1 e2p1 p2

p3e4p4

p5e5

e3 e5e4

q1q2p3p4p5

e3

e2

q1p2p3q4q5

Fig. 1. A probabilistic network (top) and two of the deterministic networks

corresponding to it (bottom). Each of the deterministic networks is obtained from

the probabilistic network with some probability determined by the probabilities of

the edges that are included or not in the deterministic network. The expression

above each deterministic network is the probability of observing it.

reachability probability can be computed by condition-ing on the state of one of the edges, thereby factoringout that edge.

A major class of existing methods is based on enu-merating paths or cuts [49], [19], sometimes combinedwith the inclusion-exclusion method [63], [35]. Lucet [35]presents a method for enumerating paths based on theadjacency matrix. Shen [52] presents an algorithm forenumerating all paths and cuts of a planar graph. Raiand Aggarwal [50] describe a more efficient methodbased on the system success determinant. Okada [43]and Mine [38] have developed a method for using theincidence and circuit matrices of a graph to find all pathsbetween a specified pair of nodes in order to computethe reachability probability between those nodes. Wingand Kim [65] studied the properties of the matrix of allpossible paths between nodes s and t in a connectedand non-oriented graph G. This matrix is called the pathmatrix. It has one row for each path between nodess and t and a column corresponding to each edge inG. The matrix element is 1 if the edge correspondingto the column is part of the path corresponding tothe row. They have shown that the rank of the pathmatrix is |E| − |V | + 2 and presented an algorithm forfinding a sub-matrix of the path matrix whose rows arelinearly independent. Fu and Yau [12] have presentedthe formula for the s-t reachability probability based onthe inclusion-exclusion principle, using the path matrixand have also developed a technique for enumerating allpaths. This technique combines edges connected in seriesor in parallel and also eliminates a non-terminal nodeamong terminal nodes. This transformation is based ona previous technique by Hohn and Schissler [22]. Intheir work, the authors present several techniques for thereduction of combinational circuits based on connectionmatrices. Brown [5] introduced an algorithm based onthe binary representation of the transmission function of

the circuit, which also enumerates all paths.For computing all-terminal reachability special tech-

niques have been proposed. Wing and Demetriou [64]proposed a method which examines all 2|E| − 1 non-empty sub-graphs of a graph and accumulates the prob-abilities of the connected sub-graphs. An approximatemethod was developed by Fu [11]. Other methodsfor K-terminal reliability are those based on spanningtrees [25], [47].

Wilkow [63] notes that there are at most 2|E|−|V |+2

possible paths in a graph G = (V,E), therefore the num-ber of paths grows rapidly in dense networks. Also, thenumber of minimal cuts is O(2|E|−2), implying that fordense networks working with cuts instead of networksis more advantageous. He presents a general formulafor computing failure probability based on cuts. A moreefficient algorithm was developed in [18], [17], whichcalculates a minimal set of mutually exclusive eventscontaining all minimal cuts. The probability of failure issimply accumulated from the probabilities of each of themutually exclusive events. The discovery of the minimalcuts is based on [62]. Several other methods have beenproposed for computing all minimal cuts [34], [37], [54],[59], [26], [20], [67].

One of the most recent solutions (and possibly themost relevant one from the point of this paper) that fallinto this category is used by a method called SPINE [45].This method aims to infer signaling and regulatory path-ways. As a part of this goal, it computes the reachabilityprobability between pairs of nodes. SPINE uses theinclusion-exclusion method to compute the reachabilityprobability without error in order to reconstruct signal-ing pathways. SPINE assigns an activation/inhibitionattribute to each protein in such a way that the networkcan satisfy the maximum number of observed geneknockouts. For a given knockout pair, it first calculatesthe probability that the target node is affected by theknockout gene using the inclusion-exclusion principle.After finding the reachability probability, it provides alinear programming formulation to assign the attributesto proteins. In this paper, we will compare the per-formance of PReach with the approach used in SPINEfor calculating the reachability probability, since bothmethods compute this probability precisely. The timecomplexity of this method is O(2N ) with N being thenumber of all simple paths from the source to the target.As we present in our experimental results (see Section 4) thismethod does not work for many real networks even for thosewith less than 40 interactions.

Due to the computational difficulty of the problem,some methods resort to Monte Carlo approximationsrather than computing precise probabilities [30], [10], [8],[32]. The disadvantage of these methods are (i) theydo not produce an exact result and (ii) they need toomany iterations to obtain a low error margin. Someof them focus on how to deploy a sampling policy orplan to reduce the sampling error. Zhu et al. [69] triedto overcome the need for a large number of samples


by proposing a dynamic Monte Carlo simulation. Thismethods uses pruning based on some effective boundingtechniques and if the pruning fails, it uses Monte Carlosimulation. To control the dramatic increase in the num-ber of samples and the number of iterations required toguarantee the accuracy, the dynamic method will onlysimulate necessary parts of the graph and will sharemost of the overlapping cost between different iterations.Similar to our method, the authors observe that whensome edges are missing, the presence of some otheredges are no longer relevant, which reduces the samplespace as edges are considered one by one.

A related, yet slightly different problem is the con-nectedness reliability problem. This problem seeks theprobability that a network is connected, under the as-sumption that each edge is subject to failure with thesame probability p. A version of this problem is the s-tconnectedness reliability, in which one seeks to find theprobability that a target node t is reachable from a sourcenode s. Both versions are known to be #P-complete [48],[6]. The time complexity of connectedness reliability wasshown to be exponential in Ω(|E|/ log2 |E|) [23].

Probabilistic networks have been analyzed in the pastvia polynomial manipulation. The reliability polynomial,a particularization of the Tutte polynomial, has receivedwide attention [46], [5]. This model, however, is limitedto a single parameter, representing the probability thatany edge in the network fails. Thus it is limited todeal only with global properties of the network, suchas whether it is connected or not. It cannot encodemore specific information, such as s-t connectivity. Themultivariate Tutte polynomial was proposed as a gener-alization for the case in which each edge has a differentfailure probability, but again is limited to all-nodes re-liability [56]. In the field of biological networks, someof the authors have proposed a detailed, comprehensivemodel based on probability generating functions forprobabilistic alignment [58].

3 METHOD

In this section we describe our method. First, we buildthe mathematical foundation of our method (Section 3.1).We then establish the relationship between this theoryand probabilistic networks (Section 3.2). We then use thisrelationship to model the reachability problem in prob-abilistic networks (Section 3.3) Finally, we present thePReach algorithm that solves the reachability problemon probabilistic networks (Section 3.4).

3.1 A special class of polynomialsIn this section, we develop an expressive polynomialclass, called the xy-polynomial, and the basic algebrathat operates on it. As we will demonstrate in Sec-tions 3.2 and 3.4, these polynomials naturally describethe topologies of all the deterministic instances of a givenprobabilistic network. Thus, the reachability problem ina given probabilistic network will turn into a simple

evaluation of this polynomial. We start by introducingthe notation that will help present our method.

Consider two sets of n variables X = x1, . . . , xnand Y = y1, . . . , yn. Also consider the set of indicesU = 1, . . . , n. Each subset S of U determines a subsetof X and a subset of Y : xi|i ∈ S and yi|i ∈ S,respectively. We denote the product of the variables inthese two subsets with xS =

∏i∈S xi and yS =

∏i∈S yi.

Next, we introduce a class of polynomials, called the xy-polynomials, that play an important role in our method.

Definition 1. Let Θ be a subset of U . Let S1, . . . , Sk be kdifferent subsets of Θ, where k is a positive integer. Let x∗

and y∗ be two free variables. Let a1, . . . , ak be real numbers.Similarly, let b and c be two real numbers. An xy-polynomialover Θ is a polynomial of the form

F =

k∑i=1

aixSiyΘ\Si + bx∗ + cy∗ (1)

Notice that in the above summation, each term —except for the free variables — contains each of theindices j ∈ Θ either as a product term xj or yj . Thefollowing example clarifies this.

Example 1. Assume that Θ = 1, 2, 3, 4. This implies thatx1x3y2y4 + 2y1y2y3y4 + 3y∗ is an xy-polynomial. However,the two polynomials x3x4 and x1x2x3x4y4 + 4x∗ + y∗ arenot valid xy-polynomials. This is because the former one hasa term that is missing the variables with indices 1 and 2 (i.e.,x3x4 has indices 3 and 4 only). The latter one has a termthat has more than one variable with the same index (i.e.,x1x2x3x4y4 contains both variables x4 and y4).

Let Θ be a subset of U . Let P(Θ) be the power setof Θ. Consider a set S ∈ P(Θ). Consider a predicate Rdefined on P(Θ). Next, we define a set indicator functionχ : P(Θ) → 0, 1, such that χ(S) = 1 if S satisfies Rand χ(S) = 0 otherwise. We define an operator, calledthe x-collapsing operator, with respect to a set indicatorfunction χ. This operator is applied to an xy-polynomialF . It replaces all the terms in F of the form axSyΘ\Swith ax∗ if χ(S) = 1.

The y-collapsing operator is defined analogously. Wewill denote the set indicator function for the y-collapsingoperator with ω : P(Θ)→ 0, 1. The following definitionpresents these concepts formally.

Definition 2. (COLLAPSING OPERATOR) The x-collapsingoperator, applied to an xy-polynomial over Θ,F =

∑ki=1 aixSiyΘ\Si +bx∗ + cy∗, with respect to set

indicator function χ is defined as

ρxχ(F ) =

k∑i=1

aixSiyΘ\Si(1− χ(Si))+(b+

k∑i=1

aiχ(Si)

)x∗ + cy∗

(2)


We call the result of the collapsing operator the col-lapsed polynomial. Notice that the xy-polynomials areclosed under the x- and y-collapsing operations.

We observe a special pair of set indicator functions(χ, ω) when χ(S)ω(Θ \ S) = 0 for all S ∈ P(Θ). The x-and y-collapsing operators using such indicators χ andω never collapse the same term since we never haveχ(S) = 1 and ω(S) = 1 for the same set S. We alsoassume that χ and ω also satisfy the property that ifthey evaluate to 1 for a set S, then they evaluate to 1for any superset of S. The collapsing operators ρxχ() andρyω() satisfying this property commute with each other.In other words, ρyω(ρxχ(F )) = ρxχ(ρyω(F )). We can combinethe successive application of two operators into a singleone, the collapsing operator. We denote the resultingoperator with the notation ρχ,ω(F ) = ρyω(ρxχ(F )).

Example 2. Let χ(S) = 1 if 2 ∈ S and 0 otherwise. Letω(S) = 1 if 1, 2 ⊂ S and 0 otherwise.ρχ,ω(3x1x2x3y4 +2x1x3y2y4 +4x3x4y1y2 +5y∗) = 3x∗+

2x1x3y2y4 + 9y∗. This is because the first term contains x2,so it collapses to 3x∗. Also the third term contains y1 and y2,so it collapses to 4y∗.

We define the multiplication of xy-polynomials as reg-ular polynomial multiplication with a minor difference;any variable multiplied with x∗ or y∗ is absorbed intox∗ or y∗ respectively. In other words, for any S, T ⊆ U ,xSyTx

∗ = x∗ and xSyT y∗ = y∗. In particular, x∗x∗ =

x∗, y∗y∗ = y∗. We formally define the multiplication oftwo xy-polynomials below.

Definition 3. (XY-PRODUCT) Given U = 1, . . . , n, letΘ1 and Θ2 be two disjoint subsets of U . Let S1, . . . , Sk besubsets of Θ1, and T1, . . . , Tr be subsets of Θ2. Consider twoxy-polynomials, F1 over Θ1 and F2 over Θ2, such that F1 =∑ki=1 αixSiyΘ1\Si + b1x

∗+ c1y∗, F2 =

∑ri=1 βixTiyΘ2\Ti +

b2x∗ + c2y

∗. We define the xy-product F1 F2 as

F1 F2 =k∑i=1

r∑j=1

αiβjxSixTjyΘ1\SiyΘ2\Tj

+

(b2

k∑i=1

αi + b1

r∑i=1

βi + b1b2

)x∗

+

(c2

k∑i=1

αi + c1

r∑i=1

βi + c1c2

)y∗

+(b1c2 + c1b2)x∗y∗

(3)

Example 3. (2x1y2 + 2y1y2 + x∗) (x4y3 + 3x∗ + y∗) =2x1x4y2y3 +6x∗+2y∗+2x4y1y2y3 +6x∗+2y∗+x∗+3x∗+x ∗ y∗ = 2x1x4y2y3 + 2x4y1y2y3 + 16x∗ + 4y∗ + x∗y∗

The following theorem establishes a very useful resulton the interplay between x-y polynomial multiplicationand collapsation.

Theorem 1. If F1 and F2 are xy-polynomials and χ and ωare two set indicator functions, then

ρχ,ω(F1 F2) = ρχ,ω(ρχ,ω(F1) ρχ,ω(F2)) (4)

Proof: this, Suppose that a term xS is collapsedin F1 F2 on the right hand side of the equality inTheorem 1. Some of the indices in S come from F1 andothers from F2. Let S1 and S2 denote the indices in Sappearing in F1 and F2 respectively (i.e., S = S1 ∪ S2

and S1 ∩ S2 = ∅). We have xS = xS1xS2 . Suppose thatχ(S1) = χ(S2) = 0 and χ(S) = 1. Then on the righthand side neither xS1

nor xS2are collapsed in F1 and

F2, respectively, so another application of the collapsingoperator is necessary on the xy-product.

3.2 Probabilistic networks as xy-polynomialsSo far we have defined the xy-polynomials and the keyoperators on them. In this section, we explain how wemodel probabilistic networks and the reachability prob-lem considered in this paper using these polynomials.

Let G = (V,E, P ) be a probabilistic network with E =e1, . . . , en. For each edge ei ∈ E, let pi = P (ei) be theprobability that ei exists and qi = 1 − pi be that it isabsent. We start by transforming a single edge of G intoa polynomial.

Definition 4. (EDGE POLYNOMIAL) We define the edgepolynomial for each edge ei ∈ E as the first degree polyno-mial of two variables xi and yi,

Fi(xi, yi) = pixi + qiyi. (5)

Consider a set of edges E′ ⊂ E. We define theproduct of the edge polynomials of the edges in E′ asthe edge aggregation polynomial. The following definitionintroduces this formally.

Definition 5. (EDGE AGGREGATION) The edge aggrega-tion polynomial of the set of edges E′ is the polynomial

F =∏ei∈E′

Fi(xi, yi). (6)

Following from Definition 5, the edge aggregationpolynomial for the given set of edges E′ is

F =∑E⊂E′

∏ei∈E

pixi∏

ej∈E′\E

qjyj (7)

Example 4. The edge aggregation polynomial of the sete2, e3 in Figure 1 is F (x2, x3, y2, y3) = (p2x2 +q2y2)(p3x3 + q3y3) = p2p3x2x3 + p2q3x2y3 + p3q2x3y2 +q2q3y2y3.

Notice that the edge aggregation polynomial definedabove is an xy-polynomial with b = c = 0 in Definition 1.An edge aggregation polynomial has two importantproperties. First, each term contains the product of theexistence probability (i.e., pi) for a subset of edges ofE′ and the absence probability (i.e., qi) for all the re-maining edges in E′. Thus, each term’s coefficient is theprobability to observe a specific network instance whereonly the edges corresponding to the xi variables exist. Inother words, a term precisely models a subnetwork. Forinstance in Example 4, p2q3x2y3 models the subnetwork


where e2 is present and e3 is absent. The coefficient p2q3

is the probability that e2 is present and e3 is absent. Sec-ond, each possible subnetwork (i.e., all subsets of edgesin E′) appears as exactly one term in the aggregationpolynomial. For instance in Example 4, the two edgesin E′ yield four possible subnetworks (i.e., each edge isabsent or present). The edge aggregation polynomial hasfour terms, each modeling one of these subnetworks.

When E′ = E we call the resulting polynomial thegraph polynomial. The graph polynomial is an exhaus-tive representation of all the deterministic instances thatcould be generated by the probabilistic network model,along with their probability of realization. For instance,consider the probabilistic network in Figure 1. In itsgraph polynomial, the term that contains x3x4x5y1y2

corresponds to the network instance that only containsedges e3, e4 and e5 (bottom-left network in Figure 1).The coefficient of each term is the probability that thedeterministic network corresponding to that term is re-alized. For example, the probability of observing the de-terministic network shown on the bottom-left of Figure 1is p3p4p5q1q2. Using the possible worlds semantics [51],the probability that the network satisfies any predicatecan be computed by summing the probabilities of thepossible deterministic worlds in which the network sat-isfies the predicate. This is equal to the sum of thecoefficients of the polynomial terms corresponding to thedeterministic networks that satisfy the given predicate.

3.3 Modeling reachability in probabilistic networks

We are now ready to describe how we use xy-polynomials to find the probability that a given targetnode t can be reached from a given source node s. Toachieve this, we need a predicate that can characterizethe event that t can be reached from s in a deterministicnetwork instance. We do it by considering a specialdeterministic instance of the network G = (V,E, P ), inwhich all edges in E exist with certainty. We denotethis network with G = (V,E). We compute two piecesof information from G. The first one is the set of allpossible simple paths from s to t. The second one is theset of all possible minimal s-t cuts. We define a simplepath as a sequence of edges with no repeated nodesalong the edges. An s-t cut is a subset of edges of thegiven network such that their removal from the networkdisconnects s from t. We say that an s-t cut is minimalif none if its subsets is an s-t cut. We denote the setof paths and the set of cuts with Ψ = π1, . . . , πk andΦ = κ1, . . . , κl respectively. For example in Figure 1,e1, e2 is a path and e1, e3 is a minimal s-t cut. Inthis example, e3, e4 is not an s-t cut because t isstill reachable from s after removing e3 and e4. Also,e1, e3, e4 is not a minimal s-t cut because it has a subset(i.e., e1, e3) that is also an s-t cut.

We are now ready to present our main result, which isthe mathematical foundation of the PReach algorithm. Itshows that, given a probabilistic network and a start and

a target node, by applying the collapsing operator, wecan reduce the graph polynomial to only two terms; oneof them containing only the variable x∗ and the otheronly y∗. It uses Ψ and Φ to construct the x-collapsingand y-collapsing operators. The coefficient of x∗ is theprobability that the target node is reachable from thestart node and the coefficient of y∗ is the complement ofthe coefficient of x∗.

Theorem 2. Let G = (V,E, P ) be a probabilistic networkand F be the graph polynomial for G. Let s, t ∈ V be twonodes of G. Let Ψ be the set of all simple paths from s to tand Φ the set of all minimal s-t cuts in G. For all π ∈ Ψand κ ∈ Φ, let Ind(π) and Ind(κ) be the set of indices of theedges in π and κ respectively. Define set indicator variableχ(S)=1 if ∃π ∈ Ψ such that Ind(π) ⊆ S and 0 otherwise.Similarly, define ω(S)=1 if ∃κ ∈ Φ such that Ind(κ) ⊆ S and0 otherwise. If Ψ 6= ∅, then the collapsed graph polynomialwith respect to χ and ω is ρχ,ω(F ) = γx∗+ (1−γ)y∗, whereγ is the probability that t is reachable from s.

Proof: We will prove the theorem in four steps: i)there is no term in the collapsed graph polynomial thatcontains x∗y∗, ii) there is no term that contains xi or yi,for any i = 1, . . . , n, iii) the coefficients of the two termsthat contain x∗ and y∗ add to one. Note that, becauseof the way collapsing is defined, there can be no termscontaining both xi or yi on one hand, and x∗ or y∗ onthe other hand, so i) to iii) exhaust all possibilities ofvariable combinations. iv) The coefficient of x∗ is theprobability that t is reachable from s.i) Note first that x∗ stands for those deterministic

network configurations that contain all the edges alongsome path, and y∗ stands for those configurations thatcontain some cut, that is, do not contain any of the edgesneeded for a path to exist from s to t. Suppose there is aterm that contains both x∗ and y∗. Because it contains x∗,there is a configuration that contains some path from sto t. On the other hand, because it contains y∗ , the samenetwork configuration does not contain any of the edgesneeded to be present for a path to exist from s to t – acontradiction.ii) Suppose that there is a term that contains some xi.

This means it does not contain x∗ and it does not containy∗, so there is a configuration in which there is no pathfrom s to t, and no cut with respect to s and t. Thiscould happen only if in the probabilistic network thereis no probabilistic path from s to t, which contradicts thehypothesis.iii) Recall that the sum of all the coefficients in the

graph polynomial is 1. At the same time, From Def-inition 2 we see that the collapsing operator does notchange the sum of the coefficients of their operands.iv) The collapsing operator replaces some products of

variables in the graph polynomial by a single variable x∗.Because of the definition of χ and ω, those products ofvariables correspond to network configurations in whichthere is a path from s to t. The sum of the coefficientsof these terms is the probability that t is reachable from


s.

3.4 PReach: the algorithmIn Section 3.3, we showed that we can compute thereachability probability in a probabilistic network bycollapsing its graph polynomial with an appropriatecollapsing operator ρχ,ω(). The central hurdle in doingthis is to compute the graph polynomial. This is becauseit requires multiplying all edge polynomials of the givenprobabilistic network. Given that each edge polynomialhas two unique variables, such multiplication leads to agrowth of the number of terms exponential in the num-ber of edge polynomials. For instance, the probabilisticnetwork in Figure 1 has five edges. Therefore, the prod-uct of the corresponding five edge polynomials yield25 = 32 terms. The complexity of computing the reacha-bility probability is dominated by the number of termsin the graph polynomial. In this section, we describe thePReach algorithm, which computes the reachability fromthe graph polynomial while dramatically reducing thenumber of its terms. PReach uses the collapsing operatorcarefully to reduce the size of the graph polynomialwhile multiplying edge polynomials. First, we presentTheorem 3, which lays the foundation of the PReachalgorithm.

Theorem 3. Let G = (V,E, P ) be a probabilistic network.Let E1 ⊂ E and E2 ⊂ E be two disjoint sets of edges of G.Let F1 and F2 be the edge aggregation polynomials of E1 andE2, respectively. Let χ and ω be two set indicator functions.The collapsed edge aggregation polynomial with respect to χand ω corresponding to E1 ∪ E2 is

ρχ,ω(F1F2) = ρχ,ω(ρχ,ω(F1) ρχ,ω(F2)) (8)

Proof: Follows immediately from Theorem 1, since(i) the edge aggregation polynomial of a union of twodisjoint sets of edges is the xy product of the edgeaggregation polynomials of the two sets, (ii) F1 and F2

do not contain x∗ or y∗ terms, so F1F2 = F1 F2.Following Theorem 3, our method first finds the set of

all simple paths Ψ and all minimal s− t cuts Φ betweenthe given start and target nodes s and t. To find allthe paths, we use a simple backtracking algorithm thatexplores the edges in depth-first order and records thecurrent path as the target is reached. To find all cuts, westart from the source node. All outgoing edges from thisnode form a cut. In a recursive manner, we consider allsubsets of these edges and eliminate them from the cut,at the same time adding the outgoing edges from thenodes reached by the subset of edges that was added.

We build the indicator functions χ and ω from thesepaths and cuts respectively as described in Theorem 2. Inother words, χ() returns 1 only if the underlying subsetof edges contains at least one path from Ψ. Similarly,ω() returns 1 only if the underlying subset of edgescontains at least one s-t cut from Φ. Then we compute thegraph polynomial by multiplying the edge polynomialsand reduce the size of the resulting polynomials using

TABLE 1Computation of reachability probability from node a to node c in the probabilistic

network shown in Figure 1. Column denoted by EP shows the number of terms in

the edge aggregation polynomial, and the column denoted by CP shows the

number of terms in the corresponding collapsed polynomial.

Edge aggregation Collapsed polynomial #Termspolynomial CP EP

F1 0.5x1 + 0.5y1 2 2F1F2 0.52(x1y2 + x2y1 + y1y2 + x∗) 4 4

F1F2F30.53(x1x3y2 + x2x3y1 + x3y1y2)+ 0.25x∗ + 0.375y∗

5 8

F1F2F3F4 0.54x2x3y1y4 + 0.4375x∗+ 0.5y∗ 3 16F1F2F3F4F5 0.46875x∗ + 0.53125y∗ 2 32

the collapsing operator one edge at a time. Notice thatTheorem 3 ensures that we can avoid computing theentire uncollapsed graph polynomial by collapsing theproduct polynomial after each xy-multiplication as soonas a term contains a path or a cut.

Example 5. Let us apply the PReach algorithm to thenetwork in Figure 1, where for simplicity we assume thatall probabilities are 0.5. We will compute the probability thatnode c is reachable from node a. The set of all simple pathsbetween the two nodes is e1, e2, e3, e4, e2, e3, e5.The set of all minimal s − t cuts is e1, e3, e2, e3,e2, e4, e1, e4, e5. In Table 1 we present the collapsedgraph polynomial after adding each edge.

As we discussed earlier, after multiplying k edge polyno-mials, without collapsation the edge aggregation polynomialhas 2k terms. Notice however that the collapsed polynomialremains very small throughout this example. More specifically,we obtain the first free variable x∗ in the product F1F2 sinceedges e1 and e2 form a path from source (i.e., a) to target (i.e.,c) in Figure 1. At this step both the collapsed and uncollapsededge aggregation polynomials have four terms. However, whenwe multiply the result with the next edge polynomial F3, thefree variable absorbs all the terms in F3. Furthermore, thisproduct creates the next free variable y∗ since edges e1 ande3 form a minimal s − t cut. As a result, the collapsed edgeaggregation polynomial has nearly half as many terms as theuncollapsed one.

As we multiply the next edge polynomial F4, both x∗ andy∗ absorb the new terms. Also edge e4 enables collapsing someof the existing terms in either x∗ or y∗ by forming new pathsor cuts on top of the existing ones. As a result, the collapsedpolynomial has less than a quarter terms of the uncollapsedone.

Finally, after multiplying the last edge polynomial F5, ourmethod collapses even more terms since including e5 intro-duces new paths and cuts. We end up with only two termswhile the graph polynomial has 32 terms. The coefficient ofx∗ (i.e., 0.46875) in this polynomial is the reachability prob-ability we are aim to find. The coefficient of y∗ is simply 1minus the coefficient of x∗.

Further improvements in the algorithm. Collapsingthe terms of the edge aggregation polynomial reducesthe computational cost of the PReach algorithm greatly.Here, we make two optimizations that will cut down this


cost further by preprocessing the given network quickly.The first one will shrink the size of the input networkby filtering and merging edges whenever possible. Thesecond one will order the edge polynomials that gointo the polynomial multiplication to maintain a smallnumber of polynomial terms.

CONTRACT THE NETWORK. PReach requires a polyno-mial multiplication for each edge in the network. Thus,fewer edges implies fewer polynomial multiplications.We reduce the size of the input network in the prepro-cessing stage in three steps. The first step is the removalof nodes that are irrelevant to the problem, along withtheir edges. A node is irrelevant if (i) it is not a sourcenode and it has no incoming edges (i.e., is unreachablefrom a source node), or (ii) it is not a target node andit has no outgoing edges (i.e., no target node can bereached from it). The second step replaces a path thatconsists of a sequence of nodes that do not have anyother incoming or outgoing edges by a single edge.The probability of the new edge is the product of theprobabilities of the edges on the path it replaces. Afterthe second step, it is possible to have multiple edgeswith the same direction between pairs of nodes. Thethird step replaces multiple edges with the same startand end nodes by a single edge. The probability of thenew edge is the probability that at least one of the initialedges is present. More specifically, if the initial edgeshave probabilities p1, . . . , pk, the probability of the newedge is 1−(1−p1) . . . (1−pk). We repeat these three stepsuntil no further reduction is possible.

ORDER THE EDGES. The collapsing operator shrinksthe polynomial. Notice that for collapsing operator towork, the set indicator functions should evaluate toone. Our second improvement follows from these ob-servations. By carefully deciding the order in which theedge polynomials get multiplied, we can force the setindicator function to evaluate to one early in the poly-nomial multiplication. We have the following heuristicfor choosing the edge ordering. Consider each path orcut as the set of edges they contain. We order all such setsby their sizes in increasing order. We first multiply theedge polynomials in the smallest set. When we multiplyan edge polynomial, we remove that edge from all theremaining sets. We then repeatedly pick the next smallestset until we multiply all edge polynomials. This greedyapproach ensures that we collapse the graph polynomialas soon as possible. We observed experimentally that therunning time is improved at least by a factor of twoin most networks using the heuristic over random edgeorder.Time complexity of PReach. Let us denote the numberof edges in the probabilistic network before and aftercontracting the network with n and n′ (with n′ ≤ n). Atrivial upper bound for the time complexity of PReachis O(2n

′) as it requires multiplying n′ edge polynomi-

als. The actual time complexity is however significantlyless than this as PReach collapses polynomials, and itdepends on the network topology.

Let Ek be the set of first k edges whose polynomialsare multiplied. Let Ψk and Φk be the set of paths and cutsin Ek (i.e., Ψk = πi|πi ∈ Ψ, πi ⊆ Ek and Φk = κi|κi ∈Ψ, κi ⊆ Ek). The exact time complexity of aggregatingthe polynomials for the edges in Ek is

2k +∑C⊆Ψk

(−1)|C|2k−|∪πi∈Cπi| +∑C⊆Φk

(−1)|C|2k−|∪κi∈Cκi|.

(9)The formula follows by deriving the number of termscollapsed by each subset of paths and cuts. If there isno collapsing possible, then the polynomial will have 2k

terms. If there is a path πi, we save 2|k−πi| terms. If thereare more paths, we subtract the terms that we save forthe union of the paths. Applying the inclusion-exclusionprinciple, we subtract the terms that we save with eachpath, but we have to add back the terms that are com-mon for two paths, because we have subtracted themtwice, once for each path, then subtract paths commonto three paths, and so forth. The similar reasoning holdsfor the cuts.

To compute the complexity of finding all paths andcuts, let d be the maximum out-degree of a node in thecontracted network. Let σΨ and σΦ be the maximum sizeof a path and a cut in the contracted network respec-tively. The complexity of finding all paths is O(dσΨ). Thecomplexity of finding all minimal cuts is O(dσΦ).

As we demonstrate in the next section, PReach’s run-ning time is dominated by polynomial multiplicationand collapsing. We also show that PReach is many ordersof magnitude faster than the existing precise methodsand the gap between them increases with increasingnetwork size.

4 RESULTS

This section evaluates the performance of PReach.We experimented with signaling networks taken fromKEGG [29]. We have used all the signaling networks ofHomo sapiens (H. sapiens) from this database. We reportresults only for networks with more than 13 edges asthe problem is trivial for small networks. Table 2 liststhe details of the resulting data set. To experiment withnetworks of growing sizes we also combined subsets ofthese signaling networks. We have obtained the sourceand targets of each signaling network based on the hier-archical organization of the genes [15]. We set the genesat the top of the hierarchy as the source nodes and theones at the bottom as the target. We have extracted theconfidence scores for each interaction from STRING [57]and used them as edge probabilities. STRING has con-fidence values in the [1,1000] interval. We divided thisnumber by 1000 to obtain a number in the [0,1] intervalfor each interaction.

We compared PReach to the inclusion-exclusion algo-rithm of Ourfali et al. [45]. Although this method doesmore than solving the reachability problem, we compareagainst the inclusion-exclusion approach introduced init as the most recent exact solution available for the


TABLE 2Sizes of the signaling networks used in our experiments

Network nodes edges sources targetsWnt 56 65 13 13ErbB 46 55 10 17

MAPK 117 165 18 19Hedgehog 16 14 7 3Jak-STAT 22 27 2 9

mTOR 26 30 2 6TGF-beta 33 35 10 5

problem. The inclusion-exclusion algorithm did not scaleto networks with more than 40 paths. We thereforeimproved its running time dramatically by implement-ing our network contraction method (see Section 3.4)as a preprocessing step to it. In all our experiments,we compare against the improved implementation ofOurfali et al. [45]

We implemented both PReach and the inclusion-exclusion method in C++. We ran our experiments ona multiprocessor computer with 48 cores and 256 GBRAM and measured the running times of both methods.

4.1 Running time evaluationBoth PReach and inclusion-exclusion return the same re-sult when they run till completion as they both computethe reachability probability precisely. We first evaluatethe running times.

In this experiment, we compute the probability that atleast one target is reachable from a source. To do that,we created a hypothetical source node s that has an out-going edge to all the actual source nodes. Similarly, wecreated an artificial target node t with incoming edgesfrom all actual target nodes. We set the probabilitiesof all artificially created edges to 1. We then computedthe reachability probability from s to t. We considerednetworks of H. Sapiens obtained by combining differentKEGG pathways. Table 3 presents the results. The time topreprocess and contract the network as well as the timefor finding all paths was negligible (milliseconds), so wedo not report them. Results demonstrate that PReach isorders of magnitude faster than the competing method.Only on the two networks with the least number of pathsis the inclusion-exclusion method faster than PReach.The inclusion-exclusion method is exponential in thenumber of paths from source to destination. As a result,this method quickly becomes unfeasible as the numberof paths grows. In [45], Ourfali et al. were able to reportresults on the entire S. cerevisiae network by limiting thelength of their paths to three edges, and thus limiting thenumber of paths. This however leads to incorrect resultsif there are paths longer than three edges. When we ranthis method after limiting the path length to three, infive out of six experiments, it found incorrect results. Thepercentage of error was enormous; it varied between 6to 100%. On the other hand, paths and cuts help PReachcollapse the polynomial terms and cut down the run-ning time dramatically. Thus, a fundamental difference

<1

10

102

103

104

>105

<1 10 102 103 104 >105

I/E T

ime

[sec

]

PReach Time [sec]

Fig. 2. Predictability of the running time for PReach and the inclusion-exclusion

method. Each circle represents the running times for one network. The network

sizes vary between 58 and 62. x axis: running times for PReach (s). y-axis: running

time for the inclusion-exclusion method (s). Scales on both axes are logarithmic.

The circles at 105 s running time represent all running times greater than 105 s.

between PReach and the inclusion-exclusion method isthat growing number of paths rapidly hurts the perfor-mance of the latter, while it actually improves that ofthe former. Moreover, our method does not impose anarbitrary maximum path length, which may be difficultto determine and lead to incorrect conclusions, since themagnitude of the error cannot be properly estimated.

We observe this in our experiments; as the numberof edges grows, we expect the running time of PReachto grow. However, the paths and cuts help us collapsethe polynomial terms and cut down the running timedramatically.

The running times of both methods can grow rapidlyas they have time complexities exponential in differentterms. So their cost can grow highly when the networkgrows slightly. To observe how severe this problem is,we assessed running times of PReach relative to theinclusion-exclusion method by growing the network sizein a controlled fashion. To do this, we slightly grew anetwork by adding one node at a time in a way thatpreserves global properties, such as density and powerlaw distribution of node degrees. To add a new nodeto the network, we choose an existing node uniformlyat random and assign its in-degree to the new node.To connect the new node to the network, we randomlyselect nodes from the current network. The probabilitythat a node will be selected is proportional to the out-degree of that node. We add an incoming edge from eachchosen node to the new one. We add the outgoing edgesfrom the new node similarly. This way the power lawcharacteristic of the degree distribution in the networkis preserved. Following this procedure, starting with theWnt network of H. sapiens, which has 58 nodes, wegenerated 200 networks. 50 networks are generated byadding one node to the initial network, other 50 aregenerated from one of the networks obtained in the pre-vious step by adding one more node and so on. We ran


TABLE 3Network statistics and running time comparison between our method (PReach) and the inclusion-exclusion method (IE). Time is in seconds unless otherwise specified.

The values marked with ∼ are estimates we obtained from the number of paths after running IE for over one day. In parenthesis, we show the time to compute all cuts

for PReach, which is included already in the total running time of PReach.

Original Preprocessed Execution time (s)Network nodes edges paths nodes edges paths cuts IE PReach

WNT 58 91 66 10 19 12 88 0.001 0.03 (0.01)Hedgehog + Jak-STAT + mTOR 64 100 136 14 28 36 152 17238.43 1.75 (0.02)

WNT + Hedgehog 70 114 111 13 31 31 896 583.97 17.69 (1.3)WNT + mTOR 84 129 89 17 33 20 1408 0.29 57.31 (3.54)

WNT + TGF-beta 87 139 220 16 35 49 2064 ∼102 yrs 1151.23 (39.13)WNT + Jak-STAT 77 127 81 17 37 41 1384 ∼146 hrs 1805.45 (48.81)

both PReach and the inclusion-exclusion method on eachnetwork. We report the running times in Figure 2. Weobserve that, because of its exponentiality, the runningtimes for the inclusion-exclusion method vary wildly,spanning several orders of magnitude increase, even ifthe network sizes differ by four nodes at most, between58 and 62. By contrast, PReach finishes in 100 seconds inmost of the cases with much smaller variation in runningtime. In a small number of instances, the inclusion-exclusion method is faster than PReach. Typically, thishappens when there are only a small number of pathsfrom the source to target nodes. Nevertheless, PReachhas a small running time even for those data sets.

4.2 Evaluation on synthetic networks

We performed a detailed experimental evaluation of ourmethod using synthetic networks. They have the advan-tage that their characteristics can be controlled as theyare generated. In the case of biological networks, whichare known to be scale-free, we also have a generativemodel that approximates closely their characteristics, theBarabasi-Albert network model [68], [27]. We created sixsets of random networks using this model. In each set,we created 20 networks with the same number of nodes:15,17,19, 21, 23, 25, respectively. The only parameter ofthe model is the number of edges that is added per node,which dictates the density of the network. The integerclosest to the ratio found in real biological networks istwo, so we added two edges per node. We identified thesources and targets of each network using the vertex-sortalgorithm [28]. For each network, we applied PReachto find the probability that there is a path from anysource to any target, using dummy unique source andtarget nodes, as in Section 4.1. In Figure 3, we reportthe average running time for each size. As expected,the running times are exponential in the number ofedges. Nevertheless, the algorithm completes in 100seconds for the largest network considered. We alsofitted the exponential function axb to the experimentaldata, using a multiplicative error model. We obtaineda = 0.00006003456 and b = 0.568528. As we can see, theconstants are small, especially the coefficient in front ofthe exponential function, which gives further indicationof the improvement brought by our algorithm.

0.1

1

10

100

1000

30 35 40 45 50

Ave

rage

tim

e [s

ec]

Number of edges

Fig. 3. Running time of PReach on synthetic Barabasi-Albert networks ofincreasing size (logarithmic scale). The dots show measured times. The line showsthe fitted exponential function axb, where a = 0.00006003456 and b = 0.568528

4.3 Comparison of PReach and Bernoulli samplingmethod

Huang et al. [16] introduced a sampling approach thatcan be used to solve the problem. We implemented itas a solution for our problem as follows. From a givenprobabilistic network G = (V,E, P ), we draw a sampleof deterministic instances. We form each instance bystarting with an empty network and doing Bernoullitrials on each edge. We add the edges whose trialsresult in success to the instance. For each deterministicnetwork in the sample, we run a simple graph searchfor the target node starting from the source node. Wecalculate the approximate reachability probability of Gas the percentage of the deterministic sample where thetarget node was found reachable from the source node.

Let p be the true reachability probability of G, andc be the size of the sample. This sampling processis equivalent to a random variable X which is theaverage of c independent Bernoulli random variablesX1, . . . , Xc. We characterize the confidence level withthe error coefficient ε such that the standard deviation


of X at most equals ε · E(X). Therefore:

V ar(X) ≤ (ε · E(X))2

V ar(1

c

c∑i=1

Xi) ≤ (ε · E(1

c

c∑i=1

Xi))2

1

c2

c∑i=1

V ar(Xi) ≤ (ε · 1

c

c∑i=1

E(Xi))2

1

c2· c · p(1− p) ≤ (ε · 1

c· c · p)2 ⇔ c ≥ 1

ε2(1

p− 1)

(10)

This shows that the required sample size, and subse-quently the execution time, are quadratically inverselyproportional to the error coefficient ε, and linearly in-versely proportional to the reachability probability p.For example, for p = 0.8 and ε = 10−2, the requiredminimum sample size is 2, 500, while for p = 0.1 and ε =10−3, the required minimum sample size is 9, 000, 000.

We compared the performance of PReach with thesampling method using different data sets for differentp and ε values. The data sets we used are ErbB andMAPK signaling networks from KEGG, as well as anetwork constructed by mixing both of them. In thefirst experiment, we compute the average reachabilityprobability p from a single source to a single target. Weused interaction probabilities from STRING to computep using PReach. We also used this value to calculatethe minimum sample size c required for the samplingmethod. We measured the running time for PReach andfor the sampling method for different error coefficientvalues ε. Fig 4(a) shows the results. PReach performsbetter than sampling with ε = 10−1 for ErbB andMAPK, and better than sampling with ε = 10−3 for theircombined network.

In the second experiment, we measured the averagerunning time needed for a single sampling iteration(τ ). We fixed the error coefficient value ε = 10−2 andplotted 1

ε2 ( 1p − 1) · τ as the total time required for a

given p value. We compared this with the running timeof PReach, which does not change with changing p.Figure 4(b) shows the results. PReach performs betterthan the sampling method across all values of p for ErbBand MAPK, and for lower values of p for their combinednetwork.

It is worth highlighting that in both experiments werun both methods for finding single-source single-targetreachability probability. In cases of finding any-sourcesany-targets reachability probability, PReach takes moretime to find the answer, while the sampling time staysthe same. In conclusion, sampling is a good choice ifhigh confidence results are not required. However, evenin many of these cases, PReach returns a zero-error resultin the same or comparably higher time. Also, PReach isthe choice if the application dictates exact results, or ifonly very low error is tolerated.

1

10

10-2

10-1

102

103

104

10-1(Sampling) 10-2(Sampling) 10-3(Sampling) 0(PReach)

Tim

e [m

s]

Error coefficient (ε)

ErbBMAPK

ErbB+MAPK

(a)

1

10

10-2

10-1

102

103

104

105

106

110-3 10-2 10-1

Tim

e [m

s]

Reachability Probability (p)

ErbB+MAPKMAPK

ErbB

(b)

Fig. 4. Comparison of PReach versus sampling. (a) Running time for findingsingle-source single-target reachability probability using PReach and using sam-pling for different error coefficient (ε) values. (b) Running time for finding single-source single-target reachability probability using PReach and using sampling fora fixed error coefficient value ε = 10−2 and different reachability probability (p)values. For each data set, the curved line is sampling time and the straight line isPReach time.

4.4 Biological significance of the results

To assess the biological significance of our results, wehave compared the results obtained on the KEGG sig-naling networks with results obtained on randomly gen-erated networks. To generate these random networks,we start with an initial KEGG network and we shufflethe edges in a manner that preserves the node countand the degree of each node. At each step of the shuf-fling process, we select two edges uniformly at randomand switch the endpoints between the two edges, i. e.,edges (a, b) and (c, d) become (a, d) and (c, b). We repeatthis for 100, 000 times. For each KEGG network, eachsource node and each target node, we computed theprobability that the target node is reachable from thesource node. We ranked all source-target pairs by theircorresponding reachability probabilities and plotted theresulting graphs in Figure 5. We omitted from the plotsthe pairs with zero probability. The plots clearly showhigher reachability probability for the KEGG networks


0.4

0.5

0.6

0.7

0.8

0.9

1

20 40 60 80 100 120

prob

abili

ty

rank

originalshuffled

(a) ErbB

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

10 20 30 40 50 60 70

prob

abili

ty

rank

originalshuffled

(b) Wnt

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

20 40 60 80 100 120

prob

abili

ty

rank

originalshuffled

(c) MAPK

Fig. 5. Reachability probabilities in KEGG signaling networks and randomly generated networks. Left: ErbB; middle: Wnt; right: MAPK.

TABLE 4Top sources and targets according to PReach and the deterministic alternative in

ErbB, MAPK and Wnt. For PReach: most common sources and targets having

reachability probability greater than 0.9. For Deterministic: sources connected to

most targets and targets connected to most sources.

Top sources and targetsPReach Deterministic

Sources EREG, HBEGF HBEGF, BTC, TGFA,ErbB EGF, TGFA EGF, EREG

Targets STAT5, FAK, ELK1 All tiedMAPK Sources TNF, ILA1, GLK TNF, IL1A

Targets ELK1, p53, CREB NLK, ATF4Wnt Sources TAK1, PKA, PS-1 WNT5A, DKK1

Targets NLK, c-Myc ROCK1, PRKCA

compared to the randomly generated networks, espe-cially for the ErbB and Wnt pathways. For example, thepair ranked 40th in the ErbB network has a reachabilityprobability of close to 0.9, while the pair with the samerank in the corresponding randomly generated networkis disconnected. The gap between the plots for thereal and shuffled networks demonstrate that signalingnetworks are heavily biased towards enabling pathwaysfrom receptors to reporters to maximize the chances thatsignals can travel to reporters. Interestingly, the gap be-tween the probability values in the MAPK network andits random counterpart is less wide. This observationcomplies with the fact that the average in-degree ofMAPK reporters is 1.68, while it is much larger for ErbBand Wnt. This suggests that MAPK signaling network ismore robust and less sensitive to perturbations as longas the node degrees are preserved.

We have further investigated the results of this experi-ment by examining the most common source and targetproteins in the pairs that achieved reachability proba-bility greater than 0.9. We also devised an intuitive de-terministic alternative for measuring reachability. Usingthe underlying deterministic version of each network,we count the number of targets that a given source canreach to, as well as the number of sources that can reachto a given target. We also list the top sources and targetsaccording to this metric. Table 4 lists such proteinsfor each examined network. Among the target proteinswith more than 0.9 reachability probability in ErbB isSTAT5. It plays an important role in the development

TABLE 5Common GO terms for targets having reachability probability greater than 0.9 in

ErbB, MAPK and Wnt.

Network GO ID Term name Frequency

ErbB GO:0000209 protein polyubiquitination 1GO:0007087 mitotic nuclear pore com-

plex reassembly0.88

MAPK

GO:0000209 protein polyubiquitination 0.79GO:0003399 cytoneme morphogenesis 0.74GO:0006268 DNA unwinding involved

in replication0.53

GO:0031265 CD95 death-inducing sig-naling complex

0.53

GO:0042598 vesicular fraction 0.89

Wnt GO:0000209 protein polyubiquitination 0.85GO:0042598 vesicular fraction 0.92

and function of B- and T-lymphocytes, which are themajor cellular components of the adaptive immune re-sponse [21]. One of the highest reachable target proteinsin MAPK is CREB, which has a well-known role incell proliferation, differentiation and survival, as well asspecific functions in immune responses [61]. In Wnt, NLKis found among the highest reachable target proteins,which positively regulates Wnt/β-catenin signaling byphosphorylating LEF1 in neural progenitor cells [44].The top ErbB sources found by PReach were also foundby the deterministic method. However, the deterministicmethod produces a tie between all ErbB targets, pro-viding no ranking information. In cases of MAPK andWnt, PReach produces top sources and targets that aredifferent than the ones produced by the deterministicmethod. This provides new insights about proteins ofpotentially important roles, which cannot be found bythe deterministic method.

We also extracted the GO annotations that are commonto the target proteins that achieve more than 0.9 reach-ability probability. From the list of source-target pairsthat meet this criterion, we select the three most commontargets, that is, the targets that appear most frequently inthe list, unless there were less than three targets in total.Then we found the GO terms that are common to allthese targets. We then eliminated nonspecific GO terms(i,e., GO-terms that are too high in the GO-hierarchy)in order to avoid having a positive bias towards ourmethod. Such elimination strategy is used in the lit-


erature for the same purpose (see [55], [58]). We alsoquantified the significance of these terms by computingthe fraction of the target proteins associated with eachterm in all target proteins. Table 5 lists the annotationsand their frequency for each examined network.

One of the common annotations of MAPK highest-reachable target nodes is cytoneme morphogenesis(GO:0003399). This signifies a role in shaping theanatomical structure of cytonemes, which connectadjacent cells to enable transfer of surface-associatedcargoes from cell to cell[53]. Among the commonannotations for ErbB highest-reachable target nodesis mitotic nuclear pore complex reassembly (GO:0007087).In the context of mitotic cell division, this signifies arole in reforming nuclear pores, which facilitate themovement of macro-molecules between the nucleus andcytoplasm [4].

We compared the GO annotations we obtained inTable 5 with GO terms for targets that can be reachedfrom most sources in the deterministic version of thethree networks considered. We could not extract the topGO annotations for ErbB targets because all the targetsare tied in ranking. For MAPK and Wnt, PReach outputsGO terms that are different than the ones output bythe deterministic method. This points out different areasof investigation about more important functions of eachnetwork. The only annotation common to all networks,both in the probabilistic and deterministic versions, isprotein polyubiquitination (GO:0000209). Polyubiquitina-tion is the formation of a ubiquitin chain on a single ly-sine residue on the substrate protein. Ubiquitin is a smallprotein that exists in all eukaryotic cells. It performsa vast multitude of functions through conjugation to alarge range of target proteins, therefore its pervasivenesson our results is not surprising.

TABLE 6Common GO terms for targets that can be reached from most sources in the

deterministic version of the ErbB and MAPK for the experiments shown in Table 5.

Network GO ID Term name

MAPK

GO:0000209 protein polyubiquitinationGO:0005570 small nuclear RNAGO:0006271 strand elongation in DNA replicationGO:0042651 thylakoid membrane

Wnt

GO:0000209 protein polyubiquitinationGO:0000038 long-chain fatty acid metabolic processGO:0000162 tryptophan biosynthetic processGO:0004137 deoxycytidine kinase activityGO:0004549 tRNA-specific ribonuclease activityGO:0004938 alpha2-adrenergic receptor activityGO:0004994 somatostatin receptor activityGO:0005364 maltose:hydrogen symporter activityGO:0005388 calcium-transporting ATPase activityGO:0005570 small nuclear RNAGO:0006716 juvenile hormone metabolic processGO:0034081 polyketide synthase complexGO:0016250 N-sulfoglucosamine sulfohydrolase activGO:0016707 gibberellin 3-beta-dioxygenase activityGO:0042651 thylakoid membraneGO:0043714 (R)-citramalate synthase activity

5 CONCLUSION AND FUTURE WORK

We have presented a fast and exact method for comput-ing the reachability probability in probabilistic networks,where each edge represents an actual interaction witha different probability. Our method, PReach, relies ona theory we developed around a special class of poly-nomials, the xy-polynomials, used as a representationfor probabilistic networks. The biological significance ofPReach is validated by experimentation. Other methodsthat have considered this problem in the context ofsignaling networks, based on the inclusion-exclusionprinciple, are far behind our method in terms of com-putational cost.

PReach has numerous potential biological applicationson multiple levels. On the node level, a new notion ofnode centrality in probabilistic networks can be devisedby measuring the difference in reachability probabilityif the node is removed. On the network level, networkrobustness can be assessed based on difference in reacha-bility probability after changing interaction probabilitiesand/or network topology. On the level of collective func-tions, reachability probability of a biological functioncan be measured as that of the targets annotated bycorresponding functional annotation. As a result, wecan tell which functions are more important (i.e., moresupported) than others in a given network.

Further improvements to the algorithm are possible,suitable especially for large input networks. In suchcases, a divide and conquer strategy can be adopted,similar to the method used in [2]. We can partition theinitial network in two loosely connected clusters andsolve the problem for each cluster. The looser the compo-nents are connected between themselves, the easier it isto merge the solutions. This can yield important benefitsin terms of running time. We plan to explore this furtherimprovement in future work.

REFERENCES

[1] K. K. Aggarwal, K. B. Misra, and J. S. Gupta. Reliability evaluationA comparative study of different techniques. MicroelectronicsReliability, 1975.

[2] F. Ay, G Gulsoy, and T. Kahveci. Finding Steady States of LargeScale Regulatory Networks through Partitioning. GENSIPS, 2005.

[3] J. S. Bader, A. Chaudhuri, et al. Gaining confidence in high-throughput protein interaction networks. Nat. Biotechnol., 2004.

[4] K. K. Bodoor et al. Function and assembly of nuclear porecomplex proteins. Biochemistry and cell biology, 1999.

[5] D. B. Brown. A computerized algorithm for determining thereliability of redundant configurations. IEEE Trans. on Reliability,1971.

[6] J. I. Brown and C. J. Colbourn. Non-Stanley Bounds for NetworkReliability. Journal of Algebraic Combinatorics, 1996.

[7] A. Ceol, A. Chatr Aryamontri, et al. MINT, the MolecularINTeraction database: 2009 update. Nucleic Acids Res., 2010.

[8] M. C. Easton and C. K. Wong. Sequential Destruction Method forMonte Carlo Evaluation of System Reliability. IEEE Transactionson Reliability, 1980.

[9] S. Fields and O. Song. A novel genetic system to detect protein-protein interactions. Nature, 1989.

[10] G. S. Fishman. A Monte Carlo Sampling Plan for EstimatingNetwork Reliability. Operations Research, 1986.

[11] Y. Fu. Application of topological methods to probabilistic net-works. IEEE Trans. Commun. Technol., 1965.


[12] Y. Fu and S. S. Yau. A note on the reliability of communicationnetworks. SIAM J. Appl. Math., 1962.

[13] Haitham Gabr, Andrei Todor, Helia Zandi, Alin Dobra, andTamer Kahveci. PReach: Reachability in Probabilistic SignalingNetworks. Technical Report REP-2013-564, CISE, University ofFlorida, Gainesville, Florida, April 2013.

[14] A. C. Gavin, M. Bosche, et al. Functional organization of the yeastproteome by the systematic analysis of protein complexes. Nature,2002.

[15] G. Gulsoy, N. Bandhyopadhyay, and T. Kahveci. HIDEN: Hierar-chical decomposition of regulatory networks. BMC Bioinformatics,2012.

[16] Hailiang H., Lan V. Z., et al. Probabilistic Paths for Protein Com-plex Inference. In Systems Biology and Computational Proteomics,2006.

[17] E. Haensler, G. K. McAuliffe, and R. S. Wilkov. Exact calculationof computer network reliability. Fall Joint Computer Conference,1972.

[18] E. Haensler, G. K. McAuliffe, and R. S. Wilkov. Optimizing thereliability In centralized computer networks. IEEE Transactions onCommunications, 1975.

[19] C. Hanzhong. A new algorithm for computing the reliability ofcomplex networks by the cut method. Microelectronics Reliability,1994.

[20] S. A. Hasanuddin. Simple enumeration of minimal cutsets ofacyclic directed graph. IEEE Trans. Reliability, 1988.

[21] L. M. Heltemes-Harris, M. J. L. Willette, et al. The role of STAT5in the development, function, and transformation of B and Tlymphocytes. Annals of the New York Academy of Sciences, 2011.

[22] F. E. Hohn and L. R. Schissler. Boolean matrices and the design ofcombinational relay switching circuits. Bell Systems Tech. J., 1955.

[23] T. Husfeldt and N. Taslaman. The Exponential Time Complexityof Computing the Probability That a Graph is Connected. Inter-national Symposium on Parameterized and Exact Computation, 2010.

[24] G. L. Hwang, T. A. Tillman, and M. H. Lee. System-ReliabilityEvaluation Techniques for Complex/Large Systems-A Review.IEEE Trans. on Reliability, 1981.

[25] S. P Jain and K. Gopal. An efficient algorithm for computingglobal reliability of a network. IEEE Trans. Reliability, 1988.

[26] G. B. Jasmon and S. Kai. A new technique in minimal path andcutset evaluation. IEEE Trans. Reliability, 1985.

[27] H. Jeong, B. Tombor, et al. The large-scale organization ofmetabolic networks. Nature, 407(6804):651–654, October 2000.

[28] R. Jothi, S. Balaji, et al. Genomic analysis reveals a tight linkbetween transcription factor dynamics and regulatory networkarchitecture. Mol. Syst. Biol., 2009.

[29] M. Kanehisa, S. Goto, et al. The KEGG resource for decipheringthe genome. Nucleic Acids Res., 2004.

[30] R. M. Karp and M. G. Luby. A New Monte Carlo Method forEstimating the Failure Probability of An N-component System.EECS TR UCB/CSD-83-117, University of California, Berkley, 1983.

[31] E.K. Kim and E.C. Choi. Pathological roles of MAPK signalingpathwaysin human diseases. Biochimica et Biophysica Acta (BBA)-Molecular Basis, 2010.

[32] H. Kumamoto, K. Tanaka, and K. Inoue. Efficient evaluationof system reliability by Monte Carlo method. IEEE Trans. onReliability, 1977.

[33] M. Laplante and D. M. Sabatini. mTOR signaling in growthcontrol and disease. Cell, 2012.

[34] M. O. Locks. Inverting and minimalizing path sets and cut sets.IEEE Trans. Reliability, 1978.

[35] C. Lucet and J-F. Manouvrier. Exact methods to compute networkreliability. Mathematical Methods in Reliability, 1997.

[36] B.T. MacDonald, K. Tamai, and X. He. Wnt/β-Catenin Signaling:Components, Mechanisms, and Diseases. Developmental Cell, 2009.

[37] A. Martelli. A gaussian elimination algorithm for the enumerationof cut sets in a graph. J. Assoc. Comp. Machinery, 1976.

[38] H. Mine. Reliability of physical systems. IRE Trans. Circuit Theory,1959.

[39] K. B. Misra. An algorithm for the reliability evaluation ofredundant networks. IEEE Trans. Reliability, 1970.

[40] K. B. Misra and T. S. M Rao. Reliability analysis of redundantnetworks using flow graphs. IEEE Trans. Reliability, 1970.

[41] S. Mizuno, R. Iijima, et al. AlzPathway: a comprehensive map ofsignaling pathways of Alzheimer’s disease. BMC Systems Biology,2012.

[42] F. Moskowitz. The analysis of redundancy networks. AIEE Trans.(Commun. Electron.), 1968.

[43] S. Okada. Topology applied to switching circuits. Proc. Symp. onInformation Networks, Polytechnic Institute of Brooklin, 1954.

[44] S. Ota, S. Ishitani, et al. NLK positively regulates Wnt/β-cateninsignalling by phosphorylating LEF1 in neural progenitor cells.EMBO J, 2012.

[45] O. Ourfali, Shlomi T., et al. SPINE: a framework for signaling-regulatory pathway inference from cause-effect experiments.Bioinformatics, 2007.

[46] J. Oxley and D. Welsh. Chromatic, Flow, and Reliability Polynomi-als: the Complexity of their Coefficients. Combinatorics, Probabilityand Computing, 2012.

[47] M. Piekarski. Listing all possible trees of a linear graph. IEEETrans. Circuit Theory, 1965.

[48] J. S. Provan and M. O. Ball. The Complexity of Counting Cuts andof Computing the Probability that a Graph is Connected. SIAMJournal of Computing, 1983.

[49] J. S. Provan and M. O. Ball. Computing network reliability intime polynomial in the number of cuts. Operations Research, 1984.

[50] S. Rai and K. K. Aggarwal. An efficient method for reliabilityevaluation of a general network. IEEE Trans. Reliability, 1978.

[51] C Re and D Suciu. Efficient evaluation of HAVING queries ona probabilistic database. Proc. Database Programming Languages,2007.

[52] Y. Shen. A new simple algorithm for enumerating all minimalpaths and cuts of a graph. Microelectronics Reliability, 1995.

[53] N. M. Sherer and W. Mothes. Cytonemes and tunneling nan-otubules in cellcell communication and viral pathogenesis. Trendsin Cell Biology, 2008.

[54] D. R. Shier and D. E. Whited. Iterative algorithms for generatingminimal cutsets in directed graphs. Networks, 1986.

[55] R. Singh, J. Xu, and B. Berger. Global alignment of multiple pro-tein interaction networks with appication to functional orthologydetection. PNAS, 2008.

[56] A. D. Sokal. The multivariate Tutte polynomial (alias Potts model)for graphs and matroids. Surveys in Combinatorics, 2010.

[57] D. Szklarczyk, A. Franceschini, et al. The STRING databasein 2011: functional interaction networks of proteins, globallyintegrated and scored. Nucleic Acids Res., 2011.

[58] A. Todor, A. Dobra, and T. Kahveci. Probabilistic BiologicalNetwork Alignment. TCBB, 2013.

[59] S. Tsukiyama, I. Shirakawa, and H. Ozaki. An algorithm toenunerate all cutsets of a graph in linear time per cutset. J. Assoc.Comp. Machinery, 1980.

[60] C. von Mering, R. Krause, et al. Comparative assessment of large-scale data sets of protein-protein interactions. Nature, 2002.

[61] A. Y. Wen, K. M. Sakamoto, and L. S. Miller. The Role of theTranscription Factor CREB in Immune Function. The Journal ofImmunology, 2010.

[62] R. S. Wilkov. Reliability considerations in computer networkdesign. Proc. IFIP Congress ’71, Ljubljana, 1971.

[63] R. Wilkow. Analysis and design of reliable computer networks.IEEE Transactions on Communications, 1975.

[64] O. Wing and P. Demetriou. Analysis of probabilistic networks.IEEE Trans. Commun. Technol., 1964.

[65] O. Wing and W. H. Kim. The path matrix and switching functions.J. Franklin Inst., 1959.

[66] W. Xiao, M. N. Mindrinos, et al. A genomic storm in criticallyinjured humans. The Journal of experimental medicine, 2011.

[67] L. Yan, H. A. Taha, and T. L. Lander. A recursive approach forenumerating minimal cutsets in a network. IEEE Trans. Reliability,1994.

[68] SH. Yook, Z. N. Oltvai, and A. L. Barabasi. Functional andtopological characterization of protein interaction networks. Pro-teomics, 4(4):928–942, 2004.

[69] K. Zhu, W Zhang, et al. BMC: An Efficient Method to Evaluate theProbabilistic Reach ability Queries. Database Systems for AdvancedApplications, 2011.


Haitham Gabr Haitham Gabr received his B.S.degree in Computer and Systems Engineering fromAlexandria University, Egypt in 2007. He worked asa software engineer at eSpace Technologies for fouryears, where he specialized in Web 2.0 solutions. In2011, he received the University of Florida AlumniFellowship, and is currently a Ph.D. student in theComputer and Information Science and EngineeringDepartment at the University of Florida. His main re-search interests are bioinformatics and computationalbiology in general, probabilistic biological networks in

particular.

Andrei Todor received his B.S. degree in Com-puter Engineering from the Technical University of Cluj-Napoca, Romania in 2004 and his M. S. degree in Com-puter Science from Stony Brook University (SUNY) in2009. He is currently a Ph. D. student in the Computerand Information Science and Engineering Departmentat the University of Florida. His main research interestsare bioinformatics and databases, in particular proba-bilistic biological networks and databases.

Alin Dobra received his B.S. degree in Computer En-gineering from the Technical University of Cluj-Napoca,Romania in 1998 and his Ph.D in Computer Sciencefrom Cornell University in 2003. He is currently anAssociate Professor in the Computer and InformationScience and Engineering Department at the Universityof Florida. Dr. Dobra received an NSF CAREER awardin 2005 and a SIGMOD 2007 best paper award. Hismain research interests are in large scale databasesystems, approximate query processing and probabilis-tic models.

Tamer Kahveci received his B.S. degree in Com-puter Engineering and Information Science from BilkentUniversity, Turkey in 1997, and his Ph.D. degree inComputer Science from University of California atSanta Barbara in 2004. He is currently an AssociateProfessor in the Computer and Information Science andEngineering Department at the University of Florida.Dr. Kahveci received the Ralph E. Powe Junior FacultyEnhancement award in 2006, CSB best paper award in2008, the NSF Career award in 2009, the ACM-BCB(Bioinformatics and Computational Biology) and the

best student paper award in 2010. His main research interests are bioinformaticsand databases. He has worked on indexing and querying sequence and proteinstructure and biological network databases.

Documents

JOURNAL OF LA Reachability analysis in probabilistic ...atodor/reachability_journal.pdf · JOURNAL OF LATEX CLASS FILES, VOL. 6, NO. 1, JANUARY 2007 1 Reachability analysis in probabilistic