
netCSI: A Generic Fault Diagnosis Algorithm for Large-Scale Failures in Computer Networks

Srikar Tati, Scott Rager, Bong Jun Ko, Guohong Cao, Fellow, IEEE, Ananthram Swami, Fellow, IEEE, and

Thomas La Porta, Fellow, IEEE

Abstract—We present a framework and a set of algorithms for determining faults in networks when large-scale outages occur. The design principles of our algorithm, netCSI, are motivated by the fact that failures are geographically clustered in such cases. We address the challenge of determining faults with incomplete symptom information due to a limited number of reporting nodes. netCSI consists of two parts: a hypotheses generation algorithm and a ranking algorithm. When constructing the hypothesis list of potential causes, we make novel use of positive and negative symptoms to improve the precision of the results. In addition, we propose pruning and thresholding, along with a dynamic threshold value selector, to reduce the complexity of our algorithm. The ranking algorithm is based on conditional failure probability models that account for the geographic correlation of the network objects in clustered failures. We evaluate the performance of netCSI for networks with both random and realistic topologies. We compare the performance of netCSI with an existing fault diagnosis algorithm, MAX-COVERAGE, and demonstrate an average gain of 128 percent in accuracy for realistic topologies.

Index Terms—Fault diagnosis, large-scale network failures, incomplete information, clustered failures


1 INTRODUCTION

EFFICIENTLY detecting, diagnosing, and localizing faulty network elements [1], [2], [3] (e.g., network nodes, links) are critical steps in managing networks. The main focus of conventional fault detection and diagnosis techniques [1], [2] is to identify faults that are independent in nature, i.e., caused by individual component failures (e.g., routing software malfunction, broken links, adapter failure, etc.). In this paper, we concentrate on handling massive failures (or “outages”) of many network elements caused by, for example, weapons of mass destruction (WMD) attacks [4], natural disasters [5], [6], electricity blackouts, cyber attacks, etc.

The type of outages caused by massive failures differs greatly from those caused by typical equipment faults [1], [2], [7]. Massive outages tend to create faults at multiple components that are geographically close to each other. We call these failures clustered failures. Until now, prior work in the area of fault diagnosis has focused on independent failures [1], [3], [7]. The performance of these algorithms degrades when applied to clustered failures. In this paper, we propose netCSI, a new algorithm that is designed to effectively identify faulty network components under clustered failures. To show the benefits of our algorithm, we compare it with an existing algorithm that was proposed for independent failures.

netCSI determines possible causes of large-scale failures using a knowledge base and end-to-end symptom information. The knowledge base contains information about possible paths between different source-destination pairs and the inferred topology of the network. The end-to-end symptoms reflect end-to-end connectivity or disconnectivity in the network and are observed when a failure occurs. These symptoms include both negative information, such as which source-destination pairs are disconnected, as well as positive information, such as which source-destination pairs can still communicate.

Once a clustered failure occurs, netCSI uses the knowledge base and symptoms to generate a list of possible causes of the outage, called the hypothesis list. Then, a ranking algorithm is applied to the hypothesis list to rate the possible causes.

The main assumption in existing fault diagnosis algorithms [1], [2], [7] is that complete and accurate information is available at the network manager. However, during large-scale failures, it is very unlikely that complete end-to-end symptom information will be available, because reporting nodes may not be able to reach the network manager. Hence, we specifically address the issue of incomplete end-to-end symptoms, caused by a limited number of reporting nodes or the inability of some of the reporting nodes to report all the symptoms. Note that netCSI considers the issue of incomplete symptoms and the aspect of positive information at the same time, unlike Steinder and Sethi [8], who consider these aspects separately.

The following are our key contributions in this paper:

• Utilizing positive information. Unlike existing fault diagnosis algorithms [1], [7], both connectivity and disconnectivity information are considered in netCSI instead of only disconnectivity information. This enables netCSI to generate a more precise hypothesis list

• S. Tati, S. Rager, G. Cao, and T. La Porta are with the Pennsylvania State University. E-mail: {tati, str5004, gcao, tlp}@cse.psu.edu.

• B. J. Ko is with the IBM T. J. Watson Research Center. E-mail: [email protected].

• A. Swami is with the Army Research Laboratory. E-mail: [email protected].

Manuscript received 27 Aug. 2012; revised 12 June 2014; accepted 30 Sept. 2014. Date of publication 9 Nov. 2014; date of current version 18 May 2016. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference the Digital Object Identifier below. Digital Object Identifier no. 10.1109/TDSC.2014.2369051

IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, VOL. 13, NO. 3, MAY/JUNE 2016 355

1545-5971 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.


with fewer false positives. The average decrease in the false positives due to the addition of positive symptoms is 64 percent in our simulation study.

• Coping with lack of complete information. We understand the impact of partial symptom information on failure diagnosis. While generating the hypothesis list of possible causes, netCSI performs a type of exhaustive search, in which we introduce several optimization techniques that significantly decrease the run-time. Because of this, netCSI is able to generate much more accurate results than other well-known heuristic-based algorithms, such as MAX-COVERAGE [1], when there are few reporting nodes. In addition, we also look at the case when some of the reporting nodes do not have complete information. In our simulation performed on a realistic network topology, the average gain in the accuracy of netCSI over MAX-COVERAGE is 128 percent with limited reporting nodes.

• Conditional failure probability (CFP) models. We develop conditional probability models based on characteristics of massive failures and the available topology of the network. These models are used in our ranking algorithm to enable accurate fault localization under clustered failures.

First, we review related work in Section 2. We describe our problem in Section 3, and explain netCSI and the ranking algorithm in Sections 4 and 5, respectively. We give an analysis of run-time in Section 6. We evaluate the performance of netCSI in Section 7. Finally, we provide conclusions in Section 8.

2 RELATED WORK

There is much prior work [1], [3], [7] that addresses the localization of independent failures of individual components in the network. To the best of our knowledge, our algorithm is the first to focus on diagnosing massive failures and the challenges posed by them. However, there is considerable interest in other aspects of large-scale failures in the current literature [9], [10].

Network tomography [11], [12] is a research area that has evolved over the past few years. The main focus of network tomography is inferring link-level properties from end-to-end measurements and network topology. One type of network tomography technique, called binary network tomography [13], deals with links that are only either good or bad, and is close to our problem. Most solutions in this area [14], [15] rely on large linear systems that represent the relation between path and link properties, which is fundamentally underconstrained, i.e., there exist links in the system whose properties or status cannot be determined uniquely. Some approaches [16], [17] try to solve this issue by employing statistical techniques based on maximum likelihood methods, but these still fall short on accuracy when determining link states. In contrast, netCSI does not solve a linear system of equations, but instead generates possible causes considering all links in the network.

In addition, network tomography assumes that there is a unique path between a source and destination pair in the network. As explained in [18], this can be a significant problem when diagnosing failures, when either rerouting occurs or the topology is dynamic, as in computer networks. Unlike tomography, netCSI considers all known multiple paths between a source and destination pair. Moreover, there is no work in network tomography that deals with large-scale failures, where a group of nodes fail together simultaneously; however, there is recent work that focuses on correlation between links while inferring individual metrics [19].

MAX-COVERAGE [1], a fault diagnosis algorithm based on SCORE (space correlation engine) [7], focuses on MPLS-over-IP backbone networks. Localization agents use the end-to-end connectivity measurements as alarms and employ spatial correlation techniques to isolate failures. SCORE [7] is a greedy approach to localize optical link failures using only IP-layer event logs (IP link faults). The performance of these algorithms degrades for large-scale failures under incomplete information. This problem of incomplete information is precisely addressed by netCSI. Furthermore, netCSI makes use of both negative and positive symptoms, unlike MAX-COVERAGE and SCORE.

Sherlock [3], Shrink [2], and Steinder and Sethi [8] propose fault diagnosis techniques that use combinatorial algorithms with exponential computational complexity, like netCSI. Sherlock and Shrink try to minimize the complexity by restricting the number of objects that are assumed to have failed, which introduces false negatives. However, they do not provide any mechanism to decide how many objects to restrict. We propose a dynamic threshold value selector (see Section 4.2.2), which uses a Bayesian approach to select the threshold. netCSI also uses a novel pruning technique to effectively reduce run-time without sacrificing accuracy. Furthermore, both Sherlock and Shrink assume complete symptoms, whereas netCSI performs well with incomplete symptom information.

3 PROBLEM AND MOTIVATION

Our goal is to determine the list of faulty objects in a network as accurately as possible in the case of clustered failures, given a knowledge base comprising network topology and end-to-end symptoms from a limited number of reporting nodes. We describe this system model and explore this problem further via an example in the following sections.

3.1 System Model

We consider a network with n nodes, represented by the set N, and l links, represented by the set L. We define the term objects to represent either nodes or links. The set of p = (n + l) objects in the network is denoted by O.

Each node communicates with destination nodes over different possible routes that are determined by a routing algorithm.¹ These routes are loop-free and may not be equal to all possible routes in a given topology. Source nodes are represented by S = {s1, s2, s3, ..., sm} ⊆ N. Different routes from a source node si ∈ S to k different destinations,

1. Our fault diagnosis algorithm is agnostic of the routing algorithm, which is chosen by the network manager.


Page 3: IEEE TRANSACTIONS ON DEPENDABLE AND SECURE …mcn.cse.psu.edu/paper/other/tdsc-tati15.pdf · with fewer false positives. The average decrease in the false positives due to the addition

denoted by Di = {d1, d2, ..., dk}, are available in the knowledge base. The routes from source node si to a destination node dj ∈ Di are given by X_{i,j} = {x^1_{i,j}, x^2_{i,j}, ...}. These routes are finite, and this number is different for each source-destination pair. The qth route, x^q_{i,j}, is stored as the list of objects in O that are present along the route. If any of the objects along a route is faulty, then that route is disconnected; otherwise it is considered a connected route.

netCSI uses the end-to-end symptom information of the sources' connectivity and disconnectivity to their respective destination sites. A destination site dj ∈ Di is regarded as connected to si if there exists at least one connected route in X_{i,j}, and disconnected if all the routes in X_{i,j} have at least one fault. This symptom information of a source si is given by the set I_i = {I_{i,1}, I_{i,2}, I_{i,3}, ..., I_{i,k}}, where the elements in the set correspond to the k destination nodes given in Di. I_{i,j} is defined only for dj ∈ Di, and is equal to either 1 or 0 as follows:

    I_{i,j} = 1, if destination dj is connected to si
              0, if destination dj is disconnected from si

The network manager collects these symptoms, I_i, from a reporting node si ∈ S. We assume that the collected symptoms are accurate and have no errors. In large-scale failures, getting symptom information from every reporting node in the network is a major challenge. In addition, the collection of symptom information involves high overhead in terms of network management traffic. Note that netCSI can diagnose both node and link failures depending on the objects that have been specified in routes by the network manager. In the next section, we consider only nodes in the routes; however, in our simulations (Section 7.1), we consider links in the network as the objects in routes.
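The symptom definition above maps directly to code. The sketch below is illustrative only: the route table and object names are hypothetical placeholders, not the topology of the paper's evaluation. A destination is marked connected (I = 1) when at least one route avoids every faulty object.

```python
def symptom_vector(source, routes, faulty):
    """Compute I_i for one reporting node: vector[d] = 1 iff at least one
    route from `source` to destination d avoids every faulty object."""
    vector = {}
    for (s, d), route_list in routes.items():
        if s != source:
            continue
        connected = any(not (set(route) & faulty) for route in route_list)
        vector[d] = 1 if connected else 0
    return vector

# Hypothetical knowledge base: each route is a list of intermediate objects.
routes = {
    ("s1", "d1"): [["R1", "R2"], ["R1", "R3"]],
    ("s1", "d2"): [["R1", "R2"], ["R1", "R3"]],
    ("s2", "d1"): [["R3", "R4"]],
    ("s2", "d2"): [["R4"]],
}

I1 = symptom_vector("s1", routes, faulty={"R1"})  # s1 loses both destinations
I2 = symptom_vector("s2", routes, faulty={"R1"})  # s2 is unaffected by R1
```

With R1 faulty, both of s1's destinations become disconnected while s2's symptoms stay positive, illustrating how a single failure yields a mix of negative and positive information.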

3.2 Motivating Example

Consider the network in Fig. 1, in which there are two sources, s1 and s2, each connected to two destination nodes, d1 and d2. The possible routes for all source-destination pairs are given in a table in Fig. 1. In this example, we only consider the intermediate nodes as objects in the network. Suppose we are given the following symptoms: source node s1 is disconnected from both the destination nodes d1 and d2, and source node s2 is disconnected from destination node d1, but is connected to destination node d2.

First consider the case in which we look only at the set of negative symptoms, {I_{1,1} = 0, I_{1,2} = 0, I_{2,1} = 0}. In this case, there are eight possible causes of the symptoms, as shown in Table 1. Up to this point, we cannot say that any particular router is down with complete confidence because of the ambiguity in the possible causes deduced from the given negative symptoms. However, if we know that routers R1 and R3 are geographically close to each other, for instance in the same building, we can consider the double failure (R1, R3) as the most likely possible cause of the observed symptoms.

If, in addition to negative information, we also have the positive information from source nodes s1 and s2 (in this case I_{2,2} = 1), we can infer three possible causes as shown in Table 1. From these causes, we can deduce that router R3 must be down and router R4 must be operating correctly. If we again know that routers R1 and R3 are close to each other, then we can prioritize the double failure (R1, R3) as the most likely possible cause. As seen in this example, the addition of positive information decreases the number of possible causes. Moreover, with the knowledge of geographical topology information, we can rank the possible causes of symptoms to enable more accurate localization.

In addition, if we compare the cases where positive and negative symptoms are acquired from only s1 (incomplete symptoms) versus both source nodes s1 and s2 (complete symptoms) in Table 1, the latter case results in fewer possible causes. In other words, having symptoms from more reporting nodes generally results in fewer false positives and higher accuracy. However, the reduction in false positives comes with added overhead in the form of extra network management traffic. Therefore, this trade-off is important when there are limited reporting nodes in the network and is discussed in more detail in Section 7.4.3.
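The example's cause counts can be reproduced with a short enumeration sketch. The route table below is assumed for illustration (chosen to be consistent with Table 1; Fig. 1's actual route table is not reproduced here). A combination of faulty objects is a possible cause when it disconnects every negatively reported pair without disconnecting any positively reported pair.

```python
from itertools import combinations

# Hypothetical routes, consistent with Table 1; each route is a set of routers.
routes = {
    ("s1", "d1"): [{"R1", "R2"}, {"R1", "R3"}],
    ("s1", "d2"): [{"R1", "R2"}, {"R1", "R3"}],
    ("s2", "d1"): [{"R3", "R4"}],
    ("s2", "d2"): [{"R4"}],
}

def connected(pair, faulty):
    # A pair stays connected if at least one route avoids all faulty objects.
    return any(not (route & faulty) for route in routes[pair])

def hypotheses(negative, positive):
    # Candidate faulty objects: everything on a route of a disconnected pair.
    suspects = sorted(set().union(*(r for p in negative for r in routes[p])))
    causes = []
    for k in range(1, len(suspects) + 1):
        for combo in combinations(suspects, k):
            f = set(combo)
            if all(not connected(p, f) for p in negative) and \
               all(connected(p, f) for p in positive):
                causes.append(combo)
    return causes

negative = [("s1", "d1"), ("s1", "d2"), ("s2", "d1")]
neg_only = hypotheses(negative, positive=[])              # 8 possible causes
with_pos = hypotheses(negative, positive=[("s2", "d2")])  # 3 possible causes
```

With only negative symptoms the enumeration yields the eight causes of Table 1; adding the single positive symptom I_{2,2} = 1 shrinks the list to three, all containing R3 and none containing R4, matching the deduction in the text.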

4 NETCSI: HYPOTHESES GENERATION ALGORITHM AND REFINEMENTS

Given the knowledge base and end-to-end symptoms of the network as input, we propose a hypotheses generation algorithm that outputs possible combinations of faulty objects

Fig. 1. A network scenario to illustrate the algorithm.

TABLE 1
Possible Causes of Given Symptoms

Information from s1:
  No. of failed objects | Negative information       | Positive and negative information
  1 object              | (R1)                       | (R1)
  2 objects             | (R1,R2), (R2,R3), (R1,R3)  | (R1,R2), (R2,R3), (R1,R3)
  3 objects             | (R1,R2,R3)                 | (R1,R2,R3)

Information from s2:
  No. of failed objects | Negative information       | Positive and negative information
  1 object              | (R4), (R3)                 | (R3)
  2 objects             | (R4,R3)                    | ()

Information from s1 and s2:
  No. of failed objects | Negative information       | Positive and negative information
  1 object              | ()                         | ()
  2 objects             | (R1,R3), (R1,R4), (R2,R3)  | (R1,R3), (R2,R3)
  3 objects             | (R1,R2,R3), (R1,R2,R4), (R1,R3,R4), (R2,R3,R4) | (R1,R2,R3)
  4 objects             | (R1,R2,R3,R4)              | ()



that can cause the observed failures. We then introduce several heuristics to decrease the run-time of the hypotheses generation algorithm.

4.1 Hypotheses Generation Algorithm

Step I. A list of possible faulty objects is constructed in this step. The list is comprised of all unique objects in routes from different sources (reporting nodes) to the corresponding disconnected nodes. We define a set with all possible faulty objects, represented by P. Consider all source-destination pairs with negative information, i.e., I_{i,j} = 0 for an (si, dj) pair with dj ∈ Di. The unique objects in all possible routes between different source and destination pairs are added to the faulty object list. The sample faulty object list for the example shown in Fig. 1 and the symptoms mentioned in Section 3.2 is (R1, R2, R3, R4). Step I is implemented in lines 2-11 of the pseudo-code shown in Fig. 2.

Step II. Every combination of possible faulty objects in the formulated list is checked against the given symptoms utilizing the knowledge base. We eliminate the combinations of faulty objects which make the observed symptoms impossible and shortlist the rest of the combinations of faulty objects as potential causes of the symptoms. This shortlist of possible combinations is called the hypothesis list.

We consider two kinds of symptom information in this algorithm: negative and positive. Note that this step can be parallelized, i.e., different combinations can be processed in parallel to check their feasibility for the given end-to-end symptoms. Lines 24-45 show the implementation of Step II in the pseudo-code given in Fig. 2.

If there are N objects in the possible faulty object list, then the search space is 2^N − 1. Thus, the computational complexity grows exponentially in N. To decrease the run-time of the algorithm, we present refinements to the base netCSI algorithm that utilize both positive and negative symptoms below.

4.2 Decreasing Computational Complexity

4.2.1 Pruning

We can decrease the run-time of netCSI by reducing the search space of combinations. The underlying idea of pruning is to recognize the combinations that are not possible because they disconnect all the routes in X_{i,j} between a source si ∈ S and destination dj ∈ Di that are actually connected according to the given positive symptoms, i.e., I_{i,j} = 1. We include all such combinations in the Impossible Set, represented by IS.

Since the combinations of faulty objects in IS are not possible causes of failure, any other combination that includes at least the same faulty objects as a combination in IS cannot be a possible cause of failure. Therefore, Step II of the hypotheses generation algorithm can be avoided in the case of any combination which is a superset of a combination in IS.

Note that the advantage of this pruning can be realized completely only when combinations are searched in ascending order of cardinality, i.e., the combinations with one faulty object are considered first, followed by combinations with increasing numbers of faulty objects, since combinations with the lowest cardinality have the highest number of supersets. Pruning can be applied to the base netCSI algorithm. We call this approach OPT1. In our simulations, the average reduction in the search space of OPT1 is 94 percent when compared to base netCSI.
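A minimal sketch of pruning, using a hypothetical route table (not the paper's evaluation topology). Combinations are enumerated in ascending cardinality; any combination that contradicts a positive symptom joins the impossible set, and all of its supersets are skipped without evaluation.

```python
from itertools import combinations

# Hypothetical routes; ("s2", "d2") is known connected (positive symptom).
routes = {
    ("s1", "d1"): [{"R1", "R2"}, {"R1", "R3"}],
    ("s1", "d2"): [{"R1", "R2"}, {"R1", "R3"}],
    ("s2", "d1"): [{"R3", "R4"}],
    ("s2", "d2"): [{"R4"}],
}
negative = [("s1", "d1"), ("s1", "d2"), ("s2", "d1")]
positive = [("s2", "d2")]
suspects = ["R1", "R2", "R3", "R4"]

def blocks_all(pair, faulty):
    # True if every route of the pair contains at least one faulty object.
    return all(route & faulty for route in routes[pair])

impossible_set, hypothesis_list, evaluated = [], [], 0
# Ascending cardinality, so low-cardinality impossible combos prune the most.
for k in range(1, len(suspects) + 1):
    for combo in combinations(suspects, k):
        f = set(combo)
        if any(f >= imp for imp in impossible_set):
            continue  # superset of an impossible combo: skipped unevaluated
        evaluated += 1
        if any(blocks_all(p, f) for p in positive):
            impossible_set.append(f)  # contradicts a positive symptom
        elif all(blocks_all(p, f) for p in negative):
            hypothesis_list.append(combo)
```

In this toy run, {R4} is found impossible at cardinality one, so its seven supersets are never evaluated: only 8 of the 15 candidate combinations reach Step II, while the hypothesis list is unchanged.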

4.2.2 Thresholding

The motivation for thresholding also comes from the objective of decreasing the search space of combinations of possible faulty objects. In thresholding, we eliminate some combinations of objects from the search space with the help of certain pre-defined attributes, depending on the circumstances and nature of the network [2], [3].

Fig. 2. Pseudo-code.



Attributes can be the number of faulty objects in the combination [3], threshold values for the joint probability of the combination, etc. For example, if the attribute is the number of faulty objects, then the combinations where the number of faulty objects is higher than the threshold value are eliminated from the search space to decrease the number of computations required.
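As a minimal illustration (with a hypothetical candidate list), thresholding on combination cardinality simply truncates the enumeration at the threshold value:

```python
from itertools import combinations

suspects = ["R1", "R2", "R3", "R4", "R5"]
threshold = 2  # assumed maximum number of simultaneous faults

# Full search space: all non-empty subsets, 2^5 - 1 combinations.
full_space = sum(1 for k in range(1, len(suspects) + 1)
                 for _ in combinations(suspects, k))
# Thresholded search space: only combinations of cardinality <= threshold.
thresholded = sum(1 for k in range(1, threshold + 1)
                  for _ in combinations(suspects, k))
```

Here the threshold cuts the space from 31 to 15 combinations; causes with more than two faulty objects would be missed, which is why choosing the threshold well matters.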

Both pruning and thresholding can be simultaneously applied to the base netCSI algorithm. We call this approach OPT2. In our experiments, we found that OPT2 with a proper threshold value that does not cause any loss in accuracy can reduce the search space by 97.5 percent when compared to base netCSI. Note that netCSI has exponential computational complexity, as we explained before; therefore, an extra reduction of 3.5 percent with thresholding is very significant. We will assess this trade-off in our experimental results in Section 7.4.1.

Unlike pruning, thresholding reduces the search space with a potential loss in accuracy of diagnosis: the hypothesis list might lose some combinations, i.e., it might be less accurate. However, with an appropriate threshold value, thresholding does not change the accuracy.

To appropriately set this value, we devise a new technique called dynamic threshold value selection. We find that the appropriate value is close to the expected number of faults that can occur due to a large-scale failure, i.e., we have to choose a different threshold value in different failure scenarios. We estimate the number of failures based on symptoms in the network to select an appropriate threshold value.

Dynamic threshold value selection. This technique is based on a Bayesian decision approach. We make use of both training data and the currently observed negative symptoms when a failure occurs to estimate the expected number of failures. We then use this number as the threshold value.

We consider training data with different samples of failures, and store the resultant number of negative symptoms for each sample along with its frequency of occurrence, i.e., a histogram with the number of negative symptoms as the independent variable, which is represented by A. This histogram data is divided into different bins that represent different ranges of failures, i.e., in terms of the number of failures. The total number of bins is represented by b, and each bin is indexed by i.

We estimate the probability density function, CF_i, for each bin i using kernel density estimation [20]. Thus, the training data will result in different probability density functions corresponding to the various bins. In addition, the training data consists of both offline and online failure data. The offline failure data is acquired by choosing random samples of clusters in the network and then using network measures as explained in the next paragraph. The online failure data is recorded during the occurrence of actual failures.

For offline failure data, we consider both independent and clustered failures. We use the network measures betweenness centrality (BC) and group betweenness centrality (GBC) to determine the end-to-end negative symptoms that will result from a failure, instead of enumerating the symptoms by simulating all possible failures in the network. BC for an object is defined as the fraction of paths that pass through the object, out of all the paths between different nodes in the network. Conventionally, BC is based on single shortest paths in a given network; here the definition is with respect to all pre-failure paths, X_{i,j}, i ∈ S, j ∈ Di, that are available to the network manager. GBC of a cluster is defined as the fraction of paths that pass through that particular cluster out of all the paths between different nodes in the network. In the above definitions, the considered paths between nodes are the ones which are available to the network manager from reporting nodes. Note that the paths are considered to be loop-free here.

Since we consider multiple paths between source and destination nodes, in a single object failure case, the number of negative symptoms is approximately equal to the BC of the failed object. Moreover, for independent failures, when multiple objects fail simultaneously, the total number of end-to-end negative symptoms is estimated as the sum of the BC values of the failed objects.
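The BC-based estimate above can be sketched as follows. This is an illustrative sketch, not the authors' code: the path sets and link names are hypothetical, and BC is computed here as a raw path count rather than a fraction, matching its use as a symptom-count estimate.

```python
# Illustrative sketch: estimating end-to-end negative symptoms from
# betweenness-style path counts. Paths are sets of link identifiers.

def betweenness(paths, obj):
    """Number of known pre-failure paths that traverse `obj`."""
    return sum(1 for path in paths if obj in path)

def estimated_negative_symptoms(paths, failed_objects):
    """For independent failures, approximate the total number of negative
    symptoms as the sum of the BC values of the failed objects."""
    return sum(betweenness(paths, o) for o in failed_objects)

# Toy example: three source-destination paths over links l1..l5.
paths = [{"l1", "l2"}, {"l2", "l3"}, {"l4", "l5"}]
print(estimated_negative_symptoms(paths, {"l2"}))        # -> 2 (BC of l2)
print(estimated_negative_symptoms(paths, {"l2", "l4"}))  # -> 3 (sum of BCs)
```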

For clustered failures, we choose samples that include both random cluster and random space failures (Section 7.1) depending on the available cluster information (CI) at the network manager. For both modes, the number of end-to-end negative symptoms is equal to the cluster's GBC value.

When a large-scale failure occurs in the network, we observe the number of negative symptoms at a failure instance, n_s. Given n_s and the probability density functions of the various bins (the histogram discussed earlier in this section), we use a Bayesian approach to identify the size of the failure set. We then use this as the threshold value. Our technique is described in Equations (1) and (2). The prior probability, P(CF_i), for a bin i is evaluated using training data. In the absence of training data, uniform priors are assumed,

P(CF_i | A = n_s) = P(A = n_s | CF_i) P(CF_i) / Σ_{j=1}^{b} P(A = n_s | CF_j) P(CF_j),   (1)

Selected Bin Number = argmax_i P(CF_i | A = n_s).   (2)
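A minimal sketch of this selection rule, assuming a simple Gaussian kernel for the per-bin density estimates and toy training samples; all names and data are hypothetical. The constant denominator of equation (1) is omitted since it does not affect the argmax.

```python
# Hedged sketch of the dynamic threshold value selector: per-bin kernel
# density estimates of negative-symptom counts, then a Bayesian pick of
# the most probable bin (Equations (1) and (2)).
import math

def kde(samples, bandwidth=2.0):
    """Gaussian kernel density estimate built from training samples."""
    def density(x):
        return sum(math.exp(-((x - s) / bandwidth) ** 2 / 2) for s in samples) \
               / (len(samples) * bandwidth * math.sqrt(2 * math.pi))
    return density

def select_bin(ns, bin_samples, priors=None):
    """Return the bin index i maximizing P(CF_i | A = ns)."""
    b = len(bin_samples)
    priors = priors or [1.0 / b] * b      # uniform priors without training data
    posteriors = [kde(s)(ns) * p for s, p in zip(bin_samples, priors)]
    return max(range(b), key=lambda i: posteriors[i])

# Bin 0: small failures (few negative symptoms); bin 1: large failures.
bins = [[3, 4, 5, 6], [20, 22, 25, 28]]
print(select_bin(5, bins))    # -> 0 (observed ns near the small-failure bin)
print(select_bin(24, bins))   # -> 1 (observed ns near the large-failure bin)
```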

We evaluate netCSI with dynamic threshold value selection in Section 7.4.5.

5 NETCSI: RANKING

In this section, we present a ranking algorithm that utilizes a conditional failure probability model to rank the hypothesis list.

5.1 Ranking Algorithm

Once the hypothesis list is available, we employ a ranking algorithm to rate the possible combinations based on the maximum likelihood principle. The combinations in the hypothesis list are ranked in decreasing order of their joint probabilities. Here, the joint probability of a combination is the probability that the events corresponding to both its faulty and non-faulty objects happen at the same time.

To demonstrate the idea behind the computation of the joint probability, we look at a general combination with both faulty and non-faulty objects, i.e., a combination of N objects in P, the set of possible faulty objects (Step I in Section 4.1), that has arbitrary sets of k faulty and N − k non-faulty objects. We denote the set of faulty objects as F = {o_1, o_2, ..., o_k} and the set of non-faulty objects as

TATI ETAL.: NETCSI: A GENERIC FAULT DIAGNOSIS ALGORITHM FOR LARGE-SCALE FAILURES IN COMPUTER NETWORKS 359


S = {o_{k+1}, o_{k+2}, ..., o_N}. Therefore, by construction: F, S ⊆ P, F ∪ S = P, and F ∩ S = ∅. The failure event of an object o_i ∈ P is modeled using a random variable, F_i. The value of F_i is either 1 if o_i is faulty, or 0 if o_i is not faulty. The expression for the joint probability of objects (o_1, o_2, ..., o_k) being faulty and objects (o_{k+1}, o_{k+2}, ..., o_N) being not faulty is,

P(F_1 = 1, ..., F_k = 1, F_{k+1} = 0, ..., F_N = 0)
  = P(F_1 = 1) × P(F_2 = 1 | F_1 = 1) × ... × P(F_k = 1 | F_1 = 1, F_2 = 1, ..., F_{k−1} = 1)
  × P(F_{k+1} = 0 | F_1 = 1, F_2 = 1, ..., F_k = 1) × ... × P(F_N = 0 | F_1 = 1, ..., F_k = 1, ..., F_{N−1} = 0).   (3)

The individual conditional probabilities on the R.H.S. of equation (3) are computed with the help of the conditional failure probability model, as explained in the next section.
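As a concrete illustration of the chain-rule expansion in equation (3), the sketch below multiplies conditional probabilities left to right. The independent-failure model is a hypothetical placeholder standing in for the CFP models of Section 5.2; all names are illustrative.

```python
# Illustrative chain-rule computation of the joint probability in eq. (3).
# `cond_model(obj, value, faulty_so_far, ok_so_far)` must return
# P(F_obj = value | assignments made so far).
def joint_probability(faulty, non_faulty, cond_model):
    prob = 1.0
    faulty_so_far, ok_so_far = set(), set()
    for o in faulty:                      # P(F_o = 1 | earlier assignments)
        prob *= cond_model(o, 1, faulty_so_far, ok_so_far)
        faulty_so_far.add(o)
    for o in non_faulty:                  # P(F_o = 0 | earlier assignments)
        prob *= cond_model(o, 0, faulty_so_far, ok_so_far)
        ok_so_far.add(o)
    return prob

def independent_model(p=0.3):
    """Placeholder: each object fails independently with probability p."""
    return lambda obj, value, f, s: p if value == 1 else 1.0 - p

# Two faulty and one non-faulty object under the independent placeholder:
print(joint_probability(["o1", "o2"], ["o3"], independent_model(0.3)))
# -> 0.3 * 0.3 * 0.7 = 0.063
```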

5.2 Conditional Failure Probability Models

The main objective of the conditional failure probability models is to determine the conditional probability that an object in an arbitrary set is faulty or not, given the faultiness of the other objects in the same set. These values are used to compute the joint probability of a combination as shown in equation (3).

Here, we consider that the network may be divided into different clusters based on various attributes of the network. For example, objects in the same building may be considered to be in one cluster, or objects enclosed in a specified geographical area may be labeled as one cluster. Given clustered failures, it is likely that all objects in a cluster are either faulty or not faulty collectively.

Alternatively, we can also consider that objects within a certain distance of each other under some metric, e.g., physical distance or hop count, are most likely to fail together. In this context, there can be three possible kinds of information: cluster information, object distance information (OD), and no information (NI). We do not describe the conditional failure probability model with no information (CFPM-NI) here because it is straightforward; details can be found in [21].

5.2.1 Conditional Failure Probability Model with Cluster Information (CFPM-CI)

In CFPM-CI, we consider that there are c clusters in total; the number of objects in each cluster can vary. We represent the cluster of object o_i as C_i. We consider two arbitrary sets

of faulty and non-faulty objects, represented by F′ and S′, respectively. As before, F′, S′ ⊆ P and F′ ∩ S′ = ∅. The probability of an object o_i being faulty, conditioned on the other remaining faulty and non-faulty objects, is given in equation (4), and the conditional probability that an object o_i is not faulty given F′ and S′, P(F_i = 0 | F′, S′), is simply the complement of the probability given in equation (4),

P(F_i = 1 | F′, S′) = p_c,      if ∃ o_j ∈ F′ : C_i = C_j ∧ j ≠ i;
                     1 − p_c,   otherwise.   (4)

The above equation signifies that if any of the objects in the same cluster as object o_i is faulty, then the object o_i is faulty with probability p_c. Since we are dealing with clustered failure scenarios, p_c should be very close to 1.

We assume homogeneity across clusters with respect to vulnerability to clustered failures, and hence we propose a constant value for p_c. If information about the vulnerability of each cluster to clustered failures is known, then we can introduce variable values of p_c and extend our model to that setup.
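Equation (4) can be sketched directly as a small function; the cluster labels and the choice p_c = 0.9 below are illustrative assumptions, not values fixed by the model itself.

```python
# Sketch of CFPM-CI (equation (4)).
def cfpm_ci(obj, faulty_set, cluster_of, p_c=0.9):
    """P(F_obj = 1 | F', S') under the cluster-information model."""
    same_cluster_failed = any(cluster_of[o] == cluster_of[obj]
                              for o in faulty_set if o != obj)
    return p_c if same_cluster_failed else 1.0 - p_c

cluster_of = {"o1": "A", "o2": "A", "o3": "B"}
print(cfpm_ci("o2", {"o1"}, cluster_of))  # same cluster as a failed object -> 0.9
print(cfpm_ci("o3", {"o1"}, cluster_of))  # no failed object in its cluster -> 1 - p_c
```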

5.2.2 Conditional Failure Probability Model with Object Distance Information (CFPM-OD)

Here, we deal with a situation where no information about clusters is known, but some metric of distance between objects in the network is known. The distance between objects can be measured through various parameters, for example, Euclidean distance, hop distance, etc. We define the distance between two objects o_i and o_j as a function d(i, j).

As explained above, we consider two arbitrary sets of faulty and non-faulty objects, F′ and S′. The conditional probability that an object o_i is faulty given the remaining faulty and non-faulty objects is defined in equation (5). Again, the conditional probability that an object o_i is not faulty, P(F_i = 0 | F′, S′), is just the complement of the probability given in equation (5). Here, the function d(i, F′) in equation (5) is defined as the minimum distance between the object o_i and any object of the set F′, because the event that an object is faulty or not faulty depends on the closest failed object,

P(F_i = 1 | F′, S′) = p_o^{d(i, F′)},   (5)

where p_o ∈ [0, 1], and d(i, F′) = min d(i, j) over all o_j ∈ F′ with o_j ≠ o_i. In this paper, the hop distance between objects is used as the distance measure in the simulations.
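Equation (5) admits an equally small sketch; the tiny hop-distance map and the choice p_o = 0.5 are hypothetical, used only to show how the probability decays with distance from the nearest failed object.

```python
# Sketch of CFPM-OD (equation (5)) with hop distance.
def cfpm_od(obj, faulty_set, dist, p_o=0.5):
    """P(F_obj = 1 | F', S') = p_o ** d(obj, F'), where d is the minimum
    distance from `obj` to any object already marked faulty."""
    d = min(dist(obj, o) for o in faulty_set if o != obj)
    return p_o ** d

# Hop distances on a toy path o1 - o2 - o3 (hypothetical layout).
hops = {("o1", "o2"): 1, ("o2", "o3"): 1, ("o1", "o3"): 2}
dist = lambda a, b: hops.get((a, b)) or hops.get((b, a))
print(cfpm_od("o2", {"o1"}, dist))  # one hop from a failed object -> 0.5
print(cfpm_od("o3", {"o1"}, dist))  # two hops -> 0.25
```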

6 RUN-TIME ANALYSIS

For this analysis, we define the term Search Space to represent the number of combinations to which the hypotheses generation and ranking algorithms of netCSI must be applied; SS_{POS+NEG} represents the search space of base netCSI and SS_OPT1 represents the search space for the refinement OPT1. Also, let m be the minimum cardinality over all the combinations in the Impossible Set (IS) for a given

scenario. SS_{POS+NEG} is equal to 2^N − 1, where N is the number of possible faulty objects. N depends on the number of objects (nodes or links) in the network and the number of routes considered by netCSI. Therefore, this analysis offers an understanding of how netCSI scales with larger networks.

With pruning, portions of the search space will be eliminated for any scenario with a non-empty IS. Specifically, we can define upper and lower bounds on SS_OPT1 for any given m and non-empty IS. The upper bound is achieved when IS has only one combination with m objects, and the lower bound occurs when IS has all possible combinations with m objects. These upper and lower bounds of SS_OPT1 are shown in equation (6),

Σ_{i=0}^{m} C(N, i) − 1 ≤ SS_OPT1 ≤ (2^N − 1) − (2^{N−m} − 1).   (6)

360 IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, VOL. 13, NO. 3, MAY/JUNE 2016

Page 7: IEEE TRANSACTIONS ON DEPENDABLE AND SECURE …mcn.cse.psu.edu/paper/other/tdsc-tati15.pdf · with fewer false positives. The average decrease in the false positives due to the addition

In the case of thresholding, with an arbitrary threshold value, th, the original search space of 2^N − 1 is reduced to G = (2^N − 1) − [ C(N, th+1) + C(N, th+2) + ... + C(N, N) ]. Since OPT2 utilizes both pruning and thresholding, we can follow logic similar to that used for OPT1 to evaluate the bounds of SS_OPT2, using G as the search space instead of 2^N − 1.

In order to see the benefit of pruning more easily, we can look at the fractional reduced search space, which lies in the interval [ (Σ_{i=0}^{m} C(N, i) − 1) / (2^N − 1), ((2^N − 1) − (2^{N−m} − 1)) / (2^N − 1) ]. For example, with m = 1 and N = 10, the search space is reduced to as little as 0.98 percent and as much as 50 percent of its original size.
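The bounds of equation (6) and the worked m = 1, N = 10 numbers can be checked with a few lines of arithmetic:

```python
# Numeric check of the pruning bounds in equation (6).
from math import comb

def ss_opt1_bounds(N, m):
    """Lower and upper bounds on the pruned search space SS_OPT1."""
    lower = sum(comb(N, i) for i in range(m + 1)) - 1
    upper = (2 ** N - 1) - (2 ** (N - m) - 1)
    return lower, upper

N, m = 10, 1
total = 2 ** N - 1                       # base search space, 1023 combinations
lo, hi = ss_opt1_bounds(N, m)
print(lo, hi)                            # -> 10 512
print(round(100 * lo / total, 2))        # -> 0.98 (percent of original size)
print(round(100 * hi / total, 2))        # -> 50.05
```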

From our experiments in Section 7.4, we picked 10 cases with different numbers of possible faulty objects (N). In all cases, the minimum cardinality of the combinations in IS is 1, i.e., m = 1. From Fig. 3, we see that the experimental values are close to the upper limits of the reduction in search space, resulting in a search space of less than 1 percent of that which is processed by the base netCSI algorithm.

Please note that the lower bound of SS_OPT1 reduces to a polynomial, O(N^m), for small values of m; this best case occurs when all possible combinations with m objects are present in IS. This is a useful insight as the scale of the network, i.e., the value of N, increases. However, determining the value of m for large networks is non-trivial and depends primarily on the scale of the failures. In addition, the value of m depends on the number and the kind of positive and negative symptoms that are available in the network.

7 EVALUATION AND RESULTS

7.1 Experimental Setup

We evaluate the performance of netCSI on two different types of topologies, random and realistic. The random networks are generated by placing nodes at random locations in a two-dimensional space, with links generated between the nodes as in a Waxman random graph [22]. Topologies are generated in circular areas whose radius is varied from √10 to √40, and the communication range of nodes is varied from five to 15. The realistic topology is taken from Rocketfuel trace data provided by the University of Washington [23]: a network of 315 backbone routers connected by 972 links in an ISP (Sprint in the US).

In both networks, multiple paths are generated among all nodes using Dijkstra's algorithm. We consider a maximum of five paths between each pair of nodes in the network. These paths are generated after arbitrarily failing certain links in the network, and are of varying lengths. Without loss of generality, we only consider links as objects in a path. In every experiment, a certain number of reporting nodes is selected randomly out of all nodes to report end-to-end symptoms.

For simulation purposes, clustering is done differently for the random networks and the network with the realistic topology. In random networks, the network is divided into disjoint clusters using the k-means clustering algorithm [24]. With the number of clusters (k) as input, the k-means algorithm generates k clusters, each with a variable number of nodes. For the network with the realistic topology, all the nodes in a particular city are regarded as one cluster; there are 24 clusters in total.

We simulate two types of failure modes: random cluster failure and random space failure. In the random cluster failure mode, we choose a random cluster out of all clusters in the network and fail all objects in that particular cluster. This failure mode mimics a realistic scenario where all objects in a building (considered as one cluster) may fail due to various reasons. In the random space failure mode, a geometric circle with a prescribed radius r is randomly placed, and all objects within that circle are failed. This scenario is similar to a case where geographically adjacent objects in multiple clusters fail together, or where a group of geographically close objects in a large wireless network fails simultaneously. Note that in the Rocketfuel topologies, we simulate only random cluster failures and not random space failures.
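The two failure modes can be sketched as below. This is a simplified illustration: the object positions and cluster map are hypothetical, and for brevity the circle is centered at a randomly chosen object's position, whereas the simulations place the circle randomly in space.

```python
# Illustrative sketches of the two simulated failure modes.
import random

def random_cluster_failure(clusters):
    """Fail every object in one randomly chosen cluster."""
    return set(random.choice(list(clusters.values())))

def random_space_failure(positions, r):
    """Fail every object inside a circle of radius r (center simplified
    here to a random object's position)."""
    cx, cy = random.choice(list(positions.values()))
    return {o for o, (x, y) in positions.items()
            if (x - cx) ** 2 + (y - cy) ** 2 <= r ** 2}

positions = {"l1": (0, 0), "l2": (1, 0), "l3": (5, 5)}
clusters = {"A": ["l1", "l2"], "B": ["l3"]}
print(random_cluster_failure(clusters))
print(random_space_failure(positions, r=2))
```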

We parallelized the code of netCSI and used an eight-core processor to improve the run-time. Since fault diagnosis is done at the network manager in our case, we expect that netCSI can run on advanced infrastructure in practice.

7.2 Performance Metrics

7.2.1 Hypotheses Generation Algorithm

Several metrics can be used to measure the performance of the hypotheses generation algorithm. The two main metrics are precision and accuracy. Precision of the hypothesis list is measured by the size of the hypothesis list: a smaller hypothesis list is said to be more precise, since the network manager has to deal with fewer possible causes of failure based on the symptoms. Accuracy of the hypothesis list is defined as the fraction of the faulty objects in the ground truth that are present in the list. More details are given in Section 7.2.3. In addition, we also consider the size of the search space, which is directly related to the time complexity of the hypotheses generation algorithm, and thus determines the run-time of netCSI.

7.2.2 Ranking Algorithm

In the context of the inputs given to netCSI, we define the ground truth as the set of all failed objects that are also present in at least one path from any reporting node. The set of faulty objects in the ground truth is defined as the actual faulty objects (AFO).

We define the cumulative rank (CR) to evaluate the performance of the ranking algorithm. We consider only a subset of objects as faulty objects instead of considering all objects

Fig. 3. Reduction in search space of OPT1 for different N with m = 1.


Page 8: IEEE TRANSACTIONS ON DEPENDABLE AND SECURE …mcn.cse.psu.edu/paper/other/tdsc-tati15.pdf · with fewer false positives. The average decrease in the false positives due to the addition

by using the ranks of different combinations in the hypothesis list. In that context, CR is a particular rank above which the objects in the union of combinations at higher ranks (lower numeric values of rank) are treated as faulty objects. Cumulative rank with ground truth (CR_GT) is a special value of CR, defined as the lowest rank above which all faulty objects of the ground truth are accumulated.

For example, consider a sample hypothesis list with seven combinations, ranked as shown in Table 2. Let the ground truth be the combination (o1, o2, o4, o5). In this example, the exact rank of the ground truth is five, whereas CR_GT is three. The faulty object set above CR_GT is O1 ∪ O2 ∪ O3 = (o1, o2, o3, o4, o5). This set is called Cumulative Faulty Objects (CFO_{CR_GT}), and can be defined for any cumulative rank (CFO_CR).
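The CR_GT computation in this example can be sketched as a small scan over the ranked list (the combinations below reproduce the Table 2 example):

```python
# Sketch: CR_GT is the smallest rank R such that the union of the top-R
# combinations covers every faulty object of the ground truth.
def cumulative_rank_gt(ranked, ground_truth):
    covered = set()
    for rank, combo in enumerate(ranked, start=1):
        covered |= set(combo)
        if covered >= set(ground_truth):
            return rank
    return None  # symptoms insufficient to cover the ground truth

ranked = [("o1", "o2"), ("o1", "o3"), ("o4", "o5"), ("o1", "o3", "o5"),
          ("o1", "o2", "o4", "o5"), ("o1", "o2", "o3", "o6"),
          ("o1", "o2", "o3", "o4", "o6")]
print(cumulative_rank_gt(ranked, ("o1", "o2", "o4", "o5")))  # -> 3
```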

7.2.3 False Positives and False Negatives

False positives and false negatives are possible with respect to both the hypothesis list and CR. False positives with respect to CR (FP_CR) are objects which are present in CFO_CR but not in AFO. False negatives with respect to CR (FN_CR) are objects which are part of AFO but not in CFO_CR. The expressions for %FP_CR and %FN_CR are given below. In addition, we also define the accuracy of a fault diagnosis algorithm as the fraction of objects in the ground truth that are diagnosed as faulty, i.e., accuracy is the complement of FN_CR. Different CR values give distinct CFO_CR, and thus different false positives and false negatives. The effect of CR on accuracy, false positives, and false negatives is examined in more detail in Section 7.4.

%FP_CR = |CFO_CR − AFO| × 100 / |CFO_CR|,
%FN_CR = |AFO − CFO_CR| × 100 / |AFO|.
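These two expressions are plain set arithmetic; the sketch below evaluates them on the Table 2 example, where CFO_CR = {o1, ..., o5} and AFO = {o1, o2, o4, o5}:

```python
# Set-based computation of %FP_CR and %FN_CR.
def fp_fn_percent(cfo, afo):
    fp = 100 * len(cfo - afo) / len(cfo)   # diagnosed faulty but not in AFO
    fn = 100 * len(afo - cfo) / len(afo)   # in AFO but missed by the diagnosis
    return fp, fn

cfo = {"o1", "o2", "o3", "o4", "o5"}
afo = {"o1", "o2", "o4", "o5"}
print(fp_fn_percent(cfo, afo))  # -> (20.0, 0.0)
```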

7.3 Robustness of CFP Models

To understand the robustness of the CFP models, we have run simulations on a random network with 100 nodes, k = 20 clusters, and 10 randomly selected reporting nodes. OPT1 is employed for fault diagnosis. All three CFP models are compared under two failure modes, random space and random cluster failures. We evaluate the metric average exact match rank, defined as the rank of the ground truth in the ordered hypothesis list. The effect of the parameters p_c and p_o (equations (4) and (5)) on the ranking algorithm is examined by choosing values from the set {0.5, 0.9}. We consider constant values for these parameters, but p_c and p_o could be adaptively learned from occurrences of clustered failures in the network; this is left as future work.

Fig. 4 shows that the performance of the ranking algorithm using CFPM-CI dominates both CFPM-OD and the model with no information. CFPM-CI performs better when p_c = 0.9 in both failure modes. However, CFPM-OD performs better with p_o = 0.5 than with 0.9 in the random space failure mode where r² = 20; this is not true in the case where r² = 40. The reason is that r² = 20 represents a smaller clustered failure scenario than r² = 40. Here, r is the radius of the circle. A similar observation can be made for random cluster failure scenarios when k = 30 (smaller cluster size) compared to the case when k = 10 (bigger cluster size), as shown in Fig. 4. Here, k is the number of clusters. Regardless of the failure modes, CFPM-CI with p_c = 0.9 is the most robust of all the models. Thus, we use this model for all subsequent simulations in Section 7.4. Nevertheless, having object distance information is still better than having no information.

7.4 Evaluation Results

In this section, we present results to gauge the performance of netCSI, and also compare it with MAX-COVERAGE. The experiments are run by assuming that the network manager has complete knowledge of the network topology and of the multiple paths that are consistent with the assumptions given in Section 3.1. This is a standard assumption in prior works [1], [2], [3], [7], including MAX-COVERAGE, and is valid if the network manager owns the network.

7.4.1 Performance of Different netCSI Approaches

Here, we analyze the performance of netCSI with both positive and negative information (POS+NEG), and with only negative information (NEG). For the former case, we evaluate the improvements that can be achieved with both OPT1 and OPT2. We consider a random network with 100 nodes divided into 20 clusters (k = 20) and three instances of random space failure scenarios with varying radius (r). In all cases, the number of reporting nodes is 10, and each case was run 10 times. The data for the various netCSI approaches is given in Table 3.

Both POS+NEG and NEG have a maximum search space of about 10^6 combinations in the case of r² = 20. In Fig. 5a, the average search space of OPT1 and OPT2 is 5.9 and 2.4 percent of the total combinations, respectively; the corresponding numbers are shown in Table 3. Here,

TABLE 2
A Sample Hypothesis List

Rank  Combination of objects    Rank  Combination of objects
1:    O1 = (o1, o2)             5:    O5 = (o1, o2, o4, o5)
2:    O2 = (o1, o3)             6:    O6 = (o1, o2, o3, o6)
3:    O3 = (o4, o5)             7:    O7 = (o1, o2, o3, o4, o6)
4:    O4 = (o1, o3, o5)

Fig. 4. Robustness of CFP models.



we considered 10 as the threshold for OPT2. Since the size of the search space is proportional to the time complexity of netCSI, we expect OPT2 to have the lowest run-time of all the netCSI approaches.

In Fig. 5a, we can see that the size of the hypothesis list for POS+NEG is 0.0015 percent of the total combinations, whereas NEG lists 3.39 percent of the total combinations as possible combinations. The reason for fewer possible combinations in the case of POS+NEG when compared to NEG is that the positive information eliminates false positives. In addition, we observe that the number of possible combinations is the same for both OPT1 and POS+NEG (0.0015 percent). Thus, the precision of the hypothesis list does not change, i.e., netCSI will not miss any faulty link in the ground truth, and hence there is no change in accuracy with pruning. However, OPT2 outputs a hypothesis list with many fewer combinations (0.0009 percent) than POS+NEG, so it is possible that we may not find all the faulty links in the ground truth with thresholding. Therefore, the accuracy of OPT2 might decrease compared to OPT1, depending on the threshold value.

Hence, in the case of OPT2, there is a trade-off between the accuracy and the size of the search space. In Fig. 5b, we plot the average size of the search space and the accuracy (with CR_GT) for different threshold values under OPT2. The accuracy of OPT2 increases with the value of the threshold. For th = 9, there is no loss of accuracy in OPT2; however, the size of the search space for OPT2 with th = 9 is much smaller than that of OPT1, as shown in Fig. 5b. Furthermore, in Fig. 5c, we show that both OPT2 with th = 10 and OPT1 have almost the same %FP_CR and %FN_CR for different cumulative ranks. Therefore, for a properly chosen threshold value, OPT2 can be as effective as OPT1, with the added advantage of a smaller search space. Henceforth, in the remainder of the paper, OPT2 means netCSI that includes the refinements: pruning, and thresholding with a proper threshold value.

7.4.2 Performance of Ranking Algorithm

We use CFPM-CI with p_c = 0.9 for our ranking algorithm to rate the possible combinations in the hypothesis list. In Table 3, we can see that the average CR_GT = 1 for almost all netCSI approaches, i.e., we find all the faulty objects in the highest-ranked combination in the hypothesis list; however, false positives differ from one approach to another. Because of the positive information, OPT1 (30 percent) and OPT2 (25 percent) have lower %FP_{CR_GT} compared to NEG (70.3 percent). Due to the smaller hypothesis list, OPT2 gives fewer false positives than OPT1. In Fig. 5c, we show that %FP_CR increases with the cumulative rank for both OPT2 and OPT1, because the CFO_CR list becomes bigger as we increase the cumulative rank. Please note that the average %FN_CR for both OPT2 and OPT1 with respect to cumulative ranks saturates at 42 percent.²

Incidentally, from Table 3, the average exact match rank of both OPT2 and OPT1 is in the bottom 25th percentile of the hypothesis list. This is due to the disparity between the simulated failure mode (random space failure) and CFPM-CI, which is

TABLE 3
Data Comparing Various netCSI Approaches

Algorithm  r²  Avg. links  Avg. total   Avg. combs   Avg. possible  Avg.    Avg. exact   Avg.
               failed      combs        in SS        combs          CR_GT   match rank   FP_{CR_GT}
NEG        20  6.7         9.73 × 10^5  9.73 × 10^5  3.3 × 10^4     1       2.62 × 10^4  74.4%
NEG        30  7.1         1.34 × 10^6  1.34 × 10^6  3.3 × 10^4     1       2.63 × 10^4  73.5%
NEG        40  8.9         2.17 × 10^6  2.17 × 10^6  7.87 × 10^4    1       6.57 × 10^4  71.7%
OPT1       20  6.7         9.73 × 10^5  5.75 × 10^4  29.1           1       24.2         26.6%
OPT1       30  7.1         1.34 × 10^6  7.91 × 10^4  29.1           1       25.2         25.7%
OPT1       40  8.9         2.17 × 10^6  1.28 × 10^5  31.7           1       28.4         28.2%
OPT2       20  6.7         9.73 × 10^5  2.70 × 10^4  21.3           1.4     16.4         25.5%
OPT2       30  7.1         1.34 × 10^6  3.56 × 10^4  17.4           1.6     13.5         24%
OPT2       40  8.9         2.17 × 10^6  5.26 × 10^4  18.9           1.8     15.6         26.5%

Fig. 5. Evaluation of different netCSI approaches.

2. Here the number of reporting nodes is 10. By increasing the number of reporting nodes, i.e., with more symptom information, we can diagnose more of the links that are faulty.



employed for the ranking algorithm. Even so, most of the time CR_GT lies in the top 5th percentile of the hypothesis list, as given in Table 3. Therefore, the ranking algorithm of netCSI performs well even when the assumed failure model and the actual failure mode are distinct, which is common in practice.

7.4.3 Effect of Varying Number of Reporting Nodes

We now evaluate netCSI and compare it with MAX-COVERAGE in situations where there is incomplete end-to-end symptom information. Since the focus is on large-scale failures, acquiring end-to-end symptom information from all the reporting nodes may not be feasible. Furthermore, a major drawback of collecting all the symptom information is redundancy, which may lead to unnecessary waste of critical resources in the network. However, with more symptoms, netCSI produces a hypothesis list that results in better accuracy and fewer false positives and false negatives.

Results are given for networks with both random and realistic topologies while varying the number of reporting nodes in Fig. 6. Since the output of MAX-COVERAGE is a set of objects that could be faulty, it is straightforward to determine its accuracy. However, the hypothesis list of netCSI contains different possible combinations of objects (multiple sets), so we consider CFO_CR (a list of possible faulty objects) to evaluate the accuracy of netCSI. For both networks, we employ OPT2 with different threshold values, and simulate a random cluster failure scenario with 10 trials.

In Fig. 6b, for the realistic topology, we consider CFO_{CR_GT} to compute the accuracy of netCSI with OPT2 for different threshold values. netCSI outperforms MAX-COVERAGE when the number of reporting nodes is varied between 10 and 100.³ The maximum and average gains in accuracy of OPT2_{th=7} are 567 percent (10 reporting nodes) and 128 percent, respectively, when compared to MAX-COVERAGE, as shown in Fig. 6b. Furthermore, in Fig. 6d, we present the accuracy of OPT2_{th=7} for different CR while varying the number of reporting nodes. The average gains in accuracy for OPT2_{th=15} with CR = 1, 2 and 3 over MAX-COVERAGE are 107.2, 115.8 and 119.4 percent, respectively. Hence, the network manager can pick a very small number as CR and achieve as much accuracy as with CR_GT.

We now consider a random network of 100 nodes with 20 clusters. In Fig. 6a, we give the accuracy of OPT2 considering CFO_{CR_GT}. When there are fewer reporting nodes, i.e., fewer than 75, both OPT2_{th=15} and OPT2_{th=10} are more accurate than MAX-COVERAGE. Comparing the accuracies of OPT2_{th=15} and MAX-COVERAGE specifically, the maximum gain is 138 percent (10 reporting nodes), and the average gain is 45 percent. However, OPT2_{th=5} has lower accuracy than MAX-COVERAGE, and does not follow any trend, as shown in Fig. 6a. The reason is that combinations in which the number of simultaneous failures is more than five are not considered. In Fig. 6c, the average gain in the accuracy of OPT2_{th=15} over MAX-COVERAGE is almost the same for both CR = 3 and CR_GT (Fig. 6a). Here also, OPT2_{th=15} with just CR = 1 can achieve an average gain of 41 percent over MAX-COVERAGE.

From the results for both networks, we conclude that even when the network manager picks a very small number as CR, we can achieve as much accuracy as with CR_GT. In addition, with 10 reporting nodes, we observe that OPT2 is more accurate for the network with a realistic topology than for the random topology. This is because our random topology is denser than the considered realistic topology, and with a given (small) number of symptoms, netCSI outputs more possible combinations in the hypothesis list for dense networks than for sparse ones.

In Fig. 7, we plot FP_{CR_GT} and FN_{CR_GT}, with OPT2_{th=15} for the random network, and with OPT2_{th=7} for the network with

Fig. 6. Evaluation of netCSI while varying number of reporting nodes.

Fig. 7. False positives and false negatives.

3. We have considered only 100 as the maximum number of reporting nodes (a limited number of symptoms) and there are 315 nodes in the network, so the accuracy for all considered netCSI approaches and MAX-COVERAGE does not reach the value 1 in our results.



realistic topology, along with the false positives and false negatives of MAX-COVERAGE. In both networks, FP_{CR_GT} and FN_{CR_GT} of OPT2 are lower than those of MAX-COVERAGE in most cases. In addition, both FP_{CR_GT} and FN_{CR_GT} of OPT2 decrease with an increase in the number of reporting nodes for both networks. However, they more or less saturate for cases with more than 50 reporting nodes, i.e., the extra information obtained from reporting nodes 50 to 100 could be considered redundant. Therefore, the trade-off between accuracy and the overhead of collecting information from reporting nodes could be exploited by selecting an optimum number of reporting nodes (minimum overhead) with exactly or close to 0 percent false negatives, i.e., almost all the faulty objects in the ground truth are localized.

7.4.4 Run Time of netCSI

In both networks, as the threshold value increases, the accuracy of OPT2 increases, as shown in Fig. 6, but at the same time, as shown in Fig. 8, the run-time also increases. Hence, the gain in accuracy of OPT2 compared to MAX-COVERAGE comes with an additional run-time overhead. As shown in Figs. 8a and 8b, the run-times of OPT2 with different threshold values do not follow any trend as we vary the number of reporting nodes.

In Section 6, we showed that the run-time of netCSI depends exponentially on the number of possible faulty objects (N), and linearly on the number of symptoms (c + d). Even though the value of (c + d) increases with the number of reporting nodes, there could be an instance where N is very high in one of the trials regardless of the number of reporting nodes. Such instances can be seen in both Figs. 8a and 8b when the number of reporting nodes is between 10 and 50. Moreover, since the code of netCSI is parallelized, the effect of the number of symptoms on run-time is reduced. In addition, with more parallelization, the overall run-time of netCSI could be decreased even further, unlike MAX-COVERAGE.

7.4.5 netCSI with Dynamic Threshold Value Selection

As explained in Section 4.2.2, netCSI with thresholding is efficient when we select an appropriate threshold value. Until now, we evaluated netCSI with a static threshold value chosen irrespective of the failure, i.e., we handpicked one threshold value manually for all failures. We proposed a technique called dynamic threshold value selection in Section 4.2.2 that selects a threshold value for each failure instance.

As discussed in Section 4.2.2, when a failure occurs, our technique uses both the training data and the currently observed negative symptoms to estimate the threshold value. The data is divided into different bins that represent different ranges of failures, i.e., in terms of the number of failures; we refer to the selected range as the size of the bin. During a failure instance, we select the threshold value as the maximum number of failures in the range of the selected bin, and we call it the dynamic threshold value.

In Fig. 9, we compare the performance of netCSI for various cases of static and dynamic threshold values. We consider static threshold values of 5, 10, 20, 30 and 40, and dynamic threshold values with bin sizes of 5, 10, 15 and 20. The number of reporting nodes is 10. We run experiments for six failure cases and show the average values of run time and accuracy in Fig. 9. We consider 10,000 offline samples in the training data.

From Fig. 9, we see that the accuracy of netCSI increases with the static threshold value, as expected. With the dynamic threshold, for all bin sizes, we achieve the same accuracy as the static threshold but with lower run time. We also report the average threshold value (in parentheses in Fig. 9) for all cases in which dynamic threshold value selection is used.

Even though the gap between the average threshold values for the static and dynamic cases is large, the difference in run times is relatively modest. The main reason is our pruning technique (Section 4.2.1): many combinations with a large number of objects are pruned from the search space. Our technique therefore provides an automated mechanism to choose a threshold value depending on the failure instance instead of handpicking a static value. In addition, our dynamic threshold selection can be used in various other inference techniques [3].

Fig. 8. Run-time of netCSI.

Fig. 9. Static and dynamic threshold values.



In our experiments, given 10 reporting nodes, we simulated 2,000 failures consisting of both random-space and random-cluster failures. For training data with 10,000 and 20,000 samples, we observe that the accuracy of picking the correct bin decreases as the bin size decreases. With 20,000 samples, we pick the correct bin 60 percent of the time for bin sizes greater than 15.

7.4.6 Effect of Partial Measurements at a Reporting Node

When we vary the number of reporting nodes in Section 7.4.3, we assume that complete information is available at each reporting node. This assumption may not hold in realistic scenarios; for example, a reporting node may not know whether every destination is reachable or unreachable. When reporting nodes report negative and positive symptoms to the network manager, no symptom is reported for a destination whose connectivity status is unknown to the reporting node. The network manager therefore receives incomplete positive and negative symptoms from the various reporting nodes.
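The reporting behavior described above can be sketched as a small helper. This is an illustrative sketch, not the paper's implementation: the function name and the three-valued connectivity encoding (True/False/None) are our own assumptions.

```python
def collect_symptoms(reporting_node, connectivity):
    """Sketch: a reporting node emits a symptom only for destinations
    whose connectivity status it actually knows. `connectivity` maps
    destination -> True (reachable, positive symptom), False
    (unreachable, negative symptom), or None (unknown: no symptom)."""
    positive = [(reporting_node, d) for d, up in connectivity.items() if up is True]
    negative = [(reporting_node, d) for d, up in connectivity.items() if up is False]
    return positive, negative

pos, neg = collect_symptoms("r1", {"d1": True, "d2": False, "d3": None})
assert pos == [("r1", "d1")] and neg == [("r1", "d2")]
```

Destinations with unknown status (here "d3") simply drop out, which is exactly how partial measurements produce incomplete symptom sets at the network manager.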

In Fig. 10, we show the effect on accuracy when the fraction of symptoms available at each selected reporting node falls randomly between 30 and 90 percent, i.e., in the range (30 percent, 90 percent) of the total number of positive and negative symptoms. In this case, we consider 30 reporting nodes. As expected, the performance of both netCSI and MAX-COVERAGE degrades with partial information at reporting nodes compared to the case with complete information. However, with partial information at every reporting node, netCSI still performs much better than MAX-COVERAGE.

We then evaluate the effect of partial information on the performance of both algorithms while varying the number of reporting nodes. We consider cases in which the partial information is 90, 60, and 30 percent of all symptoms at every chosen reporting node. These results are given in Fig. 11. For all cases, with 30, 60, and 90 percent and (30 percent, 90 percent) partial information, the accuracy of both netCSI and MAX-COVERAGE decreases compared to the scenario where every reporting node has 100 percent information. Note that we omit the results for 60 percent partial information from Fig. 11 because of space constraints. With only 30 percent partial information, the accuracy of netCSI decreases drastically. This indicates that choosing reporting nodes with a greater number of symptoms is more important than selecting a large number of reporting nodes that each have significantly fewer symptoms.

In Fig. 11, MAX-COVERAGE with various amounts of partial information follows the same trend as netCSI, but performs worse. We also observe that the accuracy of MAX-COVERAGE with 100 percent information is much lower than that of netCSI operating with a substantial amount of partial symptoms (i.e., more than 30 percent of symptoms) when the number of reporting nodes is limited (at most 40 reporting nodes). This shows that netCSI is more robust than MAX-COVERAGE under partial information at reporting nodes when there are few reporting nodes.

7.4.7 Performance of netCSI under Independent Failures

We also perform simulations under independent failures and observe that the hypothesis list of netCSI is highly accurate with no false negatives, but contains many false positives. Over multiple runs of random independent failures, netCSI returns 70-75 percent false positives, which is very high compared to the 20 percent returned by MAX-COVERAGE. The reason is netCSI's assumption that geographically adjacent objects fail together in clustered failures; this correlation leads netCSI to overestimate the number of failed objects. Using this insight, we propose an adaptive algorithm that works for both independent and large-scale failures in subsequent work [25].

8 CONCLUSION

In this paper, we developed netCSI to accurately diagnose faults due to large-scale failures in spite of incomplete symptom information. While varying the number of reporting nodes, netCSI achieves average accuracy gains of 128 and 45 percent over MAX-COVERAGE for realistic and random topologies, respectively. In our results, we observed that reporting nodes with many symptoms are more valuable than reporting nodes with very few symptoms. netCSI utilizes both positive and negative symptoms along with knowledge of the pre-failure network topology to generate a hypothesis list of possible combinations that

Fig. 10. Partial information at reporting nodes.

Fig. 11. Comparison of netCSI and MAX-COVERAGE under partial information at reporting nodes.

366 IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, VOL. 13, NO. 3, MAY/JUNE 2016


cause the symptoms. Due to the inclusion of positive symptoms, the size of the hypothesis list is reduced by almost 99.9 percent, and the percentage of false positives is reduced by at least 60 percent compared to netCSI with only negative symptoms. In addition, the reductions in the search space of netCSI due to pruning and thresholding are 94 and 98 percent, respectively. We also show the effectiveness of the dynamic threshold value selector when employing thresholding.

ACKNOWLEDGMENTS

This research was sponsored by the US Army Research Laboratory and the U.K. Ministry of Defence and was accomplished under Agreement Number W911NF-06-3-0001. The views and conclusions contained in this document are those of the author(s) and should not be interpreted as representing the official policies, either expressed or implied, of the US Army Research Laboratory, the US Government, the U.K. Ministry of Defence, or the U.K. Government. The US and U.K. Governments are authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon. In addition to ITA, this research was sponsored by the Defense Threat Reduction Agency (DTRA).

REFERENCES

[1] R. R. Kompella, J. Yates, A. Greenberg, and A. C. Snoeren, “Detection and localization of network black holes,” in Proc. IEEE 26th Int. Conf. Comput. Commun., 2007, pp. 2180–2188.

[2] S. Kandula and D. Katabi, “Shrink: A tool for failure diagnosis in IP networks,” in Proc. ACM SIGCOMM Workshop Mining Netw. Data, 2005, pp. 173–178.

[3] P. Bahl, R. Chandra, A. Greenberg, S. Kandula, D. A. Maltz, and M. Zhang, “Towards highly reliable enterprise network services via inference of multi-level dependencies,” in Proc. Conf. Appl., Technol., Archit., Protocols Comput. Commun., 2007, pp. 13–24.

[4] A. Ogielski and J. Cowie. (2002). Internet routing behavior on 9/11 and in the following weeks. [Online]. Available: http://www.renesys.com/tech/presentations/pdf/renesys-030502-NRC-911.pdf

[5] J. Cowie, A. Popescu, and T. Underwood. (2005). Impact of Hurricane Katrina on Internet infrastructure. [Online]. Available: http://www.renesys.com/tech/presentations/pdf/Renesys-Katrina-Report-9sep2005.pdf

[6] S. LaPerriere. (2007). Taiwan earthquake fiber cuts: A service provider view. [Online]. Available: http://www.nanog.org/meetings/nanog39/presentations/laperriere.pdf

[7] R. R. Kompella, J. Yates, A. Greenberg, and A. C. Snoeren, “IP fault localization via risk modeling,” in Proc. ACM Symp. Netw. Syst. Des. Implementation, 2005, pp. 57–70.

[8] M. Steinder and A. S. Sethi, “Probabilistic fault diagnosis in communication systems through incremental hypothesis updating,” Comput. Netw., vol. 45, no. 4, pp. 537–562, 2004.

[9] A. Sahoo, K. Kant, and P. Mohapatra, “Improving BGP convergence delay for large-scale failures,” in Proc. Dependable Syst. Netw., 2006, pp. 323–332.

[10] B. Bassiri and S. S. Heydari, “Network survivability in large-scale regional failure scenarios,” in Proc. 2nd Canadian Conf. Comput. Sci. Softw. Eng., 2009, pp. 83–87.

[11] T. Bu, N. Duffield, F. L. Presti, and D. Towsley, “Network tomography on general topologies,” SIGMETRICS Perform. Eval. Rev., vol. 30, pp. 21–30, Jun. 2002.

[12] Y. Chen, D. Bindel, H. Song, and R. H. Katz, “An algebraic approach to practical and scalable overlay network monitoring,” in Proc. Conf. Appl., Technol., Archit. Protocols Comput. Commun., 2004, pp. 55–66.

[13] N. Duffield, “Network tomography of binary network performance characteristics,” IEEE Trans. Inform. Theory, vol. 52, no. 12, pp. 5373–5388, Dec. 2006.

[14] Y. Zhao, Y. Chen, and D. Bindel, “Towards unbiased end-to-end network diagnosis,” in Proc. ACM Conf. Appl., Technol., Archit. Protocols Comput. Commun., 2006, pp. 219–230.

[15] V. Padmanabhan, L. Qiu, and H. Wang, “Server-based inference of internet link lossiness,” in Proc. 22nd Annu. Joint Conf. IEEE Comput. Commun. Soc., vol. 1, 2003, pp. 145–155.

[16] J. Cao, D. Davis, S. Vander Wiel, and B. Yu, “Time-varying network tomography: Router link data,” J. Am. Statist. Assoc., vol. 95, pp. 1063–1075, 2000.

[17] A. Tsang, M. Coates, and R. D. Nowak, “Network delay tomography,” IEEE Trans. Signal Process., vol. 51, pp. 2125–2136, Aug. 2003.

[18] Y. Huang, N. Feamster, and R. Teixeira, “Practical issues with using network tomography for fault diagnosis,” SIGCOMM Comput. Commun. Rev., vol. 38, pp. 53–58, 2008.

[19] D. Ghita, K. Argyraki, and P. Thiran, “Network tomography on correlated links,” in Proc. 10th ACM SIGCOMM Conf. Internet Meas., 2010, pp. 225–238.

[20] B. W. Silverman, Density Estimation for Statistics and Data Analysis. Boca Raton, FL, USA: CRC Press, 1986.

[21] S. Tati, S. Rager, and T. La Porta, “netCSI: A generic fault diagnosis algorithm for computer networks,” Netw. Secur. Res. Center, Penn State Univ., Pennsylvania, USA, Tech. Rep. NAS-TR-0130-2010, Jun. 2010.

[22] D. Magoni, “nem: A software for network topology analysis and modeling,” in Proc. 10th IEEE Int. Symp. Model., Anal. Simul. Comput. Telecommun. Syst., 2002, pp. 364–371.

[23] (2003). Rocketfuel project: Internet topologies. [Online]. Available: http://www.cs.washington.edu/research/networking/rocketfuel/

[24] B. G. Mirkin, Mathematical Classification and Clustering. New York, NY, USA: Springer, 1996.

[25] S. Tati, B.-J. Ko, G. Cao, A. Swami, and T. La Porta, “Adaptive algorithms for diagnosing large-scale failures in computer networks,” IEEE Trans. Parallel Distrib. Syst., vol. 26, no. 3, pp. 646–656, Mar. 2015.

Srikar Tati received the bachelor of technology (B.Tech Hons) degree from the Indian Institute of Technology, Kharagpur, in 2006, the master of science (MS) degree from Pennsylvania State University in 2008, and the PhD degree from Pennsylvania State University. He is currently working in Ericsson R&D. His research interests include the broad areas of design, modeling, and analysis of network systems, large distributed systems, and wireless networks. As part of his dissertation, he worked on various problems arising from challenges in failure and performance diagnosis algorithms in computer networks. He previously worked on several problems in wireless networks. He has published his work in conferences such as IEEE DSN, ICNP, and SRDS, and has reviewed papers for JSAC Network Science, IEEE GLOBECOM, and others.

Scott Rager received the BA degree in physics from Slippery Rock University of Pennsylvania and the BS degree in computer engineering from Penn State, both in 2009, after completing a cooperative dual-degree undergraduate program. He is working towards the PhD degree in the Computer Science and Engineering Department at Penn State University. He is presently a member of the Institute for Networking and Security Research at Penn State, and his current research interests include mobile ad hoc networks and distributed protocol designs.



Bongjun Ko received the BS and MS degrees in electrical engineering from Seoul National University, Korea, and the PhD degree in electrical engineering from Columbia University, New York. He has worked as a senior member of research staff at Philips Research North America and as a research engineer at LG Electronics in Korea, and co-founded NeoMTel, Korea. He is currently a research staff member at the IBM T. J. Watson Research Center, working in the Networking Technology Area, with research interests in the modeling, analysis, and design of large distributed systems and mobile and wireless networks. He received the best paper awards at IEEE ICNP 2003 and IEEE/IFIP IM 2013.

Guohong Cao received the BS degree in computer science from Xian Jiaotong University and the PhD degree in computer science from the Ohio State University in 1999. Since then, he has been with the Department of Computer Science and Engineering at the Pennsylvania State University, where he is currently a professor. He has published more than 200 papers in the areas of wireless networks, wireless security, vehicular networks, wireless sensor networks, cache management, and distributed fault-tolerant computing. He has served on the editorial boards of the IEEE Transactions on Mobile Computing, IEEE Transactions on Wireless Communications, and IEEE Transactions on Vehicular Technology, and on the organizing and technical program committees of many conferences, including as TPC chair/co-chair of IEEE SRDS 2009, MASS 2010, and INFOCOM 2013. He received the NSF CAREER award in 2001. He is a fellow of the IEEE.

Ananthram Swami (F’08) received the BTech degree from IIT-Bombay, the MS degree from Rice University, and the PhD degree from the University of Southern California (USC), all in electrical engineering. He has held positions with Unocal Corporation, USC, CS-3, and Malgudi Systems. He was a statistical consultant to the California Lottery, developed a Matlab-based toolbox for non-Gaussian signal processing, and has held visiting faculty positions at INP, Toulouse. He is with the US Army Research Laboratory (ARL), where he is the Army’s ST for network science. His work is in the broad area of network science, with emphasis on wireless communication networks. He is an ARL fellow and a fellow of the IEEE.

Thomas F. La Porta received the BSEE and MSEE degrees from The Cooper Union, New York, and the PhD degree in electrical engineering from Columbia University, New York. He is the William E. Leonhard chair professor in the Computer Science and Engineering Department at Penn State, which he joined in 2002, and is the director of the Institute of Networking and Security Research at Penn State. Prior to joining Penn State, he was with Bell Laboratories from 1986, where he was the director of the Mobile Networking Research Department in Bell Laboratories, Lucent Technologies. He received the Bell Labs Distinguished Technical Staff Award in 1996 and an Eta Kappa Nu Outstanding Young Electrical Engineer Award in 1996, and won Thomas Alva Edison Patent Awards in 2005 and 2009. He was the founding editor-in-chief of the IEEE Transactions on Mobile Computing and currently serves on its steering committee (he was chair from 2009-2010). He served as editor-in-chief of IEEE Personal Communications Magazine, was the director of magazines for the IEEE Communications Society, and was on its Board of Governors for three years. He has published numerous papers and holds 35 patents. He is an IEEE fellow and a Bell Labs fellow.

