
SDM-Networks 2016

The Third SDM Workshop on Mining Networks and Graphs:

A Big Data Analytic Challenge

May 7, 2016, Miami, Florida, USA

In conjunction with

2016 SIAM International Conference on Data Mining (SDM16)

Workshop Chairs

Maleq Khan (Virginia Tech)

Christine Klymko (Lawrence Livermore National Laboratory)

Larry Holder (Washington State University)


Preface

Real-world applications give rise to networks that are unstructured and often comprised of several components. Furthermore, they can support multiple dynamical processes that shape the network over time. Network science refers to the broad discipline that seeks to understand the underlying principles that govern the synthesis, analysis and co-evolution of networks. In some cases, the data relevant for mining patterns and making decisions comes from multiple heterogeneous sources and streams in over time. Graphs are a popular representation for such data because of their ability to represent different entity and relationship types, including the temporal relationships necessary to represent the dynamics of a data stream. However, fusing such heterogeneous data into a single graph or multiple related graphs and mining them are challenging tasks. Emerging massive data has made such tasks even more challenging.

This workshop will bring together researchers and practitioners in the field to deal with the emerging challenges in processing and mining large-scale networks. Such networks can be directed as well as undirected; they can be labeled or unlabeled, weighted or unweighted, and static or dynamic. Networks of networks are also of interest. Specific scientific topics of interest for this meeting include mining for patterns of interest in networks, efficient algorithms (sequential/parallel, exact/approximation) for analyzing network properties, methods and systems for processing large networks (e.g., MapReduce, GraphX, Giraph), use of linear algebra and numerical analysis for mining complex networks, database techniques for processing networks, and fusion of heterogeneous data sources into graphs. Another particular topic of interest is to couple structural properties of networks to the dynamics over networks, e.g., contagions.

This day-long workshop will feature a keynote address, several invited talks, contributed papers, and a panel discussion on open problems and directions for future research.


Program Committee

• Leman Akoglu (SUNY Stony Brook)

• Albert Bifet (University of Waikato and Noah’s Ark Lab)

• Rajmonda Caceres (MIT Lincoln Lab)

• Sutanay Choudhury (Pacific Northwest National Lab)

• Bill Eberle (Tennessee Technological University)

• Mohammad Al Hasan (Indiana University - Purdue University Indianapolis)

• David Kempe (University of Southern California)

• Christopher Kuhlman (Virginia Tech)

• Kamesh Madduri (Pennsylvania State University)

• Madhav Marathe (Virginia Tech)

• Ali Pinar (Sandia National Laboratories)

• Anil Vullikanti (Virginia Tech)

• Yinghui Wu (Washington State University)


Keynote Speaker

TBD

Invited Speakers

TBD

List of Papers

• On the Geometry and Extremal Properties of the Edge-Degeneracy Model
  Nicolas Kim, Dane Wilburne, Sonja Petrovic, and Alessandro Rinaldo

• Collaborative SVM Classification in Skewed Peer-to-Peer Networks
  Umer Khan, Alexandros Nanopoulos, and Lars Schmidt-Thieme

• Towards Scalable Graph Analytics on Time Dependent Graphs
  Suraj Poudel, Roger Pearce, and Maya Gokhale

• GSK: Graph Sparsification as a Knapsack problem formulation
  Hongyuan Zhan and Kamesh Madduri

• Trust from the past: Bayesian Personalized Ranking based Link Prediction in Knowledge Graphs
  Baichuan Zhang, Sutanay Choudhury, Mohammad Al Hasan, Xia Ning, Khushbu Agarwal, Sumit Purohit, and Paola Pesantez Cabrera


On the Geometry and Extremal Properties of the Edge-Degeneracy Model*

Nicolas Kim†‡ Dane Wilburne†§ Sonja Petrovic§ Alessandro Rinaldo‡

Abstract

The edge-degeneracy model is an exponential random graph model that uses the graph degeneracy, a measure of the graph's connection density, and the number of edges in a graph as its sufficient statistics. We show this model is relatively well-behaved by studying the statistical degeneracy of this model through the geometry of the associated polytope.

Keywords: exponential random graph model, degeneracy, k-core, polytope

1 Introduction

Statistical network analysis is concerned with developing statistical tools for assessing, validating and modeling the properties of random graphs, or networks. The very first step of any statistical analysis is the formalization of a statistical model, a collection of probability distributions over the space of graphs (usually, on a fixed number of nodes n), which will serve as a reference model for any inferential tasks one may want to perform. Statistical models are in turn designed to be interpretable and, at the same time, to be capable of reproducing the network characteristics pertaining to the particular problem at hand. Exponential random graph models, or ERGMs, are arguably the most important class of models for networks, with a long history. They are especially useful when one wants to construct models that resemble the observed network, but without the need to define an explicit network formation mechanism. In the interest of space, we single out classical references [2], [4], [9] and a recent review paper [8].

Central to the specification of an ERGM is the choice of sufficient statistic, a function on the space of graphs, usually vector-valued, that captures the particular properties of a network that are of scientific interest. Common examples of sufficient statistics are the number of edges, triangles, or k-stars, the degree sequence, etc.; for an overview, see [8]. The choice of a sufficient statistic is not to be taken for granted: it depends on the application at hand and at the same time

*Partially supported by AFOSR grant #FA9550-14-1-0141.
†Equal contribution.
‡Department of Statistics, Carnegie Mellon University. Email: [email protected], [email protected].
§Department of Applied Mathematics, Illinois Institute of Technology. Email: [email protected], [email protected].

it dictates the statistical and mathematical behavior of the ERGM. While there is not a general classification of 'good' and 'bad' network statistics, some lead to models that behave better asymptotically than others, so that computation and inference on large networks can be handled in a reliable way.

In an ERGM, the probability of observing any given graph depends on the graph only through the value of its sufficient statistic, and is therefore modulated by how much or how little the graph expresses those properties captured by the sufficient statistics. As there is virtually no restriction on the choice of the sufficient statistics, the class of ERGMs therefore possesses remarkable flexibility and expressive power, and offers, at least in principle, a broadly applicable and statistically sound means of validating any scientific theory on real-life networks. However, despite their simplicity, ERGMs are also difficult to analyze and are often thought to behave in pathological ways, e.g., give significant mass to extreme graph configurations. Such properties are often referred to as degeneracy; here we will refer to it as statistical degeneracy [10] (not to be confused with graph degeneracy below). Further, their asymptotic properties are largely unknown, though there has been some recent work in this direction; for example, [6] offer a variational approach, while in some cases it has been shown that their geometric properties can be exploited to reveal their extremal asymptotic behaviors [22]; see also [18]. These types of results are interesting not only mathematically, but have statistical value: they provide a catalogue of extremal behaviors as a function of the model parameters and illustrate the extent to which statistical degeneracy may play a role in inference.

In this article we define and study the properties of the ERGM whose sufficient statistics vector consists of two quantities: the edge count, familiar to and often used in the ERGM family, and the graph degeneracy, novel to the statistics literature. (These quantities may be scaled appropriately, for purposes of asymptotic considerations; see Section 2.) As we will see, graph degeneracy arises from the graph's core structure, a property that is new to the ERGM framework [11], but is a natural connectivity statistic that gives a sense of how densely connected the most important actors in the network are. The core structure of a graph (see Definition 2.1) is of interest to social scientists and other researchers in a variety of applications, including the identification and ranking of influencers (or "spreaders") in networks (see [14] and [1]), examining robustness to node failure, and for visualization techniques for large-scale networks [5]. The degeneracy of a graph is simply the statistic that records the largest core.

Cores are used as descriptive statistics in several network applications (see, e.g., [16]), but until recently, very little was known about statistical inference from this type of graph property: [11] shows that cores are unrelated to node degrees and that restricting graph degeneracy yields reasonable core-based ERGMs. Yet, there are currently no rigorous statistical models for networks in terms of their degeneracy. The results in this paper thus add a dimension to our understanding of cores by exhibiting the behavior of the joint edge-degeneracy statistic within the context of the ERGM that captures it, and provide extremal results critical to estimation and inference for the edge-degeneracy model.

We define the edge-degeneracy ERGM in Section 2, investigate its geometric structure in Sections 3, 4, and 5, and summarize the relevance to statistical inference in Section 6.

2 The edge-degeneracy (ED) model

This section presents the necessary graph-theoretical tools, establishes notation, and introduces the ED model. Let Gn denote the space of (labeled, undirected) simple graphs on n nodes, so $|G_n| = 2^{\binom{n}{2}}$.

To define the family of probability distributions over Gn comprising the ED model, we first define the degeneracy statistic.

Definition 2.1. Let G = (V, E) be a simple, undirected graph. The k-core of G is the maximal subgraph of G with minimum degree at least k. Equivalently, the k-core of G is the subgraph obtained by iteratively deleting vertices of degree less than k. The graph degeneracy of G, denoted degen(G), is the maximum value of k for which the k-core of G is non-empty.

This idea is illustrated in Figure 1, which shows a graph G and its 2-core. In this case, degen(G) = 4.

Figure 1: A small graph G (left) and its 2-core (right). The degeneracy of this graph is 4.
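The iterative-deletion characterization in Definition 2.1 translates directly into the standard peeling algorithm. A minimal sketch (my own code, not from the paper; assumes the graph is given as an adjacency dict):

```python
def degeneracy(adj):
    """Graph degeneracy via peeling: repeatedly remove a minimum-degree
    vertex; the largest degree observed at removal time equals the
    largest k for which the k-core is non-empty (Definition 2.1)."""
    adj = {v: set(nbrs) for v, nbrs in adj.items()}  # defensive copy
    degen = 0
    while adj:
        v = min(adj, key=lambda u: len(adj[u]))      # a minimum-degree vertex
        degen = max(degen, len(adj[v]))
        for u in adj[v]:
            adj[u].discard(v)
        del adj[v]
    return degen

# Sanity checks: K4 has degeneracy 3, a path 1, a cycle 2.
k4 = {i: {j for j in range(4) if j != i} for i in range(4)}
assert degeneracy(k4) == 3
```

This runs in near-linear time with a bucket queue; the simple `min` scan above is quadratic but adequate for illustration.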

The edge-degeneracy ERGM is the statistical model on Gn whose sufficient statistics are the rescaled graph degeneracy and the edge count of the observed graph. Concretely, for G ∈ Gn let

$$t(G) = \Big( E(G)\big/\tbinom{n}{2},\ \mathrm{degen}(G)/(n-1) \Big), \qquad (2.1)$$

where E(G) is the number of edges of G. The ED model on Gn is the ERGM $\{P_{n,\theta} : \theta \in \mathbb{R}^2\}$, where

$$P_{n,\theta}(G) = \exp\{\langle \theta, t(G)\rangle - \psi(\theta)\} \qquad (2.2)$$

is the probability of observing the graph G ∈ Gn for the choice of model parameter θ ∈ R². The log-partition function ψ: R² → R, given by $\psi(\theta) = \log \sum_{G \in \mathcal{G}_n} e^{\langle \theta, t(G)\rangle}$, serves as a normalizing constant, so that probabilities add up to 1 for each choice of θ (notice that ψ(θ) < ∞ for all θ, as Gn is finite).

Notice that different choices of θ = (θ₁, θ₂) will lead to rather different distributions. For example, for large and positive values of θ₁ and θ₂ the probability mass concentrates on dense graphs, while negative values of the parameters will favor sparse graphs. More interestingly, when one parameter is positive and the other is negative, the model will favor configurations in which the edge and degeneracy count will be balanced against each other. Our results in Section 5 will provide a catalogue of such behaviors in extremal cases and for large n.
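For intuition, the model (2.1)–(2.2) can be brute-forced for very small n, since $|G_n| = 2^{\binom{n}{2}}$. The sketch below (illustrative code of mine, not the authors') enumerates all graphs on 3 nodes, computes t(G), and checks the normalization; pushing both coordinates of θ up shifts the mass toward dense graphs, as described above.

```python
import math
from itertools import combinations

def degeneracy(n, edges):
    """Peeling algorithm per Definition 2.1."""
    adj = {v: set() for v in range(n)}
    for u, w in edges:
        adj[u].add(w); adj[w].add(u)
    degen = 0
    while adj:
        v = min(adj, key=lambda u: len(adj[u]))
        degen = max(degen, len(adj[v]))
        for u in adj[v]:
            adj[u].discard(v)
        del adj[v]
    return degen

def ed_probabilities(n, theta):
    """P_{n,theta}(G) = exp{<theta, t(G)> - psi(theta)} over all graphs on
    n nodes, with t(G) = (E(G)/C(n,2), degen(G)/(n-1)) as in (2.1)."""
    pairs = list(combinations(range(n), 2))
    ts = []
    for mask in range(2 ** len(pairs)):       # every edge subset = one graph
        edges = [p for i, p in enumerate(pairs) if (mask >> i) & 1]
        ts.append((len(edges) / len(pairs), degeneracy(n, edges) / (n - 1)))
    logw = [theta[0] * t[0] + theta[1] * t[1] for t in ts]
    psi = math.log(sum(math.exp(w) for w in logw))   # log-partition function
    return [math.exp(w - psi) for w in logw]

probs = ed_probabilities(3, (2.0, -1.0))
assert abs(sum(probs) - 1.0) < 1e-9          # probabilities normalize
```

With θ = (20, 5), for instance, nearly all mass sits on the complete graph, matching the qualitative discussion of parameter signs.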

The normalization of the degeneracy and the edge count in (2.1) and the presence of the coefficient $\binom{n}{2}$ in the ED probabilities (2.2) are to ensure a non-trivial limiting behavior as n → ∞, since E(G) and degen(G) scale differently in n (see, e.g., [6] and [22]). This normalization is not strictly necessary for our theoretical results to hold. However, the ED model, like most ERGMs, is not consistent, thus making asymptotic considerations somewhat problematic.

Lemma 2.1. The edge-degeneracy model is an ERGM that is not consistent under sampling, as in [21].

Proof. The range of graph degeneracy values when going from a graph with n vertices to one with n + 1 vertices depends on the original graph; e.g., if there is a 2-star that is not a triangle in a graph with three vertices, the addition of another vertex can form a triangle and increase the graph degeneracy to 2, but if there was not a 2-star then there is no way to increase the graph degeneracy. Since the range is not constant, this ERGM is not consistent under sampling.

Thus, as the number of vertices n grows, it is important to note the following property of the ED model, not uncommon in ERGMs: inference on the whole network cannot be done by applying the model to subnetworks.

In the next few sections we will study the geometry of the ED model as a means to derive some of its asymptotic properties. The use of polyhedral geometry in the statistical analysis of discrete exponential families is well established: see, e.g., [2], [4], [7], [20], [19].

3 Geometry of the ED model polytope

The edge-degeneracy ERGM (2.2) is a discrete exponential family, for which the geometric structure of the model carries important information about parameter estimation, including existence of the maximum likelihood estimate (MLE); see the above-mentioned references. This geometric structure is captured by the model polytope.

The model polytope Pn of the ED model on Gn is the convex hull of the set of all possible edge-degeneracy pairs for graphs in Gn. In symbols,

$$P_n := \mathrm{conv}\big\{ (E(G), \mathrm{degen}(G)) : G \in \mathcal{G}_n \big\} \subset \mathbb{R}^2.$$

Note the use of the unscaled version of the sufficient statistics in defining the model polytope. In this section, the scaling used in the model definition (2.1) has little impact on the shape of Pn; thus, for simplicity of notation, we do not include it in the definition of Pn. The scaling factors will be re-introduced, however, when we consider the normal fan and the asymptotics in Section 4.

In the following, we characterize the geometric properties of Pn that are crucial to statistical inference. First, we arrive at a startling result, Proposition 3.1: every integer point in the model polytope is a realizable statistic. One implication of this is obtained in conjunction with other asymptotic results discussed below. Second, Proposition 3.3 implies that the observed network statistics will almost surely lie in the relative interior of the model polytope, which is an important property because estimation algorithms are guaranteed to behave well when off the boundary of the polytope. This also implies that the MLE for the edge-degeneracy ERGM exists almost surely for large graphs. In other words, there are very few network observations that can lead to statistical degeneracy, that is, the bad behavior of the model for which some ERGMs are famous. That behavior implies that a subset of the natural parameters is non-estimable, making complete inference impossible; thus its avoidance by the edge-degeneracy ERGM is a desirable outcome. In summary, Propositions 3.1, 3.3, 3.4 and Theorem 3.1 completely characterize the geometry of Pn and thus solve [17, Problem 4.3] for this particular ERGM. Remarkably, this problem, although critical for our understanding of the reliability of inference for such models, has not been solved for most ERGMs except, for example, the beta model [20], which relied heavily on known graph-theoretic results.

Let us consider Pn for some small values of n. The polytope P10 is plotted in Figure 2.

The case n = 3. There are four non-isomorphic graphs on 3 vertices, and each gives rise to a distinct edge-degeneracy vector:

(0, 0), (1, 1), (2, 1), (3, 2).

Hence P3 = conv{(0, 0), (1, 1), (2, 1), (3, 2)}. Note that in this case, each realizable edge-degeneracy vector lies on the boundary of the model polytope. We will see below that n = 3 is the unique value of n for which there are no realizable edge-degeneracy vectors contained in the relative interior of Pn.

The case n = 4. On 4 vertices there are 11 non-isomorphic graphs but only 8 distinct edge-degeneracy vectors. Without listing the graphs, the edge-degeneracy vectors are:

(0, 0), (1, 1), (2, 1), (3, 1), (3, 2), (4, 2), (5, 2), (6, 3).

Here we pause to make the simple observation that Pn ⊂ Pn+1 always holds. Indeed, every realizable edge-degeneracy vector for graphs on n vertices is also realizable for graphs on n + 1 vertices, since adding a single isolated vertex to a graph affects neither the number of edges nor the graph degeneracy.

The case n = 5. There are 34 non-isomorphic graphs on n = 5 vertices but only 15 realizable edge-degeneracy vectors. They are:

(0, 0), (1, 1), (2, 1), (3, 1), (3, 2), (4, 2), (5, 2), (6, 3),

(4, 1), (6, 2), (7, 2), (7, 3), (8, 3), (9, 3), (10, 4),

where the pairs listed on the top row are contained in P4 and the pairs on the second row are contained in P5 \ P4. Here we make the observation that the proportion of realizable edge-degeneracy vectors lying in the interior of Pn seems to be increasing with n. This phenomenon is addressed in Proposition 3.3 below. Figure 2 depicts the integer points that define P10.


Figure 2: The integer points that define the model polytope P10 (horizontal axis: edges, 0–45; vertical axis: degeneracy, 0–9).

The case for general n. It is well known in the theory of discrete exponential families that the MLE exists if and only if the average sufficient statistic of the sample lies in the relative interior of the model polytope. This leads us to investigate which pairs of integer points correspond to realizable edge-degeneracy vectors.

Proposition 3.1. Every integer point contained in Pn is a realizable edge-degeneracy vector.

Proof. Suppose that G is a graph on n vertices with degen(G) = d ≤ n − 1. Let Un(d) be the minimum number of edges over all such graphs and let Ln(d) be the maximum number of edges over all such graphs. Our strategy will be to show that for all e such that Un(d) ≤ e ≤ Ln(d), there exists a graph G on n vertices such that degen(G) = d and E(G) = e.

First, observe that if degen(G) = d, then there are at least d + 1 vertices of G in the d-core. Using this observation, it is not difficult to see that $U_n(d) = \binom{d+1}{2}$. This is the minimum number of edges required to construct a graph with a non-empty d-core; hence, the upper boundary of Pn consists of the points $(\binom{d+1}{2}, d)$ for 0 ≤ d ≤ n − 1. Further, it is clear that there is exactly one graph (up to isomorphism) corresponding to the edge-degeneracy vector $(\binom{d+1}{2}, d)$; it is the graph

$$K_{d+1} + \underbrace{K_1 + \cdots + K_1}_{n-d-1 \text{ times}}, \qquad (3.3)$$

i.e., the complete graph on d + 1 vertices along with n − d − 1 isolated vertices.

It is an immediate consequence of [11, Proposition 11] that $L_n(d) = \binom{d+1}{2} + (n-d-1)\cdot d$. Thus, for each e such that $U_n(d) = \binom{d+1}{2} < e < \binom{d+1}{2} + (n-d-1)\cdot d = L_n(d)$, we must show how to construct a graph G on n vertices with graph degeneracy d and e edges. Call the resulting graph $G_{n,d,e}$. To construct $G_{n,d,e}$, start with the graph in (3.3). Label the isolated vertices $v_1, \ldots, v_{n-d-1}$. For each j such that $1 \le j \le e - \binom{d+1}{2}$, add the edge $e_j$ by making the vertex $v_i$ such that $i \equiv j \bmod (n-d-1)$ adjacent to an arbitrary vertex of $K_{d+1}$. This process results in a graph with exactly $e = \binom{d+1}{2} + j$ edges, and since $j < (n-d-1)\cdot d$, our construction guarantees that we have not increased the graph degeneracy. Hence, we have constructed $G_{n,d,e}$. Combined with Lemma 3.1 below, this proves the desired result.
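The constructive step of this proof is easy to check by machine. A sketch (my own code, not the authors'): build $G_{n,d,e}$ exactly as described, attaching each extra edge from the isolated vertex $v_i$, $i \equiv j \bmod (n-d-1)$, to a clique vertex it is not yet adjacent to, and verify the resulting edge count and degeneracy.

```python
from itertools import combinations

def degeneracy(n, edges):
    """Peeling algorithm per Definition 2.1."""
    adj = {v: set() for v in range(n)}
    for u, w in edges:
        adj[u].add(w); adj[w].add(u)
    degen = 0
    while adj:
        v = min(adj, key=lambda u: len(adj[u]))
        degen = max(degen, len(adj[v]))
        for u in adj[v]:
            adj[u].discard(v)
        del adj[v]
    return degen

def G_nde(n, d, e):
    """Graph on n vertices with degeneracy d and e edges, following the
    proof of Proposition 3.1. Assumes U_n(d) <= e <= L_n(d) and d <= n-2."""
    edges = set(combinations(range(d + 1), 2))   # K_{d+1}
    isolated = list(range(d + 1, n))             # v_1, ..., v_{n-d-1}
    used = {v: 0 for v in isolated}              # clique vertices used by each v_i
    for j in range(e - len(edges)):
        v = isolated[j % len(isolated)]          # i = j mod (n-d-1), round robin
        edges.add((used[v], v))                  # next unused vertex of K_{d+1}
        used[v] += 1
    return edges

# Exhaustive spot-check for n = 7: every e in [U_n(d), L_n(d)] is realized.
n = 7
for d in range(1, n - 1):
    U = (d + 1) * d // 2                         # U_n(d)
    L = U + (n - d - 1) * d                      # L_n(d)
    for e in range(U, L + 1):
        g = G_nde(n, d, e)
        assert len(g) == e and degeneracy(n, g) == d
```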

The preceding proof also shows the following:

Proposition 3.2. Pn contains exactly

$$\sum_{d=0}^{n-1} \big[(n-d-1)\cdot d + 1\big] \qquad (3.4)$$

integer points. This is the number of realizable edge-degeneracy vectors for every n.
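Formula (3.4) is cheap to sanity-check against exhaustive enumeration for tiny n; the text above already gives 4 realizable vectors for n = 3, 8 for n = 4, and 15 for n = 5. A brute-force sketch (my code, for illustration only):

```python
from itertools import combinations

def degeneracy(n, edges):
    """Peeling algorithm per Definition 2.1."""
    adj = {v: set() for v in range(n)}
    for u, w in edges:
        adj[u].add(w); adj[w].add(u)
    degen = 0
    while adj:
        v = min(adj, key=lambda u: len(adj[u]))
        degen = max(degen, len(adj[v]))
        for u in adj[v]:
            adj[u].discard(v)
        del adj[v]
    return degen

def realizable_vectors(n):
    """Distinct (E(G), degen(G)) pairs over all 2^C(n,2) labeled graphs."""
    pairs = list(combinations(range(n), 2))
    seen = set()
    for mask in range(2 ** len(pairs)):
        edges = [p for i, p in enumerate(pairs) if (mask >> i) & 1]
        seen.add((len(edges), degeneracy(n, edges)))
    return seen

for n in (3, 4, 5):
    count_34 = sum((n - d - 1) * d + 1 for d in range(n))   # formula (3.4)
    assert len(realizable_vectors(n)) == count_34
```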

The following property is useful throughout:

Lemma 3.1. Pn is rotationally symmetric.

Proof. For n ≥ 3 and d ∈ {1, . . . , n − 1},

$$L_n(d) - L_n(d-1) = U_n(n-d) - U_n(n-d-1).$$

Note that the center of rotation is the point ((n − 1)n/4, (n − 1)/2). The rotation is 180 degrees around that point.

As mentioned above, the following nice property of the ED model polytope implies that the MLE for the ED model exists almost surely for large graphs.

Proposition 3.3. Let pn denote the proportion of realizable edge-degeneracy vectors that lie in the relative interior of Pn. Then,

$$\lim_{n \to \infty} p_n = 1.$$

Proof. This result follows from analyzing the formula in (3.4) and uses the following lemma:

Lemma 3.2. There are 2n − 2 realizable lattice points on the boundary of Pn, and each is a vertex of the polytope.

Proof. We know that there are 2n − 2 lattice points on the boundary of Pn. We show that they are all vertices of Pn. Note that these will be the only vertices of Pn, since Pn is the closure of the convex hull of a set of lattice points S, so if S contains any lattice points, they will certainly all be in Pn.

First, since Pn ⊆ Z₊ × Z₊, and (0, 0) ∈ Pn for all n, we know that (0, 0) must be a vertex of Pn. By the rotational symmetry of Pn, ((n − 1)n/2, n − 1) must be a vertex, too.

It is sufficient to show that Δ_U(d) := Un(d) − Un(d − 1) satisfies Δ_U(d) ≠ Δ_U(d − 1) for all d ∈ {2, 3, . . . , n − 1}; this is because of the rotational symmetry of Pn. Since Δ_U(d) = d, and d ≠ d − 1 for any d, each point on the upper boundary must be a vertex of the polytope. This proves the lemma.

To prove the proposition, we then compute:

$$p_n = \frac{\sum_{d=0}^{n-1}\big[(n-d-1)\cdot d + 1\big] - (2n-2)}{\sum_{d=0}^{n-1}\big[(n-d-1)\cdot d + 1\big]} \to 1.$$
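Numerically, this convergence is quick. Combining (3.4) with the boundary count from Lemma 3.2 (my sketch, for illustration):

```python
def interior_proportion(n):
    """p_n = (total realizable vectors per (3.4), minus the 2n - 2 boundary
    lattice points of Lemma 3.2) / total."""
    total = sum((n - d - 1) * d + 1 for d in range(n))
    return (total - (2 * n - 2)) / total

assert interior_proportion(3) == 0.0        # n = 3: everything on the boundary
assert interior_proportion(10) > 0.8
assert interior_proportion(1000) > 0.999    # p_n -> 1, as in Proposition 3.3
```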

3.1 Extremal points of Pn. Now we turn our attention to the problem of identifying the graphs corresponding to extremal points of Pn. Clearly, the boundary point (0, 0) is uniquely attained by the empty graph $\overline{K}_n$ and the boundary point $(\binom{n}{2}, n-1)$ is uniquely attained by the complete graph Kn. The proof of Proposition 3.1 shows that the unique graph corresponding to the upper boundary point (Un(d), d) is a complete graph on d + 1 vertices union n − d − 1 isolated vertices. The lower boundary graphs are more complicated, but the graphs corresponding to two of them are classified in the following proposition.

Proposition 3.4. The graphs corresponding to the lower boundary point (Ln(1), 1) of Pn are exactly the trees on n vertices. The unique graph corresponding to the lower boundary point (Ln(n − 2), n − 2) is the complete graph on n vertices minus an edge.

Proof. First we consider graphs with edge-degeneracy vector (Ln(1), 1). By the formula given in the proof of Proposition 3.1, $L_n(1) = \binom{2}{2} + (n-1-1)\cdot 1 = n - 1$. A graph with degeneracy 1 must be acyclic, since otherwise it would have degeneracy at least 2. By a well-known classification theorem for trees, an acyclic graph on n vertices with n − 1 edges is a tree. Similarly, a graph corresponding to the lower boundary point (Ln(n − 2), n − 2) has $\binom{n-1}{2} + (n-(n-2)-1)(n-2) = \binom{n}{2} - 1$ edges. There is only one such graph: the complete graph on n vertices minus an edge.

The graphs corresponding to extremal points (Ln(d), d) for 2 ≤ d ≤ n − 3 are called maximally d-degenerate graphs and were studied extensively in [3]. Such graphs have many interesting properties, but are quite difficult to fully classify or enumerate.

In the following theorem, we show that the lower edge of Pn is like a mirrored version of the upper edge. This partially characterizes the remaining maximally d-degenerate graphs.

Theorem 3.1. Consider G(Un(d)) ∈ Gn to be the unique graph on n nodes with degeneracy d that has the minimum number of edges, given by Un(d). Similarly, let G(Ln(d)) ⊂ Gn be the set of graphs on n nodes with degeneracy d that have the maximum number of edges, given by Ln(d). Then, for all d ∈ {0, 1, . . . , n − 1},

$$\overline{G(U_n(d))} \in G(L_n(n-(d+1))).$$

Proof. As we know, $G(U_n(d)) = K_{d+1} + \overline{K}_{n-(d+1)}$. Taking the complement,

$$\overline{G(U_n(d))} = \overline{K}_{d+1} \vee K_{n-(d+1)}. \qquad (3.5)$$

We only need to show that this has Ln((n − 1) − d) edges and graph degeneracy (n − 1) − d.

It is straightforward to show that (3.5) has the right number of edges. Since G(Un(d)) has $\binom{d+1}{2}$ edges, $\overline{G(U_n(d))}$ must have $\binom{n}{2} - \binom{d+1}{2} = L_n((n-1)-d)$ edges.

As for the graph degeneracy, we know that $K_{n-(d+1)}$ must be a subgraph of $\overline{G(U_n(d))}$ from the last equality of (3.5). Since for any a and b we have $K_a \vee K_b = K_{a+b}$ by definition of the complete graph, and

$$K_1 \vee K_{n-(d+1)} \subseteq \overline{K}_{d+1} \vee K_{n-(d+1)} = \overline{G(U_n(d))}$$

by its construction, we have that $K_{n-(d+1)+1}$ is a subgraph of $\overline{G(U_n(d))}$, and therefore $\mathrm{degen}(\overline{G(U_n(d))}) \ge n-(d+1)$. However, $\mathrm{degen}(\overline{G(U_n(d))}) < n-(d+1)+1$ because $\overline{G(U_n(d))} \setminus K_{n-(d+1)}$ contains no edges; i.e., it is impossible for $K_{n-(d+1)+m}$ to be a subgraph of $\overline{G(U_n(d))}$ for any m > 1, so $\mathrm{degen}(\overline{G(U_n(d))}) = n-(d+1)$, as desired.
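The complement relation of Theorem 3.1 can also be checked exhaustively for a fixed small n (my sketch, not from the paper): the complement of $K_{d+1}$ plus isolated vertices should have $L_n(n-1-d)$ edges and degeneracy $n-(d+1)$.

```python
from itertools import combinations

def degeneracy(n, edges):
    """Peeling algorithm per Definition 2.1."""
    adj = {v: set() for v in range(n)}
    for u, w in edges:
        adj[u].add(w); adj[w].add(u)
    degen = 0
    while adj:
        v = min(adj, key=lambda u: len(adj[u]))
        degen = max(degen, len(adj[v]))
        for u in adj[v]:
            adj[u].discard(v)
        del adj[v]
    return degen

n = 8
L_n = lambda d: (d + 1) * d // 2 + (n - d - 1) * d   # L_n(d)
all_pairs = set(combinations(range(n), 2))
for d in range(n):
    g = set(combinations(range(d + 1), 2))   # G(U_n(d)): K_{d+1} + isolated
    comp = all_pairs - g                     # its complement
    assert len(comp) == L_n(n - 1 - d)
    assert degeneracy(n, comp) == n - (d + 1)
```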

4 Asymptotics of the ED model polytope and its normal fan

Since we will let n → ∞ in this section, it will be necessary to rescale the polytope Pn so that it is contained in [0, 1]², for each n. Thus, we divide the graph degeneracy parameter by n − 1 and the edge parameter by $\binom{n}{2}$, as we have already done in (2.1). While this rescaling has little impact on the shape of Pn discussed in Section 3, it does affect its normal fan, a key geometric object that plays a crucial role in our subsequent analysis. We describe the normal fan of the normalized polytope next.

Proposition 4.1. All of the perpendicular directions to the faces of Pn are

$$\{\pm(1, -m) : m \in \{1, 2, \ldots, n-1\}\}.$$

So, after normalizing, the directions are

$$\{\pm(1, -2\alpha(n-1)/n) : \alpha \in \{1/(n-1), 2/(n-1), \ldots, 1\}\}.$$

Proof. The slopes of the line segments defining each face of Pn are 1/Δ_U(d) = 1/d in the unnormalized parametrization. To get the slopes of the normalized polytope, just multiply each slope by $\binom{n}{2}/(n-1) = n/2$.

Our next goal is to describe the limiting shape of the normalized model polytope and its normal fan as n → ∞. We first collect some simple facts about the limiting behavior of the normalized graph degeneracy and edge count.

Proposition 4.2. If α ∈ [0, 1] is such that α(n − 1) ∈ N (so that α parameterizes the normalized graph degeneracy), then

$$\lim_{n \to \infty} \frac{U_n(\alpha(n-1))}{\binom{n}{2}} = \alpha^2.$$

Furthermore, due to the rotational symmetry of Pn,

$$\lim_{n \to \infty} \frac{L_n(\alpha(n-1))}{\binom{n}{2}} = 1 - (1-\alpha)^2.$$

Proof. By definition, $U_n(\alpha(n-1)) = \frac{\alpha^2}{2}n^2 + O(n)$. Hence,

$$\frac{U_n(\alpha(n-1))}{\binom{n}{2}} = \frac{\frac{\alpha^2}{2}n^2 + O(n)}{\frac{n(n-1)}{2}} \to \alpha^2.$$
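The two limits in Proposition 4.2 can be verified numerically (a quick sketch of mine, with α and n chosen so that α(n − 1) is an integer):

```python
alpha = 0.4
n = 10001                                  # alpha * (n - 1) = 4000 is an integer
d = int(alpha * (n - 1))
binom = n * (n - 1) // 2                   # C(n, 2)
u = (d + 1) * d / 2 / binom                # U_n(d) / C(n, 2)
l = u + (n - d - 1) * d / binom            # L_n(d) / C(n, 2)
assert abs(u - alpha ** 2) < 1e-3          # converges to alpha^2
assert abs(l - (1 - (1 - alpha) ** 2)) < 1e-3   # converges to 1 - (1-alpha)^2
```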

We can now proceed to describe the set limit corresponding to the sequence {Pn}n of model polytopes. Let

$$\mathcal{P} = \mathrm{cl}\big\{ t \in \mathbb{R}^2 : t = t(G),\ G \in \mathcal{G}_n,\ n = 1, 2, \ldots \big\}$$

be the closure of the set of all possible realizable statistics (2.1) from the model. Using Propositions 4.1 and 4.2 we can characterize P as follows.

Lemma 4.1. 1. Pn ⊂ P for all n and lim_n Pn = P.

2. Let L and U be functions from [0, 1] into [0, 1] given by

$$L(x) = 1 - \sqrt{1-x} \quad \text{and} \quad U(x) = \sqrt{x}.$$

Then,

$$\mathcal{P} = \big\{ (x, y) \in [0,1]^2 : L(x) \le y \le U(x) \big\}.$$
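Lemma 4.1's envelope can be checked against brute force for a small n: all normalized statistics t(G) for graphs on 5 nodes should satisfy L(x) ≤ y ≤ U(x). A sketch (my code, illustrative only):

```python
import math
from itertools import combinations

def degeneracy(n, edges):
    """Peeling algorithm per Definition 2.1."""
    adj = {v: set() for v in range(n)}
    for u, w in edges:
        adj[u].add(w); adj[w].add(u)
    degen = 0
    while adj:
        v = min(adj, key=lambda u: len(adj[u]))
        degen = max(degen, len(adj[v]))
        for u in adj[v]:
            adj[u].discard(v)
        del adj[v]
    return degen

n = 5
pairs = list(combinations(range(n), 2))
for mask in range(2 ** len(pairs)):
    edges = [p for i, p in enumerate(pairs) if (mask >> i) & 1]
    x = len(edges) / len(pairs)              # normalized edge count
    y = degeneracy(n, edges) / (n - 1)       # normalized degeneracy
    assert 1 - math.sqrt(1 - x) <= y + 1e-9  # L(x) <= y
    assert y <= math.sqrt(x) + 1e-9          # y <= U(x)
```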

The convex set P is depicted at the top of Figure 4. In order to study the asymptotics of extremal properties of the ED model, the final step is to describe all the normals to P. As we will see in the next section, these normals will correspond to different extremal behaviors of the model. Towards this end, we define the following (closed, pointed) cones:

$$C_\emptyset = \mathrm{cone}\{(1,-2), (-1,0)\}, \qquad C_{\mathrm{complete}} = \mathrm{cone}\{(1,0), (-1,2)\},$$

$$C_U = \mathrm{cone}\{(-1,0), (-1,2)\}, \qquad C_L = \mathrm{cone}\{(1,0), (1,-2)\},$$

where, for A ⊂ R², cone(A) denotes the set of all conic (non-negative) combinations of the elements in A. It is clear that $C_\emptyset$ and $C_{\mathrm{complete}}$ are the normal cones to the points (0, 0) and (1, 1) of P. As for the other two cones, it is not hard to see that the sets of all normal rays to the edges of the upper, resp. lower, boundary of Pn over all n are dense in $C_U$, resp. $C_L$. As we will show in the next section, the regions $C_\emptyset$ and $C_{\mathrm{complete}}$ indicate directions, for large n, of statistical degeneracy towards the empty and complete graphs, respectively. On the other hand, $C_U$ and $C_L$ contain directions of non-trivial convergence to extremal configurations of maximal and minimal graph degeneracy. See Figure 3 and the middle and lower parts of Figure 4.

Figure 3: The green regions indicate directions of nontrivial convergence. The bottom-left and top-right red regions indicate directions towards the empty graph $\overline{K}_n$ and the complete graph Kn, respectively. (Horizontal axis: θ_E; vertical axis: θ_D.)

5 Asymptotical Extremal Properties of the ED

Model

In this section we will be describe the behavior ofdistributions from the ED model of the form Pn,�+rd,where d is a non-zero point in R2 and r a positivenumber. In particular, we will consider the case in whichd and � are fixed, but n and r are large (especiallyr). We will show that there are four possible typesof extremal behavior of the model, depending on d,the “direction” along which the distribution becomes

Page 11: SDM-Networks 2016 The Third SDM Workshop on …SDM-Networks 2016 The Third SDM Workshop on Mining Networks and Graphs: A Big Data Analytic Challenge May 7, 2016, Miami, Florida, USA

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

Edge

Deg

ener

acy

Figure 4: (Top) the sequence of normalized polytopes{Pn}n converges outwards, starting from P

3

in thecenter. (Middle) and (bottom) are the representativeinfinite graphs along points on the upper and lowerboundaries of P, respectively, depicted as graphons [15]for convenience.

extremal (for fixed n and as r grows unbounded). This dependence can be loosely expressed as follows: each d identifies one and only one value α(d) of the normalized edge-degeneracy pairs such that, for all n and r large enough, P_{n,θ+rd} will concentrate only on graphs whose normalized edge-degeneracy value is arbitrarily close to α(d).

In order to state this result precisely, we first need to make some elementary, yet crucial, geometric observations. Any d ∈ ℝ² defines a normal direction to one point on the boundary of P. Therefore, each d ∈ ℝ² identifies one point on the boundary of P, which we will denote α(d). Specifically, any d ∈ C_∅ is in the normal cone to the point (0, 0) ∈ P, so that α(d) = (0, 0) (the normalized edge-degeneracy of the empty graph) for all d ∈ C_∅ (and those points only). Similarly, any d ∈ C_complete is in the normal cone to the point (1, 1) ∈ P, and therefore α(d) = (1, 1) (the normalized edge-degeneracy of K_n) for all d ∈ C_complete (and those points only). On the other hand, if d ∈ int(C_L), then d is normal to one point on the lower boundary of P. Assuming without loss of generality that d = (1, a), α(d) is the point (x, y) along the curve {L(x), x ∈ [0, 1]} such that L′(x) = −1/a. Notice that, unlike the previous cases, if d and d′ are distinct points in int(C_L) that are not collinear, then α(d) ≠ α(d′). Analogous considerations hold for points d ∈ int(C_U): non-collinear points map to different points along the curve {U(x), x ∈ [0, 1]}.

With these considerations in mind, we now present our main result about the asymptotics of extremal properties of the ED model.
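The cone-membership question underlying the map d ↦ α(d) can be checked numerically: with two generators, d lies in the cone if and only if it is a non-negative combination of them. A minimal sketch, assuming a hypothetical generator pair (the pair below is for illustration only):

```python
import numpy as np

def in_cone(d, g1, g2):
    """Return True if d is a non-negative combination of generators g1, g2."""
    G = np.column_stack([g1, g2])   # 2x2 generator matrix
    coeffs = np.linalg.solve(G, d)  # solve a*g1 + b*g2 = d for (a, b)
    return bool(np.all(coeffs >= -1e-12))  # allow tiny numerical slack

# Illustrative generators (assumed for this sketch, not from the theorem).
g1, g2 = np.array([1.0, 0.0]), np.array([1.0, -2.0])

print(in_cone(np.array([2.0, -1.0]), g1, g2))  # → True
print(in_cone(np.array([0.0, 1.0]), g1, g2))   # → False
```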

Theorem 5.1. Let d ≠ 0 and consider the following cases.

• d ∈ int(C_∅). Then, for any θ ∈ ℝ² and arbitrarily small ε ∈ (0, 1), there exists an n(ε) such that for all n ≥ n(ε) there exists an r(ε, n) such that, for all r ≥ r(ε, n), the empty graph has probability at least 1 − ε under P_{n,θ+rd}.

• d ∈ int(C_complete). Then, for any θ ∈ ℝ² and arbitrarily small ε ∈ (0, 1), there exists an n(ε) such that for all n ≥ n(ε) there exists an r(ε, n) such that, for all r ≥ r(ε, n), the complete graph has probability at least 1 − ε under P_{n,θ+rd}.

• d ∈ int(C_L). Then, for any θ ∈ ℝ² and arbitrarily small ε, η ∈ (0, 1), there exists an n(ε, η) such that for all n ≥ n(ε, η) there exists an r(ε, η, n) such that, for all r ≥ r(ε, η, n), the set of graphs in G_n whose normalized edge-degeneracy is within η of α(d) has probability at least 1 − ε under P_{n,θ+rd}.

• d ∈ int(C_U). Then, for any θ ∈ ℝ² and arbitrarily small ε, η ∈ (0, 1), there exists an n(ε, η) such that for all n ≥ n(ε, η) there exists an r(ε, η, n) such that, for all r ≥ r(ε, η, n), the set of graphs in G_n whose normalized edge-degeneracy is within η of α(d) has probability at least 1 − ε under P_{n,θ+rd}.

Remarks. We point out that the directions along the boundaries of C_∅ and C_complete are not part of our results. Our analysis can also accommodate those cases, but in the interest of space, we omit the results. More importantly, the value of θ does not play a role in the limiting behavior we describe. We further remark that it is possible to formulate a version of Theorem 5.1 for each finite n, so that only r varies. In that case, by Proposition 4.1, for each n there will be only 2(n − 1) possible extremal configurations, aside from the empty and fully connected graphs. We have chosen instead to let n vary, so that we could capture all possible cases.

Proof. We only sketch the proof for the case d ∈ int(C_L), which follows easily from the arguments in [22], in particular Propositions 7.2 and 7.3 and Corollary 7.4. The proofs of the other cases are analogous. First, we observe that assumptions (A1)–(A4) from [19] hold for the ED model. Next, let n be large enough that d is not in the normal cone corresponding to the points (0, 0) and (1, 1) of P_n. Then, for each such n, d either defines a direction corresponding to the normal of an edge, say e_n, of the lower boundary of P_n, or d is in the interior of the normal cone to a vertex, say v_n, of P_n. Since P_n → P, n can be chosen large enough so that either the vertices of e_n or v_n (depending on which of the two cases we are facing) are within η of α(d).

Let us first consider the case when d is normal to the edge e_n of P_n. Since every edge of P_n contains only two realizable pairs of normalized edge count and graph degeneracy, namely its endpoints, using the result in [22] one can choose r = r(n, ε, η) large enough so that at least 1 − ε of the probability mass of P_{n,θ+rd} concentrates on the graphs in G_n whose normalized edge-degeneracy vector is one of the two vertices of e_n. The claim follows from the fact that these vertices are within η of α(d). For the other case, in which d is in the interior of the normal cone to the vertex v_n, the results in [22] again yield that one can choose r = r(n, ε, η) large enough so that at least 1 − ε of the probability mass of P_{n,θ+rd} concentrates on graphs in G_n whose normalized edge-degeneracy vector is v_n. Since v_n is within η of α(d), we are done.

The interpretation of Theorem 5.1 is as follows. If d is a non-zero direction in C_∅ or C_complete, then P_{n,θ+rd} will exhibit statistical degeneracy regardless of θ for large enough r, in the sense that it will concentrate on the empty or the fully connected graph, respectively. As shown in Figure 3, C_∅ and C_complete are fairly large regions, so that one may in fact expect statistical degeneracy to occur prominently when the model parameters have the same sign. The extremal directions in C_U and C_L instead yield non-trivial behavior for large n and r. In this case, P_{n,θ+rd} will concentrate on graph configurations that are extremal in the sense of exhibiting nearly maximal or minimal graph degeneracy given the number of edges.

Taken together, these results suggest that care is needed when fitting the ED model, as statistical degeneracy appears to be likely.

6 Discussion

The goal of this paper is to introduce a new ERGM and to demonstrate its statistical properties and asymptotic behavior as captured by its geometry. The ED model is based on two graph statistics that are not commonly used jointly and that capture complementary information about the network: the number of edges and the graph degeneracy. The latter extracts important information about the network's connectivity structure, namely its cores, and is often used as a descriptive statistic.

The exponential family framework provides a beautiful connection between the model geometry and its statistical behavior. To that end, we completely characterized the model polytope in Section 3 for finite graphs and in Section 4 for the limiting case as n → ∞. The most obvious implication of the structure of the ED model polytope is that the MLE exists almost surely for large graphs. Another is that it greatly simplifies the problem of projecting noisy data onto the polytope and finding the nearest realizable point, as one need only worry about the projection. Such projections play a critical role in data privacy problems, as they are used in computing a private estimator of the released data with good statistical properties; see [9]. In fact, our geometric results imply that [11, Problem 4.5] is easier for the ED model than for the beta model based on node degrees, which was solved in [8]. Finally, the structure of the polytope and its normal fan reveal various extremal behaviors of the model, discussed in Section 5.

Note that the two statistics in the ED model summarize very different properties of the observed graph, giving this seemingly simple model some expressive power and flexibility. In graph-theoretic terms, the degeneracy summarizes the core structure of the graph, within which there can be few or many edges (see [11] for details); combining it with the number of edges produces the Erdős–Rényi model as a submodel.

As discussed in Section 2, different choices of the parameter vector, that is, of the values of the edge-degeneracy pair, lead to rather different distributions: from sparse to dense graphs as both parameters are negative or positive, respectively, as well as graphs where edge count and degeneracy are balanced, for mixed-sign parameter vectors. Our results in Section 5 provide a catalogue of such behaviors in extremal cases and for large n. The asymptotic properties we derive offer interesting insights into the extremal asymptotic behavior of the


ED model. However, the asymptotic properties of non-extremal cases, that is, those of distributions of the form P_{n,θ} for fixed θ and diverging n, remain completely unknown. While this is an exceedingly common issue with ERGMs, whose asymptotics are extremely difficult to describe, it would nonetheless be desirable to gain a better understanding of the ED model when the network is large. In this regard, the variational approach put forward by [6], which provides a way to resolve the asymptotics of ERGMs in general, may be an interesting direction to pursue in future work.

References

[1] J. Bae and S. Kim, Identifying and ranking influential spreaders in complex networks by neighborhood coreness, Physica A: Statistical Mechanics and its Applications, 395 (2014), pp. 549–559.

[2] O. Barndorff-Nielsen, Information and Exponential Families in Statistical Theory, Wiley (1978).

[3] A. Bickle, The k-cores of a graph, Ph.D. dissertation, Western Michigan University (2010).

[4] L. D. Brown, Fundamentals of Statistical Exponential Families, IMS Lecture Notes, Monograph Series 9 (1986).

[5] S. Carmi, S. Havlin, S. Kirkpatrick, Y. Shavitt and E. Shir, A model of Internet topology using k-shell decomposition, Proceedings of the National Academy of Sciences, 104(27) (2007), pp. 11150–11154.

[6] S. Chatterjee and P. Diaconis, Estimating and understanding exponential random graph models, Annals of Statistics, 41(5) (2013), pp. 2428–2461.

[7] S. E. Fienberg and A. Rinaldo, Maximum likelihood estimation in log-linear models, Annals of Statistics, 40(2) (2012).

[8] A. Goldenberg, A. X. Zheng, S. E. Fienberg and E. M. Airoldi, A survey of statistical network models, Foundations and Trends in Machine Learning, 2(2) (2009), pp. 129–233.

[9] S. M. Goodreau, Advances in exponential random graph (p*) models applied to a large social network, Social Networks, 29(2) (2007), pp. 231–248.

[10] M. S. Handcock, Assessing degeneracy in statistical models of social networks, Working Paper no. 39, Center for Statistics and the Social Sciences, University of Washington (2003).

[11] V. Karwa, M. J. Pelsmajer, S. Petrovic, D. Stasi and D. Wilburne, Statistical models for cores decomposition of an undirected random graph, preprint arXiv:1410.7357 (v2) (2015).

[12] V. Karwa and A. Slavkovic, Inference using noisy degrees: differentially private β-model and synthetic graphs, Annals of Statistics, 44(1) (2016), pp. 87–112.

[13] V. Karwa, A. Slavkovic and P. Krivitsky, Differentially private exponential random graphs, Privacy in Statistical Databases, Lecture Notes in Computer Science, Springer (2014), pp. 142–155.

[14] M. Kitsak, L. K. Gallos, S. Kirkpatrick, S. Havlin, F. Liljeros, L. Muchnik, H. E. Stanley and H. Makse, Identification of influential spreaders in complex networks, Nature Physics, 6(11) (2010), pp. 880–893.

[15] L. Lovasz, Large Networks and Graph Limits, American Mathematical Society (2012).

[16] S. Pei, L. Muchnik, J. Andrade Jr., Z. Zheng and H. Makse, Searching for superspreaders of information in real-world social media, Nature Scientific Reports (2012).

[17] S. Petrovic, A survey of discrete methods in (algebraic) statistics for networks, Proceedings of the AMS Special Session on Algebraic and Geometric Methods in Discrete Mathematics, H. Harrington, M. Omar and M. Wright (eds.), Contemporary Mathematics, American Mathematical Society, to appear.

[18] A. Razborov, On the minimal density of triangles in graphs, Combinatorics, Probability and Computing, 17 (2008), pp. 603–618.

[19] A. Rinaldo, S. E. Fienberg and Y. Zhou, On the geometry of discrete exponential families with application to exponential random graph models, Electronic Journal of Statistics, 3 (2009), pp. 446–484.

[20] A. Rinaldo, S. Petrovic and S. E. Fienberg, Maximum likelihood estimation in the β-model, Annals of Statistics, 41(3) (2013), pp. 1085–1110.

[21] C. R. Shalizi and A. Rinaldo, Consistency under sampling of exponential random graph models, Annals of Statistics, 41(2) (2013), pp. 508–535.

[22] M. Yin, A. Rinaldo and S. Fadnavis, Asymptotic quantization of exponential random graphs, Annals of Applied Probability, to appear (2016).


Collaborative SVM Classification in Skewed Peer-to-Peer Networks

Umer Khan*        Alexandros Nanopoulos†        Lars Schmidt-Thieme

Abstract

Mining patterns from large-scale networks, such as Peer-to-Peer (P2P) networks, is a challenging task, because centralization of data is not feasible. The goal is to develop distributed mining algorithms that are communication efficient, scalable, asynchronous, robust to peer dynamism, and which achieve accuracy as close as possible to centralized ones. In this paper, we present a collaborative classification method based on Support Vector Machines for large-scale P2P networks. Most widely known P2P systems are prone to data or execution skew, either due to the emergence of small-world network topologies in their overlays, or due to hierarchical data placement requirements. With such an inherently skewed distribution of node links, and consequently of data, existing approaches that share local machine learning models only with immediate neighbors may deprive the vast majority of scarcely connected peers of knowledge that could improve their local prediction accuracy. The main idea of the proposed approach is to allow communication beyond the direct neighbors in a controlled way that improves the classification accuracy and, at the same time, keeps the communication cost low throughout the network. Besides using benchmark machine learning datasets for extensive experimental evaluation, we have evaluated the proposed method on music genre classification in particular, to exhibit its performance in a real application scenario. Our collaborative classification method outperforms baseline model propagation approaches by improving the overall classification performance substantially at the cost of a tolerable increase in communication.

Keywords: Distributed, Classification, P2P, SVM, Skewed

1 Introduction

1.1 Motivation: In recent years there has been increasing interest in analytical methods that perform data mining over large-scale data distributed over the peers of P2P networks such as eDonkey, BitTorrent and Gnutella, to support applications like distributed document classification for the automatic structuring of heterogeneous collections and Web directories, distributed recommendation systems, distributed intrusion detection, personalized media retrieval, online matchmaking, distributed spam classification, and many others. Data mining in large-scale distributed P2P networks is a challenging task, because centralization of data is not feasible. In P2P networks, computing devices may be connected to the network only temporarily, communication is unreliable and possibly bandwidth-limited, data and computational resources can be sparsely distributed, and the data collections evolve dynamically. A scheme that centralizes the data stored all over a P2P network is usually not feasible, because any change must be reported to the central peer (server), as it might alter the result significantly. Therefore, the goal is to develop distributed mining algorithms that are communication efficient, scalable, asynchronous, and robust to peer dynamism, and which achieve accuracy as close as possible to centralized ones.

*Information Systems and Machine Learning Lab, University of Hildesheim, Germany, {khan, schmidt-thieme}@ismll.de
†Data Scientist at a Private Organization, [email protected]

Existing research in data mining for P2P networks focuses on developing local classification or clustering algorithms which use primitive operations, such as distributed averaging, majority voting and other aggregate functions, to combine the results of local classifiers in order to establish a consensus of classification decisions among the peers in a local neighborhood of a P2P network [16]. [6] proposed a distributed decision tree for P2P networks based on distributed majority voting. Most of these locally synchronized algorithms are reactive in the sense that they tend to monitor every change in the data and keep track of data statistics. Recently, researchers have proposed methods to learn Support Vector Machine (SVM) classifiers in Peer-to-Peer networks, based on the idea of exchanging local SVM models with the immediate neighbors [18], [3].

One inherent problem of the aforementioned approaches is that they do not consider P2P networks with scale-free topology, where the distribution of peer links and the amount of data stored at peers follow a power law, i.e., P(d_p = x) ∼ x^(−a), where d_p refers to the degree of a peer. In particular, such networks are prone to execution skew and data skew. Execution skew refers to non-uniform data access across the partitions of peers, while data skew describes an uneven distribution of data across these partitions. Most widely known real-world P2P systems like FreeNet, Gnutella, BitTorrent, eDonkey, FastTrack, Gia, UMM,


Phenix are prone to data or execution skew due to the scale-free growth of the network topologies in their unstructured overlays [17], [2]. Another reason for the occurrence of such skews is the hierarchical peer and data placement rules or tendencies in most of these P2P networks, adopted mainly for efficient query routing, scalability, handling node dynamicity and fault tolerance [13]. In particular, this is the case with hierarchical P2P overlays with multiple levels categorizing the peers according to their heterogeneous routing roles and varying loads [12]. In such widely used network topologies, a group of few super-peers at the top levels serves a vast majority of regular or weak peers at the lower levels through routes spanning multiple levels and several hops. As a result, there are usually very few peers with many connections and large amounts of data, whereas a vast majority of peers have few connections and very small amounts of local data. Additionally, it has been demonstrated that the degree of a peer tends to correlate with the amount of data the peer contains [19], [20]. This correlation can be explained by the fact that in most hierarchical P2P networks, like KaZaA and Gnutella, peers tend to connect to super-peers, which offer faster responses to data queries with high bandwidth and show high availability. For example, in P2P song sharing, songs which are very popular are shared by many peers, but only very few peers share the vast majority of songs [11]. Since the probability of finding a specific song is much higher with these super-peers, they attract a lot of connections from regular peers. On the other hand, a vast majority of peers are free riders [9], i.e., they share almost no or very little data, and hence are allowed fewer connections. As a consequence, such regular peers usually do not have a sufficiently large local training data set and are unable to generate accurate classification models.
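The skew described above is easy to reproduce in simulation. The sketch below (an illustration under assumed parameters, not from the paper) samples peer degrees from a discrete power law P(d) ∼ d^(−a) and counts how few peers end up as well-connected "hubs":

```python
import random

def sample_powerlaw_degrees(num_peers, a=2.5, d_max=1000, seed=0):
    """Sample peer degrees from a discrete power law P(d) proportional to d^(-a)."""
    rng = random.Random(seed)
    degrees = list(range(1, d_max + 1))
    weights = [d ** (-a) for d in degrees]       # unnormalized power-law weights
    return rng.choices(degrees, weights=weights, k=num_peers)

degs = sample_powerlaw_degrees(10_000)
hubs = sum(1 for d in degs if d >= 10)           # "super-peer"-like high-degree peers
print(f"peers with degree >= 10: {hubs} of {len(degs)}")  # a small minority
```

With an exponent around 2.5, only a few percent of peers exceed degree 10, matching the picture of a handful of super-peers serving a majority of weak peers.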

Figure 1: Conceptual example of a P2P network with power-law topology and model exchange

The aforementioned idea is exemplified in Figure 1, which illustrates transitive model propagation in a P2P network topology that is scale-free. Consider the shadowed path in Figure 1, which contains peers {1, ..., 4}. Peer 1 is a super-peer, peer 2 has intermediate degree, whereas peers 3 and 4 are weak peers (in Figure 1, peers with higher degree are depicted darker and bigger, and those with lower degree are lighter in color and smaller). With a direct propagation scheme, each peer p exchanges its local model m_p with all its neighbors. Peer 1 shares its model m_1 with peer 2, which allows peer 2 to improve its local model m_2, because peer 1 is a super-peer and thus model m_1 is informative. However, a weak peer such as peer 4 would not benefit from this, because it exchanges models only with peer 3, which is also weak. Therefore, although a communication cost is paid to exchange models among weak peers, there will be no improvement in accuracy for those weak peers that are not connected to a super-peer. This problem becomes more severe because the local models of weak peers may originally (before any model exchange) be very inaccurate, so that classification using these models is meaningless (i.e., accuracy is close to random classification). As a result, the weak peers that comprise the vast majority of a P2P network may not be able to perform any meaningful classification. By allowing transitive propagation of models, the local models of super-peers (such as peer 1) are able to reach weak peers (such as peer 4) connected to them through paths of length larger than one. Based on these observations, the above-mentioned model propagation approaches [3], [18] may not scale well in inherently scale-free P2P networks, as exchanging models only with immediate neighbors prevents the informative knowledge of the few high-level super-peers, with their large numbers of peer connections and data collections, from reaching the vast majority of low-level regular or weak peers separated by multiple hops. With very few connections and not enough discriminative patterns, a large number of these regular peers are unable to learn effective classifiers, despite the communication cost of model exchange among them, as shown in Section 4.3.

1.2 Contribution and Outline: The focus of this paper is a collaborative classification method based on Support Vector Machines for decentralized hierarchical P2P networks with the above-mentioned topological characteristics. We chose the SVM because, for many datasets, it performs favorably against other classifiers. Moreover, it is a promising classifier based on convex quadratic programming, for which globally optimal solutions can be obtained. We propose a transitive model exchange method (TRedSVM) to learn SVM classifiers collaboratively among the peers in the


global P2P network topology. In contrast to existing approaches [3], [18], the proposed approach takes into account the scale-free topology of P2P networks and allows communication in a transitive way, i.e., beyond immediate neighbors. In this way, our method helps poorly connected peers receive the local models of other, better connected peers, in order to enhance the local models of the former, and thus to substantially improve classification accuracy over the entire P2P network. Since uncontrolled transitive propagation of classification models can increase the communication cost (eventually leading to flooding the network), TRedSVM keeps the communication under control, first by using a reduced variant of SVM called Reduced-SVM (RSVM) [15] to learn a local classification model at each peer, and second, and more importantly, by performing a weighted reduction of model sizes at each transitive step. Therefore, TRedSVM is able to transitively propagate the most significant knowledge in a compact form, resulting in a substantial increase in the overall classification accuracy of the network while keeping the communication cost low.

In order to assess the efficiency of TRedSVM in real applications, we have applied it to the music genre classification problem, besides using benchmark machine learning datasets for comparison with baseline model propagation schemes and data centralization approaches. As modern P2P systems rely on data replication to maintain high data availability and fast response times, we have evaluated the proposed method with data replication based on the relative popularity of data items. Our experimental results show that TRedSVM improves the overall classification accuracy of the network substantially at the cost of a tolerable increase in communication. It also provides a reasonable trade-off between communication cost and overall prediction accuracy, which allows it to compare favorably against existing baseline approaches.

The rest of this paper is organized as follows: Section 2 describes related work. Section 3 describes the proposed TRedSVM approach, followed by experiments and evaluations in Section 4 and conclusions at the end.

2 Related Work

The current state of the art in P2P data mining has evolved from earlier attempts to perform primitive operations (average, sum, max, random sampling) in a distributed setting. For example, [10] proposed a gossip-based randomized algorithm to compute aggregates (MAX) in an n-node overlay network. [14] presented distributed algorithms for effectively calculating basic statistics of data using the newscast model of computation. A main goal of these works is to lay a foundation for more sophisticated data mining algorithms and applications.

Recent approaches in P2P data mining focus on developing local classification or clustering algorithms which in turn make use of primitive operations such as distributed averaging, majority voting and other aggregate functions. The most representative work in this regard is distributed decision tree induction [6], distributed K-Means clustering [8] and distributed classification [16]. A common aspect of these approaches is that they make use of local algorithms, which compute their result using information from just a handful of nearby neighbors. A major limitation of such local algorithms is that, up until now, they have been conceived only to support primitive aggregate operations. A well-known local algorithm is Majority Voting [21]. In this algorithm, each peer maintains a local belief about an estimate of a global sum, based on the number of nodes in the network, to confirm whether a majority threshold is met. If not, the peer sends messages to its neighbors to re-estimate the global sum. Similar work was published by [16] to achieve consensus among neighboring peers using distributed plurality voting. Most of these locally synchronized algorithms are reactive in the sense that they tend to monitor every single change in the data and keep track of data statistics in their local neighborhood, which also requires extra polling messages for coordination.

Based on model propagation, another important recent work [18] proposed collaborative SVM model exchange for distributed classification in P2P networks. Local SVM models are learned at each peer and exchanged with the immediate neighbors; all received models are merged to build a final classifier. A similar method is proposed in [3], cascading SVM models among immediate neighbors in a communication-efficient way. Despite the several suitable characteristics of such methods, their major limitation is that they assume a uniform and controlled placement of peers and data in P2P networks, and exchange local SVM models only with immediate neighbors. As discussed in Section 1, these approaches may not scale well in most real-world P2P networks, which follow a hierarchical network topology with a highly skewed distribution of peer degrees.

3 Methodology: Transitive Reduced-SVM

3.1 Basic Concepts and Notations: More formally, we consider an ad-hoc P2P network comprising a set of k autonomous peers P = {1, 2, ..., k}. The topology of the P2P network is represented by a (connected) graph G(P, E), in which each peer p ∈ P is represented by a vertex, and an edge {p, q} ∈ E, where E ⊆ {{p, q} : p, q ∈ P}, exists whenever peer p is connected to peer q. Let N_p denote the set of immediate neighbors of p, i.e., N_p = {q ∈ P : q ≠ p, {p, q} ∈ E}. The local training data set of a peer p is denoted X_p ⊆ ℝ^(d+1), where d is the number of data features. Based on the local training data set X_p, each peer p can first build its local classification model m_p. Since exchanging classification models incurs additional communication cost, what is required is the construction of local classification models that are both accurate and compact, i.e., that can be represented with a small amount of information to be transmitted between neighboring peers. To attain this, we use a variant of SVM called Reduced-SVM (RSVM) [15]. Compared to the standard SVM, RSVM has been reported to reduce the number of generated support vectors significantly while maintaining very good classification accuracy. A brief description of RSVM is given in the next section; interested readers are referred to [15] for details.

3.2 Building Local Classifier: As the initialization step of the proposed method, an RSVM classification model is built from the local training data at each peer. For notational convenience, we here write the local training data as X instead of X_p. Given is a training set of instance-label pairs {(x_j, y_j)}, j = 1, ..., |X|, where x_j ∈ ℝ^d is an input vector and y_j ∈ {−1, 1} is the corresponding class label. RSVM starts by adding the term γ²/2 to the SVM objective function, and solves the following optimization problem:

(3.1)    min_{(w, γ, ξ)}  (1/2) (‖w‖₂² + γ²) + (C/2) ‖ξ‖₂²

(3.2)    subject to  D (K(X, Xᵀ) w − 1γ) ≥ 1 − ξ

where C > 0 balances the training error and the regularization term in the objective function. The weight vector w = (w_1, w_2, ..., w_{|X|}) is perpendicular to the hyperplane that separates the two classes, and γ is the intercept term. D is an |X| × |X| diagonal matrix with D_jj = y_j, specifying the target class of each instance. K is the kernel function, which computes the dot product between input vectors in a mapped feature space: K(X, Xᵀ) is an |X| × |X| matrix such that K(x_i, x_jᵀ) ≡ φ(x_i)ᵀ φ(x_j), where φ(x) maps x into a higher (possibly infinite) dimensional space. Solving this optimization for massive data sets by computing the full dense non-linear kernel matrix in (3.2) results in unwieldy storage and computation overhead. To avoid

this overhead, RSVM replaces the full kernel matrix K(X, Xᵀ) with the reduced matrix K(X, X̄ᵀ) ∈ ℝ^{|X|×r}, where X̄ consists of r random instances from X, with r ≪ |X| and |X| being the number of instances in X. Subsequently, the RSVM classifier is learned by solving the following unconstrained minimization:

(3.3)    min_{(w, γ) ∈ ℝ^{r+1}}  (1/2) (‖w‖₂² + γ²) + (C/2) ‖ (1 − D (K(X, X̄ᵀ) w − 1γ))₊ ‖₂²

Note that the whole problem, including the coefficients w of the separating hyperplane and the size of the data subset X̄ used to represent this hyperplane, is reduced to a positive definite linear system of size r. This size can be specified via the subset ratio η of the local data X to be used for RSVM, such that r = ⌈|X| × η⌉.

3.3 TRedSVM - Transitive Reduced-SVM Approach: The pseudo-code of the asynchronous TRedSVM algorithm is given in Algorithm 1, and a list of symbols used is provided in Table 1. TRedSVM starts by learning a local RSVM model mp at each peer p in the first phase, using a predefined reduction parameter η initialized by the network for all peers globally. The subset-ratio η directly affects the size of the resulting model mp, since it selects the random subset of the instances Xp used to learn a nonlinear hyperplane represented by support vectors, as described in Equation (3.3). A large value of η results in a potentially larger model size, and vice versa. Then, each peer p, using its neighbor list Np, propagates the local model mp to all immediate neighbors.

Each peer p keeps a map MAPp from each model to its sender, and a local buffer RECp for received models. Moreover, to receive models, a peer waits for time t until the models mq from all neighbors q ∈ Np have been received. Once all neighboring models have been received, in the transitive phase, each peer p reduces each of the received models mq using RSVM with its local weighted reduction parameter ηp (line 25 of Algorithm 1).1 This step ensures that the size of each received model is further reduced and that only the most statistically significant instances reach the weak peers. As described in Section 1, there is a strong correlation between the degree of a node and the size of the data it carries; in this work we therefore choose the value of ηp weighted by the hubness level of peer p. Each peer p is assigned a hubness level according to its proxy score h(p), defined as h(p) = (dp − μd)/σd, where dp is the degree of peer p, and μd and σd are the mean and standard deviation of the degrees of all

1 Parameters like η, ηp and Np are network parameters provided to each joining peer by its associated super-peer.


Table 1: List of symbols in TRedSVM

Symbol  Meaning
Xp      Local data at peer p
Np      Neighbor list of peer p
mp      Local model of the p-th peer, learned using RSVM
MAPp    Maps a model to its sender identifier
RECp    Local buffer to keep received models of any peer n (n ≠ p)
η       Subset ratio (global) of data used to learn local RSVM models
ηp      Subset ratio (local) of data for transitive reduction at peer p, defined by its hubness level

the peers in the network, respectively. Hubness levels result from organizing the proxy scores of all peers into a small number of groups (in our experiments we used 5 hubness levels, described in Table 2 in Section 4.1). Hence, in TRedSVM, peers with higher hubness levels, such as super-peers, perform less reduction in model size in the transitive step, so that they forward a larger chunk of this useful information to their associated weak peers. On the other hand, weak peers with low hubness levels have insufficient data to learn useful models; they use a smaller ηp to perform a larger reduction and propagate a much smaller portion of possibly redundant knowledge, since there is a high probability that super-peers already have the data points that a weak peer wants to share. After reducing each received model to only its most significant support vectors in the second phase, for each of these models in RECp, peer p queries all its neighbors q ∈ Np with q ≠ n. In response, if q does not have the queried model, or has one whose size differs by more than the threshold δ, it asks p to send this model; otherwise, peer q sends a decline. The optimal value of δ is selected empirically; in our experiments we used δ = 50. Again, this policy ensures that a peer only keeps the more informative models routed to it by stronger peers. On receipt, each peer adds the model to its buffer and updates the map entry MAPp(n). This process of reduction-based relearning and controlled propagation continues in the second phase at each peer until it has received, from all neighbors, the subsequent stronger models it should obtain. Once propagation to and receipt from all neighbors is finished, each peer p merges the instances xj of all received models MAPp(n), n ∈ P, from q ∈ Np into its local training data and learns a final SVM model mp. Such a model propagation scheme also accounts for peer dynamism, i.e., peers can join and leave the network at any time. Even if a peer goes offline, its model still exists on other peers of the network. This allows maintaining high classification accuracy by sharing the models of peers that were not present in the network at the same time. Finally, it has to be mentioned that although TRedSVM performs transitive propagation of models, it is still a local algorithm, in the sense that each peer performs computation using only information provided by its neighboring peers. Therefore, the resources required by TRedSVM are independent of the total number of peers.

Algorithm 1 TRedSVM for Classification at peer p

Require: Local training data Xp; t = time to wait between retraining; Np = list of neighbors; η = subset ratio to learn the peer's local model; ηp = subset ratio defined by hubness level, used for transitive reduction
Ensure: Updated model mp

1:  mp := RSVM(Xp, η)
2:  MAPp := {(p, mp)}
3:  RECp := ∅
4:  ▷ Communication and model-update threads execute in parallel
5:  loop {communication thread}
6:    event := receive_event()
7:    if type(event) == OFFER_MODEL then
8:      (n, size_n) := args(event)
9:      if |MAPp(n)| == 0 or size_n > |MAPp(n)| + δ then
10:       send REQUEST_MODEL(n) to sender(event)
11:     end if
12:   else if type(event) == REQUEST_MODEL then
13:     n := args(event)
14:     send MODEL_PARAMETERS(n, MAPp(n)) to sender(event)
15:   else if type(event) == MODEL_PARAMETERS then
16:     (n, m) := args(event)
17:     RECp(n) := m
18:   end if
19: end loop
20: loop {model-update thread}
21:   start_time := current_time()
22:   while current_time() − start_time < t do
23:     if there exists n with |RECp(n)| > 0 then
24:       Xp := (Xp \ MAPp(n)) ∪ RECp(n)
25:       MAPp(n) := RSVM(RECp(n), ηp)
26:       RECp(n) := ∅
27:       send OFFER_MODEL(n, |MAPp(n)|) to all q ∈ Np
28:     end if
29:   end while
30:   mp := RSVM(Xp, η)
31:   use mp
32: end loop
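The communication-thread branches of Algorithm 1 (lines 7-18) can be sketched as plain message handlers. This is an illustrative single-threaded skeleton with hypothetical names (`Peer`, `send`, and the model representation are our stand-ins, not the authors' Java simulator):

```python
class Peer:
    """Minimal sketch of Algorithm 1's message handling at one peer.
    Models are represented as plain lists of support-vector instances;
    `send(to, msg)` is a placeholder for the network layer."""

    def __init__(self, pid, neighbors, delta=50):
        self.pid = pid
        self.neighbors = neighbors
        self.delta = delta   # size-similarity threshold from Section 3.3
        self.MAP = {}        # sender id -> model currently held for it
        self.REC = {}        # sender id -> freshly received model

    def on_offer_model(self, sender, n, size_n, send):
        # lines 8-11: request only if the model is unknown here,
        # or sufficiently larger than the copy we already hold
        held = len(self.MAP.get(n, ()))
        if held == 0 or size_n > held + self.delta:
            send(sender, ("REQUEST_MODEL", n))

    def on_request_model(self, sender, n, send):
        # lines 13-14: ship our copy of peer n's model back
        send(sender, ("MODEL_PARAMETERS", n, self.MAP[n]))

    def on_model_parameters(self, n, model):
        # lines 16-17: buffer the received model for the update thread
        self.REC[n] = model
```

The model-update thread of Algorithm 1 would then periodically drain `REC`, re-reduce each buffered model with RSVM at ratio ηp, and re-offer it to the neighbors.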

4 Performance Evaluation

In this section, we first describe the experimental setup,after which experimental results are presented, andfinally we discuss these results.

4.1 Experimental Setup


Topology and Simulation Environment: To generate a multi-level hierarchical P2P network with a power-law distribution of peers, we used the Barabási-Albert model [4] with preferential connectivity and incremental growth, where a joining peer p connects to an existing peer q with probability P(p, q) = dq / Σk∈V dk, where dq is the degree of the existing node q, V is the set of nodes that have already joined the network, and Σk∈V dk is the sum of the degrees of all nodes that previously joined the network. For this purpose we used BRITE, a network topology generation framework developed at Boston University [1]. The result is a topology represented by a weighted graph whose edge weights represent communication delays in milliseconds. We evaluated our experiments with a number of peers varying from 100 to 500. We did not experiment with more peers, since this would result in unrealistically small local data sets at the peers and could adversely affect the performance of P2P classification systems. For local computation at each peer and monitoring of message exchange, we built our own simulator (in Java) that simulates distributed dynamic P2P overlay trees. We used the communication delays on the BRITE network graph as the measurement of time. Local computations and message exchanges are regulated with changes of network state, with respect to a common global clock of the simulator.
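The preferential-attachment rule above can be sketched as follows; this is a simplified stand-in for the BRITE generator (function names and the m parameter, the number of links per joining peer, are ours):

```python
import random

def barabasi_albert(n_peers, m=2, seed=0):
    """Grow a graph where a joining peer p links to existing peer q with
    probability P(p, q) = d_q / sum_k d_k (degree-proportional)."""
    rng = random.Random(seed)
    # start from a small fully connected core of m + 1 peers
    edges = {(i, j) for i in range(m + 1) for j in range(i)}
    # one slot per edge endpoint, so sampling a slot uniformly
    # is exactly degree-proportional sampling of a peer
    slots = [v for e in edges for v in e]
    for p in range(m + 1, n_peers):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(slots))
        for q in targets:
            edges.add((p, q))
            slots += [p, q]
    return edges

def degrees(edges, n):
    """Degree sequence of the generated topology."""
    d = [0] * n
    for u, v in edges:
        d[u] += 1
        d[v] += 1
    return d
```

The resulting degree sequence is heavily skewed, which is what induces the super-peer / weak-peer structure that TRedSVM exploits.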

Experiments were conducted on a cluster of 41 machines, each with 10 Intel Xeon 2.4 GHz processors and 20 GB of RAM, connected by gigabit Ethernet. For learning local classifiers, we used the LIBSVM [7] implementation of RSVM. Moreover, to compare the number of support vectors generated by RSVM with that of a regular SVM, the C-SVC implementation of regular SVM was used. For RSVM, we used the RBF kernel, and for each data set the hyper-parameters γ and C were found empirically using the model selection tool provided by LIBSVM (covertype: γ = 2^−7, C = 2^−0.1; cod-rna: γ = 2^−2, C = 2^−2).

Data Sets: We used existing benchmark machine learning data sets, which are among the largest in the corresponding widely used repositories, to evaluate the performance of our method against the stated baselines. In particular, we used the:

• covertype (581012 × 54, 7 classes) data set from the UCI repository,

• cod-rna (488565 × 8, 2 classes) data set from the LIBSVM repository [7], and

• Million Song Dataset (1,000,000 × 33, 10 classes) data set from the Million Song Dataset website [5].

Data Distribution and Hubness Levels: TRedSVM considers a probabilistic distribution of the training data such that the size |Xp| of the data on a peer p is directly proportional to its degree dp: |Xp| = |X| · dp / Σk dk, where |X| is the size of the original (centralized) data set. Figure 2 depicts the highly skewed distribution of the sizes of the local training sets among all peers.

Figure 2: Data distribution among peers in the TRedSVM simulation

Based on the resulting distribution of degrees and data among peers, TRedSVM can consider several hubness levels, as discussed in Section 3.3. In our evaluation we selected five levels, ranging from super-peers (Level-1) to weak peers (Level-5). Peers in each hubness level used the corresponding ηp value assigned to their level to learn an RSVM model during the transitive propagation phase. Table 2 lists the ηp values assigned to the peers in each hubness level based on their hubness score. Note that the higher the hubness level (closer to Level-1), the higher the ηp value.

Table 2: Grouping of peers into hubness levels.

Hubness level  Degree of peers   Percentage of SVs
Level-1        di ≥ 20           ηp = 0.1
Level-2        15 ≤ di < 20      ηp = 0.075
Level-3        10 ≤ di < 15      ηp = 0.05
Level-4        5 ≤ di < 10       ηp = 0.025
Level-5        di < 5            ηp = 0.01
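The degree-proportional data split and the Table 2 mapping can be sketched together as follows; this is illustrative code with our own names, not the authors' simulator:

```python
import statistics

# (min_degree, eta_p) pairs taken from Table 2, highest level first
LEVELS = [(20, 0.1), (15, 0.075), (10, 0.05), (5, 0.025), (0, 0.01)]

def data_share(total, degs):
    """|X_p| = |X| * d_p / sum(d): instances assigned to each peer
    (integer division, so a few remainder instances are dropped)."""
    s = sum(degs)
    return [total * d // s for d in degs]

def eta_p(degree):
    """Map a peer's degree to its transitive-reduction ratio (Table 2)."""
    for min_deg, eta in LEVELS:
        if degree >= min_deg:
            return eta
    return LEVELS[-1][1]

def hubness_score(degree, degs):
    """Proxy score h(p) = (d_p - mu_d) / sigma_d from Section 3.3."""
    mu = statistics.mean(degs)
    sigma = statistics.pstdev(degs)
    return (degree - mu) / sigma
```

A super-peer (positive hubness score, degree ≥ 20) thus keeps ηp = 0.1 and forwards relatively rich models, while a weak peer (degree < 5) reduces received models down to ηp = 0.01 before re-offering them.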

4.2 Music Genre Classification: Music genres are considered important metadata for retrieving songs in the field of music information retrieval. The Million Song Dataset (MSD) is the largest publicly available dataset consisting of audio features and metadata for a million contemporary popular music tracks [5]. However, none of the songs in MSD have genre labels. In order to evaluate TRedSVM in a real application scenario, we applied it to the task of automatically classifying the songs in MSD by genre. Artist tags in MSD are used to describe genres and to extract simple features from the tracks of these artists. To select standardized, descriptive and consistent tags, we


have chosen the following 10 tags from the top 50 most popular MusicBrainz tags: classic pop and rock, folk, dance and electronica, jazz and blues, soul and reggae, punk, metal, classical, pop, hip-hop. For features, we used the following ones from The Echo Nest [5]: loudness, tempo, time signature, key, mode, duration, and the average and variance of the timbre vectors. As described in the following section, TRedSVM significantly improves the percentage of songs with correctly classified genres compared to the baseline approaches. Although existing approaches in music information retrieval have shown that audio features, when combined with lyrics and social tags, can help classify song genres more accurately, to stay within the scope of this paper we have only used simple audio and timbre related features.

4.3 Experimental Results: The performance of TRedSVM is evaluated in terms of classification accuracy and communication cost, the two most significant aspects of the classification problem in P2P networks. Comparison is conducted against the state-of-the-art approaches of model propagation among neighboring peers, proposed as AllCascade by [3] and CSVM by [18]. In the results below, we refer to these approaches as Direct Neighbors Exchange. To illustrate an upper bound on communication cost, we also performed experiments with the network Flooding baseline, which performs transitive propagation without controlling the communication cost, i.e., without performing any reduction on the propagated models. The effectiveness of transitive reduction w.r.t. Flooding, even without querying to avoid redundant model exchange among peers, is also presented as a variant called Uncoordinated-TRedSVM. Finally, to demonstrate the effectiveness of model exchange, we also consider a baseline method that learns a Local Model for classification without any model exchange. In our experiments, we distributed the test data both uniformly across all peers and using the same skewed distribution as the training data. The results from both types of experiments did not show any significant change in classification accuracy.

Network-wide Classification Accuracy: We first focus on the average classification accuracy of the whole network. Figure 3 depicts the resulting accuracy for the examined methods and the three data sets. As expected, the simple Local Model baseline, which involves no exchange of models (Figure 3), is clearly outperformed by all other methods. TRedSVM compares favorably to the method that involves model exchange only among direct neighbors. As expected, the best accuracy is achieved by naive Flooding, but at an infeasible communication cost, as discussed below.

Figure 4: Relative communication cost of TRedSVM compared with the methods of direct exchange and flooding (y-axis: relative communication cost; x-axis: covertype, cod-rna, MSD; series: Direct Neighbor Exchange, TRedSVM, Broadcast Best Model, Uncoordinated-TRedSVM, Flooding)

Communication Cost: Figure 4 illustrates the communication cost incurred by the four examined methods; the simple baseline involving no exchange of models has, by definition, no communication cost. The costs are relative, normalized by the cost of Direct Neighbors Exchange. Though Flooding achieved the best accuracy, Figure 4 clearly shows that flooding requires a prohibitive communication cost when the data becomes huge; thus, it cannot be a feasible method in real P2P networks. TRedSVM requires a communication cost that is substantially less than that of flooding and only marginally larger than that of the method involving model exchange only between direct neighbors. The cost incurred by Uncoordinated-TRedSVM demonstrates that transitive reduction, even without coordination, does substantially reduce the cost compared to simple flooding. We also evaluated a possible scenario in which we broadcast the most accurate model (possibly from super-peers) to all peers in the network. For example, for the covertype data, we considered the model with the best accuracy (73.3%), belonging to a super-peer. The size of this model (in terms of number of support vectors) was 3911. This simulation comprised 177 peers, including the super-peer in consideration. Broadcasting this model to the other 176 peers (3911 × 176 support vectors) still incurred twice the relative communication cost of TRedSVM, as depicted in Figure 4.

Classification Accuracy per Hubness Level: To provide further insight into the performance of TRedSVM, we compared the average classification accuracy of all examined approaches separately for the peers of each hubness level. Figure 5 shows that for super-peers (Level-1) all methods attain comparable accuracy. This is natural to expect, since super-peers have adequately large training sets and thus gain nothing from model exchange. This also explains the good


Figure 3: Average prediction accuracy of TRedSVM as compared with Local Models, Direct Exchange Method and Flooding (panels: covertype, cod-rna, MSD; y-axis: average prediction accuracy of the network (%); series: Local Model, Direct Neighbor Exchange, TRedSVM, Uncoordinated-TRedSVM, Flooding)

Figure 5: Average classification accuracy of peers at each hubness level and for each method (panels: covertype, cod-rna, MSD; x-axis: Level-5 to Level-1; y-axis: prediction accuracy (%); series: Local Model, Direct Neighbor Exchange, TRedSVM, Uncoordinated-TRedSVM, Flooding)

performance of the simple baseline (Local Model) that involves no exchange of models. However, for peers at lower levels, and especially for weak peers (Level-5), TRedSVM provides important improvements in accuracy. It is worth noting that, in the case of the covertype data set, at weak peers (Level-5) the simple baseline (Local Model) achieves accuracy close to that of a constant classifier. The method that performs exchange only between direct neighbors attains accuracy only slightly better than random. TRedSVM, however, helps the vast majority of weak peers to improve their accuracy nearly as much as possible, as shown by the comparison with the upper bound of the flooding method.

Accuracy vs. Communication Cost Trade-off: Finally, we examine the impact of the parameter η at each transitive step in TRedSVM. Figure 6 depicts, in a combined way, prediction accuracy and communication cost with respect to the percentage of support vectors used in RSVM. As expected, communication cost increases steadily with increasing η values. Prediction accuracy also increases with increasing η values, but only up to a threshold (η = 10 for the covertype and cod-rna data), after which accuracy stays steady. This clearly validates the intuition behind the principle we follow in TRedSVM, i.e., the need to keep the size of exchanged models under control, because otherwise communication cost increases rapidly without any gain in accuracy.

Due to space limitations, we could not include our experimental results for the data replication and the data and model centralization schemes, which also exhibit the effectiveness of the proposed method. These results will be provided in the extended version of the paper, or in an appendix if required.

5 Discussion and Conclusion

The first important conclusion from the presented experimental results is that, for P2P systems whose topology follows a skewed distribution of peer degrees and data, the transitive propagation of TRedSVM helps the vast majority of weak peers to substantially improve their classification accuracy. Without any exchange of models between peers (the simple baseline called Local Model), or with a restricted exchange (Direct Neighbors Exchange), weak peers have little chance of receiving informative models from stronger peers that hold large numbers of informative instances. In model propagation approaches to distributed classification in P2P networks, the trade-off between prediction accuracy and communication cost is a crucial and challenging reality. The results in Figure 3 and Figure 4 illustrate lower and upper bounds on accuracy and communication cost, respectively. Direct Neighbors Exchange exhibits a lower bound on communication cost, while Flooding yields an upper bound on prediction accuracy. TRedSVM fits well within these two bounds, incurring a slightly larger


Figure 6: How prediction accuracy and communication cost vary with respect to η (panels: covertype, cod-rna; x-axis: percentage of data used as support vectors (η); blue: prediction accuracy (%); red: communication cost)

communication cost than the lower bound, and a cost smaller by large magnitudes than that of infeasible naive flooding. Moreover, controlling the model sizes with respect to ηp at the transitive steps is a useful idea, since beyond a certain threshold of model size accuracy does not improve further, and communication cost is paid for no reason, as illustrated in Figure 6. Therefore, optimizing the trade-off between accuracy and communication cost makes TRedSVM scalable in large P2P networks.

References

[1] A. Medina, A. Lakhina, I. Matta, and J. Byers, BRITE: Boston University representative internet topology generator, 2002.

[2] N. Andrade, E. Santos-Neto, F. Brasileiro, andM. Ripeanu, Resource demand and supply in bit-torrent content-sharing communities, Computer Net-works, 53 (2009), pp. 515–527.

[3] H. H. Ang, V. Gopalkrishnan, S. C. Hoi, andW. K. Ng, Classification in p2p networks with cas-cade support vector machines, ACM Transactions onKnowledge Discovery from Data (TKDD), 7 (2013),p. 20.

[4] A.-L. Barabási and R. Albert, Emergence of scaling in random networks, Science, 286 (1999), pp. 509–512.

[5] T. Bertin-Mahieux, D. P. Ellis, B. Whitman, andP. Lamere, The million song dataset, in Proceedingsof the 12th International Conference on Music Infor-mation Retrieval (ISMIR 2011), 2011.

[6] K. Bhaduri, R. Wolff, C. Giannella, andH. Kargupta, Distributed decision-tree induction inpeer-to-peer systems, Stat. Anal. Data Min., 1 (2008),pp. 85–103.

[7] C.-C. Chang and C.-J. Lin, LIBSVM: A library for support vector machines, ACM Transactions on Intelligent Systems and Technology, 2 (2011), pp. 27:1–27:27.

[8] S. Datta, C. Giannella, and H. Kargupta, Ap-proximate distributed k-means clustering over a peer-to-peer network, IEEE Trans. on Knowl. and DataEng., 21 (2009), pp. 1372–1388.

[9] M. Karakaya, I. Korpeoglu, and O. Ulusoy,Counteracting free riding in peer-to-peer networks,Comput. Netw., 52 (2008), pp. 675–694.

[10] D. Kempe, A. Dobra, and J. Gehrke, Gossip-basedcomputation of aggregate information, in Proceedingsof the 44th IEEE FOCS, 2003.

[11] N. Koenigstein, Y. Shavitt, E. Weinsberg, andU. Weinsberg, On the applicability of peer-to-peerdata in music information retrieval research, in Pro-ceedings of the 11th ISMIR 2010, pp. 273–278.

[12] D. Korzun and A. Gurtov, Survey on hierarchicalrouting schemes in flat distributed hash tables, Peer-to-Peer Networking and Applications, 4 (2011), pp. 346–375.

[13] D. Korzun and A. Gurtov, Hierarchical architec-tures in structured peer-to-peer overlay networks, Peer-to-Peer Networking and Applications, (2013), pp. 1–37.

[14] W. Kowalczyk, M. Jelasity, and A. E. Eiben, Towards data mining in large and fully distributed peer-to-peer overlay networks, in Proceedings of BNAIC'03, 2003, pp. 203–210.

[15] Y.-J. Lee and O. L. Mangasarian, Rsvm: Reducedsupport vector machines., in SDM, vol. 1, SIAM, 2001,pp. 325–361.

[16] P. Luo, H. Xiong, K. Lu, and Z. Shi, Distributedclassification in peer-to-peer networks, in Proceedingsof the 13th ACM SIGKDD, ACM, 2007, pp. 968–976.

[17] A. Malatras, State-of-the-art survey on p2p overlaynetworks in pervasive computing environments, Jour-nal of Network and Computer Applications, (2015).

[18] O. Papapetrou, W. Siberski, and S. Siersdorfer, Efficient model sharing for scalable collaborative classification, Peer-to-Peer Networking and Applications, 8 (2015), pp. 384–398.

[19] S. Saroiu, K. P. Gummadi, and S. D. Gribble,Measuring and analyzing the characteristics of nap-ster and gnutella hosts, Multimedia Syst., 9 (2003),pp. 170–184.

[20] D. Stutzbach, S. Zhao, and R. Rejaie, Character-izing files in the modern gnutella network, MultimediaSystems, 13 (2007), pp. 35–50.

[21] R. Wolff, K. Bhaduri, and H. Kargupta, Local l2thresholding based data mining in peer-to-peer systems,in SIAM SDM, 2006, pp. 430–441.


Towards Scalable Graph Analytics on Time Dependent Graphs

Suraj Poudel∗   Roger Pearce†   Maya Gokhale†

Abstract

In many graph applications, the structure of a graph changes dramatically as a function of time. Such applications require temporal analytics to fully characterize the underlying data. Significant research into scalable analytics of static graphs (where the structure remains fixed) on High Performance Computing (HPC) systems has enabled new capabilities, but solutions for scalable analysis of temporal graphs are lacking. We present our ongoing work to annotate a scale-free graph in distributed memory with temporal metadata to model a time dependent graph. We demonstrate time dependent graph analytics, and provide initial scalability results on a data-intensive HPC cluster.

Our approach allows an edge to have local access to its temporal metadata and thus facilitates algorithms that make various temporal decisions, e.g., whether to accept or discard the current edge at the current time. We performed experiments with large scale internet traffic flows extracted from CAIDA datasets on a data-intensive HPC cluster (Catalyst at LLNL). Using this dataset, graph representation, and the HavoqGT [1] framework, we successfully implemented and evaluated time dependent Connected Vertices, Single Source Shortest Path (TD-SSSP) and Betweenness Centrality. Finally, Betweenness Centrality was used to test the scalability of the framework up to 128 compute nodes of Catalyst.

Keywords: Time Dependent Graph, Graph Analytics,Distributed Graph, HPC.

1 Introduction

Time dependent graphs, also referred to as temporal graphs or time varying graphs, include temporal information in addition to the topological structure. Edges in time dependent graphs may not exist continuously and are a function of time, unlike static graphs, where only a single topological structure exists. Systems with a dynamic topological structure are commonly represented as aggregated graphs, e.g., as weighted graphs where the weight is the number of times an edge occurs, or the percentage of the total time the edge exists.

∗University of Alabama at Birmingham, [email protected]

†Lawrence Livermore National Lab, {rpearce,maya}@llnl.gov

The aggregated representation is used, in part, because many tools are available for analyzing static graphs, while there is an absence of solutions for large temporal graphs. A recent review of temporal networks [2] discusses the importance of, and methods for, analyzing temporal structure and models in order to understand their impact on the behavior of dynamical systems.

Analytics on time dependent graphs has been applied in many fields. Some examples include management and planning of complex networks [3, 4], designing strategies for containing the spread of malware on mobile devices [5], and analysis of the circadian patterns of Wikipedia editorial activity to estimate the geographical distribution of editors [6]. With temporal information, existing graph analytics can be processed with respect to global time parameters such as start time, the time at which a traversal begins from a vertex, and/or waiting time, the upper bound on the time a traversal can halt at a vertex before continuing to its neighbors, as shown in [7]. As an example, time dependent single source shortest path (TD-SSSP) in a road network could yield the optimal start time for fuel-efficient travel from a source to a destination, including the amount of time to wait in a city before proceeding to the next. The temporal information can also be utilized for other purposes, e.g., decaying weights or establishing probabilities over time. In this work, we experimented with similar analytics with respect to these time parameters on large scale-free distributed graphs, as an illustration of the capability of the framework we have developed.
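The start-time / waiting-time traversal described above can be sketched as an earliest-arrival variant of Dijkstra's algorithm over time-stamped edges. This is a toy illustration with our own edge representation, not the HavoqGT implementation:

```python
import heapq

def td_sssp(edges, source, start_time, max_wait):
    """Earliest-arrival times in a time dependent graph.
    edges: dict vertex -> list of (neighbor, depart_time, travel_time).
    An edge can be taken only if we arrive at the vertex no later than
    depart_time and wait there at most max_wait."""
    arrival = {source: start_time}
    heap = [(start_time, source)]
    while heap:
        t, u = heapq.heappop(heap)
        if t > arrival.get(u, float("inf")):
            continue  # stale heap entry
        for v, depart, travel in edges.get(u, []):
            # edge exists only at its departure time
            if t <= depart <= t + max_wait:
                t_v = depart + travel
                if t_v < arrival.get(v, float("inf")):
                    arrival[v] = t_v
                    heapq.heappush(heap, (t_v, v))
    return arrival
```

Vertices absent from the returned dictionary are unreachable under the given start time and waiting-time bound, which is how the waiting-time parameter prunes the traversal.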

Significant research in the HPC community has focused on the analysis of large static graphs [8, 9, 10], fueled in large part by the introduction of the Graph500 benchmark [11]. Much of this work has focused on processing the largest graph topology possible on some of the world's largest supercomputers. However, little attention has been given to the processing of graph metadata (annotations on vertices and edges) or the temporal nature of many graph datasets.

This work extends HavoqGT [1] to add temporalmetadata to the edges in the existing partitioned-graphtopology and analyze the resulting metadata-annotatedgraph with various time dependent algorithms. We aimto demonstrate a new set of capabilities for analyzing


large scale temporal graphs on modern HPC clusters by performing scaling studies with existing time dependent analytics. Our main contributions in this work are:

1. Extending HavoqGT to support the implementation of time dependent graph algorithms;

2. Demonstrating temporal graph analytics across two example time parameters: traversal start time and wait time;

3. Showing initial scalability results of the extended HavoqGT for time dependent algorithms on an HPC cluster.

2 Related Work

Algorithmic optimizations to compute time-dependent algorithms, such as shortest paths over large graphs by Ding et al. [7], suggest the need for faster computation for temporal graph analysis. A use case study by Cattuto et al. [18] models a time-varying social network as a property graph in the Neo4j graph database, focusing on database-type queries based on time rather than on algorithms for graph analytics such as time dependent SSSP. A distributed programming approach by Simmhan et al. [19] uses a temporally iterative BSP model on more specific time-series graphs, where edge parameters change in time more frequently than the topology. Han et al. [20] propose time-locality aware in-memory layouts of temporal graph snapshots for optimal performance in batch-scheduling of graph computations; however, the paper states that the benefits of the suggested optimizations are not prominent in more network-constrained environments. In our work, we use edge-based partitioning with distributed delegates to partition edges (duplicating an edge for each time duration in which it exists) and use an asynchronous visitor model to design graph analytics that execute on distributed-memory HPC systems.

Related research into efficient multithreaded betweenness centrality by Madduri et al. [12] and distributed-memory breadth-first search [8, 9, 10] has focused on algorithm-specific implementations and optimizations for static graphs on HPC systems. Additionally, recent frameworks like PowerGraph [14], GraphLab [15], Pregel [16], and GraphX [17] illustrate their use only on large static graphs. In this work, we extend an existing framework, HavoqGT, designed for graph analytics on HPC systems, for use on large scale time dependent graphs.

Table 1: Visitor Procedures and State

  pre_visit   Performs a preliminary evaluation of the state and returns
              true if the visitation should proceed.
  visit       Main visitor procedure.
  operator>   Greater-than comparison used to locally prioritize the
              visitor in a max-heap priority queue.
  vertex      Stored state representing the vertex to be visited.

3 Background

3.1 HavoqGT. HavoqGT (Highly Asynchronous Visitor Queue Graph Toolkit) [1] is a framework developed based on our previous work [21, 22, 23]. It presents a graph partitioning technique and computation model that distributes the storage, computation and communication for high-degree vertices (hubs) in large scale-free graphs by creating vertex delegates: hub vertices replicated at each partition, co-located with their edges' targets. The resulting partitioned graph reduces communication with the neighbors of hub vertices and allows faster parallel traversal. HavoqGT also provides a distributed framework that allows algorithms, implemented using an asynchronous visitor abstraction, to run in parallel and distributed environments on large scale-free graphs.

This work uses the distributed asynchronous visitor queue abstraction from HavoqGT to provide parallelism and to create a data-driven flow of computation. Traversal algorithms are created using a visitor abstraction, which allows an algorithm designer to define vertex-centric procedures to execute on traversed vertices, with the ability to pass visitor state to other vertices [23]. This abstraction requires certain visitor procedures and state to be defined in the visitor, as shown in Table 1. Algorithm implementations presented in this paper follow this abstraction. When an algorithm needs to traverse to another vertex, it dynamically creates a new visitor and pushes it into the visitor queue. When an algorithm begins, an initial set of visitors is pushed onto the queue and the framework's driver invokes the asynchronous traversal, running it to completion.
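HavoqGT itself is a C++/MPI framework; purely as an illustration of the visitor pattern of Table 1, the sketch below mimics it in serial Python with a local priority queue in place of the distributed asynchronous queue. All names here are hypothetical, not HavoqGT's API.

```python
import heapq

class Visitor:
    """Sketch of the Table 1 interface for a shortest-distance traversal."""
    def __init__(self, vertex, dist):
        self.vertex = vertex  # stored state: vertex to be visited
        self.dist = dist

    def pre_visit(self, state):
        # Proceed only if this visit improves the recorded distance.
        if self.dist < state.get(self.vertex, float("inf")):
            state[self.vertex] = self.dist
            return True
        return False

    def visit(self, graph, queue):
        # Traverse to neighbors by pushing new visitors onto the queue.
        for nbr, w in graph.get(self.vertex, []):
            queue.push(Visitor(nbr, self.dist + w))

    def __gt__(self, other):  # operator>: local prioritization
        return self.dist > other.dist

class VisitorQueue:
    """Serial stand-in for HavoqGT's distributed asynchronous queue."""
    def __init__(self, graph):
        self.graph, self.heap, self.state = graph, [], {}

    def push(self, v):
        heapq.heappush(self.heap, (v.dist, id(v), v))

    def run(self):
        # Driver: run the traversal to completion.
        while self.heap:
            _, _, v = heapq.heappop(self.heap)
            if v.pre_visit(self.state):
                v.visit(self.graph, self)
        return self.state

graph = {"a": [("b", 1), ("c", 4)], "b": [("c", 1)]}
q = VisitorQueue(graph)
q.push(Visitor("a", 0))
dist = q.run()  # shortest distances from "a"
```

In the real framework the queue is partitioned across processes and pre_visit also prunes redundant visitors before communication; the serial heap above only captures the control flow.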

Our previous work [23] showed that HavoqGT provided excellent scalability on large scale-free static graphs up to 131K cores of the IBM BG/P, and outperformed the best known Graph500 performance on the BG/P Intrepid. In this work, we build upon this existing scalable framework to provide graph analytics for time dependent graphs.


3.2 Datasets. To experiment with time dependent graphs, we required a dataset representative of time-varying large scale-free graphs. The CAIDA [24] dataset consists of ~8 TB of publicly available anonymized passive traffic traces and packet header traces, saved per direction per server, collected over 39 one-hour periods. Because individual packets represent fine-grained network communication mostly governed by the underlying protocol, a higher abstraction of network communication, the connection, is more likely to resemble numerous other real-world graphs.

Flows between two communicating sockets were derived by binning the packets exchanged between sockets by their start time into fixed-size time windows. All packets communicated between two sockets in a defined time window form a flow. As shown in Figure 1, Packet 1 and Packet 2 are combined to form a flow because both packets fall within the 0.0001s time window. A fixed timeout value from the start time of the first packet was used to gather a flow's packets, as formalized below:

Definition 3.1. A flow from X to Y, between two communicating sockets X and Y, corresponds to all the packets p_i transmitted from X to Y, where i = m to n (m <= n) and the p_i are ordered by non-decreasing start time, such that

1. start_time(p_m) is the flow's start time,

2. end_time(p_n) - start_time(p_m) <= timeout, and

3. start_time(p_{n+1}) > start_time(p_m) + timeout.

Applying this algorithm with a timeout of 0.0001s, we generated ~141 billion flows from the ~8 TB CAIDA datasets. The generated flow files were split into smaller chunks and stored on a Lustre distributed filesystem.
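The binning procedure can be sketched as follows. This is a simplified serial Python sketch with illustrative field names; for brevity it enforces only the timeout-from-first-packet condition (condition 3 of Definition 3.1) when deciding flow boundaries.

```python
def packets_to_flows(packets, timeout=0.0001):
    """Bin packets (start, end, size) between one socket pair into flows.

    Packets must be sorted by non-decreasing start time. A packet joins
    the current flow while its start time is within `timeout` of the
    flow's first packet; otherwise it opens a new flow.
    """
    flows = []
    cur = None
    for start, end, size in packets:
        if cur is None or start > cur["start"] + timeout:
            cur = {"start": start, "end": end, "size": size, "count": 1}
            flows.append(cur)
        else:
            # Accumulate size/count and merge time intervals (as in Figure 1).
            cur["end"] = max(cur["end"], end)
            cur["size"] += size
            cur["count"] += 1
    return flows

flows = packets_to_flows([(0.0, 0.00004, 100),
                          (0.00006, 0.00008, 200),
                          (0.001, 0.0011, 50)])
# The first two packets merge into one flow; the third starts a new flow.
```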

3.3 Time Dependent Single Source Shortest Path. An SSSP query in a weighted graph (with time-delay edge weights) requests the shortest-time path from a source vertex to all other vertices. In a time dependent graph, the SSSP is influenced largely by the current timestamp of the graph traversal because edges, unlike in time independent graphs, exist as a function of time.

In TD-SSSP, the arrival time at destination vertices depends on when the traversal begins at the source vertex, referred to as the traversal start time. A traversal may arrive at a vertex at different times because edges that exist at one point in traversal time can be different (they might not exist, or might have different edge weights) at another point in traversal time at the same parent in the current path. The notion of traversal waiting time at a vertex further adds to the dynamic behavior of the traversal, with the potential to provide more (sometimes optimal) path choices through waiting. As an example, the edge delay of a flow (the time required for the flow to reach the destination IP) at time t2, plus the waiting time at the current IP, can be less than the edge delay at any other time t1 (with t1 < t2). Thus, TD-SSSP answers queries for the optimal start time and wait time that minimize the total traversal cost.

Figure 1: Two packets are merged to form a flow. The flow begins with the first packet and then collects all subsequent packets until the timeout. Note that the size and packet-count parameters are accumulated over all the packets in the flow, and the start and end times of the flow merge the time intervals of the packets into one interval.

3.4 Temporal Betweenness Centrality. Betweenness centrality measures the extent to which a vertex in a graph is important or central with respect to its presence on the shortest paths between pairs of other vertices [25]. The betweenness centrality of a vertex is the ratio of the number of shortest paths between pairs of other vertices passing through that vertex to the total number of shortest paths between pairs in the graph. Temporal betweenness centrality measures the importance of a vertex on the temporal shortest paths between pairs of other vertices: for a given time interval, the temporal betweenness of a vertex is the sum, over each sub-interval, of the proportion of temporal shortest paths through that vertex to the total number of temporal shortest paths over all pairs of vertices. This computation relies on the calculation of temporal shortest paths across pairs at different time intervals, as described in Section 3.3.

4 Distributed Time Dependent Graph Construction

To process a time dependent graph in HavoqGT, the graph must be ingested from a raw unpartitioned input source. The input is an unordered edge list with edge metadata and temporal information. We make two passes: first we partition the graph topology, and second we decorate the graph with the corresponding metadata.

The graph topology is partitioned using the distributed delegate techniques described in [23]. The edge list of the input graph is provided to the framework using the source and destination IP pair from the metadata list. Several communication phases occur during partitioning to partition and balance the topology. We avoid bundling each edge with its metadata during the partitioning process, which could unnecessarily increase the communication volume. Instead, in the second pass, edge metadata is sent directly to the edge's final location using the asynchronous visitor abstraction.

Using the asynchronous visitor model, a visitor is created for each edge metadata entry and pushed into the global visitor queue, which delivers the visitor to the partition owning the source vertex of the edge, thus relaying the metadata to the outgoing edge of the corresponding vertex. However, for the high out-degree vertices that have been delegated, not all edges may be located on a single partition. In such cases, the controller broadcasts the edge metadata to the hub delegates and registers it on the outgoing edge of the delegate in the partition that owns the edge for that metadata.

In our experiments, we successfully ingested 1.35 billion edges (one hour of CAIDA flows) on the Catalyst cluster in around 5 minutes using 32 compute nodes and 24 processes per node. The raw edge input typically resides on a shared resource such as a distributed Lustre file system, sometimes causing large performance variations depending on the current load.

5 Distributed Graph Analytics

In this section, we show how we use multiple temporal-metadata entries per edge to represent a time dependent graph, and present implementation details of time dependent single source shortest path (TD-SSSP) and betweenness centrality (TD-BC).

5.1 Time Dependent Graphs. Each flow represents the packets sent from one socket to another during a time interval [start time, end time). Figure 1 shows a typical flow with its start and end times, along with the other parameters describing it (e.g., the total size transferred in the flow is 2908). Two communicating sockets can have different flows between them at different time intervals. We use flows to jointly represent edges and their metadata: the communication link from one socket to the other in a flow represents the directed edge, and all the other parameters, including the time interval, represent the metadata for that edge.

Figure 2: An example time dependent graph over three time intervals

Let the i-th flow pass from socket u_i to v_i, with start time t^s_i and end time t^f_i, and define [0, T] as the global time interval of the dataset, i.e., no flow has an end time beyond T. Then e_i = (u_i, v_i), the i-th edge, represents the link in the i-th flow, and m_i = (s_i = t^s_i, f_i = t^f_i, other_i), the metadata for the i-th edge, contains all the other information for the i-th flow, where u_i is the source vertex, v_i is the destination vertex, s_i is the start time of edge i, f_i is the finish time of edge i, and other_i represents the other parameters of the flow. Let (T_s, T_f) be any time-interval window. If M(T_s, T_f) is the set of metadata such that T_s <= start_time(m_i in M(T_s, T_f)) <= T_f, then the edges corresponding to the metadata in M(T_s, T_f) form the set E(T_s, T_f) of edges that exist in the interval (T_s, T_f). Thus the graph G<T_s, T_f> = (V, E(T_s, T_f)) is the time dependent graph in the interval (T_s, T_f), and different graph topologies with different edge parameters are seen for 0 <= T_s < T_f <= T.
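Under these definitions, extracting E(T_s, T_f) amounts to filtering the per-edge metadata entries by start time. A minimal Python sketch (with illustrative field names, not HavoqGT's storage format):

```python
def edges_in_window(edges, ts, tf):
    """Return E(Ts, Tf): edges whose metadata start time lies in [Ts, Tf]."""
    return [(u, v, m) for (u, v, m) in edges if ts <= m["start"] <= tf]

# Two flows on the same link at different intervals give two metadata
# entries for the same directed edge (illustrative values).
edges = [
    ("a", "b", {"start": 0.5, "end": 0.9}),
    ("a", "b", {"start": 1.5, "end": 1.8}),
    ("b", "c", {"start": 2.2, "end": 2.4}),
]
g1 = edges_in_window(edges, 0.0, 1.0)   # topology of G<0, 1>
g2 = edges_in_window(edges, 1.0, 2.5)   # topology of G<1, 2.5>
```

In the distributed setting this filter is applied locally per partition during traversal rather than materializing each snapshot.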

Figure 2 shows an example of a time dependent graph modeled from multiple metadata entries per edge. If M(0, t1) = {m1, m3, m5}, M(t1, t2) = {m1, m4, m6} and M(t2, t3) = {m2, m7, m8} are three sets of metadata over the time series 0, t1, t2 and t3, then three graphs G1, G2 and G3 are obtained at those time intervals, as shown in Figure 2. V0D and V0C form the distributed partitioned hub vertex generated by the graph construction method discussed in [23], with V0C the controller and V0D the delegate.

Figure 3: Number of connected vertices from a single source with respect to the change in start time, for different waiting times. Time independent traversal from this source has 133 connected vertices, while the time dependent traversal creates significant variation that is further affected by the waiting time spent at each vertex.

5.2 Algorithm Implementation Details. The addition of temporal metadata to the graph topology supports implementation of various time dependent algorithms in HavoqGT. We discuss the implication of the temporal information in the graph with respect to the connectedness of a single vertex and show two algorithm implementations: the time dependent single source shortest path algorithm and temporal betweenness centrality. Our framework simplifies the design of time dependent algorithms to be run on HPC systems by providing a programming abstraction. The algorithms shown below highlight the use of this abstraction, along with local access to the large temporal information, to achieve scalable graph analytics on large time dependent graphs.

5.2.1 Connected Vertices. A time dependent version of connected components is used to count connected vertices from a single source for different start times and waiting times. Waiting time is the maximum time a traversal can wait at each vertex before visiting its neighbors; it can vary the range of traversable edges, which is significant for planning routes in complex networks. The plot in Figure 3 shows the connected vertices from a single source in the CAIDA data. The count of connected vertices remains constant in the time-flattened version of our graph (the traditional static graph model). In the time dependent graph, the start time and the waiting time have a significant impact on the count. Beginning the traversal at a later time showed comparatively greater connectivity for waiting times of 1-15 minutes, which in the context of TD-SSSP could yield a shorter arrival time at the destination due to a wider range of path choices.

5.2.2 Time Dependent Single Source Shortest Path. We use a label-correcting approach to compute the time dependent single source shortest path (TD-SSSP), which records the shortest arrival time at each vertex and traverses the neighbors that exist between arrival time and arrival time + waiting time.


The TD-SSSP begins by visiting the outgoing edges of a source vertex that exist within the start time and waiting time. The timestamp of a traversed edge determines the new arrival time at the neighboring vertex. The TD-SSSP visitor is shown in Algorithm 1. When visiting a vertex at a specific arrival time, it uses the following three conditions: (a) if the arrival time at the current vertex is less than the last recorded smallest arrival time, record this time as the shortest arrival time (Alg. 1, line 5); (b) only visit the vertex if the new arrival time is registered (Alg. 1, line 8); (c) only traverse the neighbors that exist for the waiting-time period after the arrival time at this vertex (Alg. 1, line 15).

Algorithm 1 Time Dependent Single Source Shortest Path

1: visitor state: vertex ▷ vertex to be visited
2: visitor state: arrival_time ▷ path arrival time
3: visitor state: parent ▷ path parent
4: procedure pre_visit(vertex_data)
5:     if arrival_time < vertex_data.arrival_time then
6:         vertex_data.arrival_time ← arrival_time
7:         vertex_data.parent ← parent
8:         return true
9:     end if
10:    return false
11: end procedure
12: procedure visit(graph, visitor_queue)
13:    for all e_i ∈ out_edges(graph, vertex) do
14:        metadata ← edges_metadata(e_i)
15:        not_earlier ← arrival_time + waiting_time
16:                      ≥ metadata.start_time
17:        not_later ← arrival_time < metadata.end_time
18:        edge_exists ← not_earlier and not_later
19:        if edge_exists == true then
20:            new_arrival_time ← metadata.end_time
21:            new_visitor ← td_sssp_visitor(e_i.target,
22:                          new_arrival_time, vertex)
23:            visitor_queue.push(new_visitor)
24:        end if
25:    end for
26: end procedure
27: procedure operator>(td_sssp_visitor a, td_sssp_visitor b)
28:    return a.arrival_time > b.arrival_time
29: end procedure

We also record the parent of each vertex on the shortest path found from the source, thus forming a shortest path tree rooted at the source. This tree yields the earliest arrival time from the source to each vertex and also the actual waiting time spent at each vertex on the shortest path from the source.
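For illustration, Algorithm 1 can be condensed into a serial label-correcting routine. The sketch below uses a local priority queue in place of HavoqGT's distributed visitor queue and a simple edge-list encoding; it is an analogue of the visitor logic, not the paper's distributed implementation.

```python
import heapq

def td_sssp(edges, source, start_time, waiting_time):
    """Serial analogue of Algorithm 1.

    edges: list of (u, v, edge_start, edge_end); traversing an edge
    arrives at v at the edge's end time. An edge is usable at arrival
    time t iff t + waiting_time >= edge_start and t < edge_end.
    """
    out = {}
    for u, v, s, f in edges:
        out.setdefault(u, []).append((v, s, f))
    arrival = {source: start_time}
    parent = {source: None}
    heap = [(start_time, source)]
    while heap:
        t, u = heapq.heappop(heap)
        if t > arrival.get(u, float("inf")):
            continue  # stale entry (pre_visit would reject it)
        for v, s, f in out.get(u, []):
            # not_earlier and not_later conditions, then label correction
            if t + waiting_time >= s and t < f and f < arrival.get(v, float("inf")):
                arrival[v] = f
                parent[v] = u
                heapq.heappush(heap, (f, v))
    return arrival, parent

edges = [("s", "a", 0.0, 1.0), ("a", "b", 2.0, 3.0), ("a", "b", 0.5, 0.8)]
arrival, parent = td_sssp(edges, "s", start_time=0.0, waiting_time=5.0)
```

With waiting_time = 5.0 the traversal can wait at "a" for the later edge to "b"; shrinking the waiting time makes "b" unreachable, illustrating the dependence on both time parameters.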

5.2.3 Temporal Betweenness Centrality. We identified two major steps in evaluating the unnormalized but exact betweenness centrality: a) computing the temporal shortest path between all pairs, and b) accumulating, per vertex and per source, the number of times the vertex is traversed on a path from the leaves of the source's shortest path tree to its root. We extended TD-SSSP (Section 5.2.2) to start the traversal from all vertices and kept a record of the temporal shortest path from each source. Next, we employed a distributed recursive algorithm in HavoqGT to process the TD-SSSP data per source to obtain the count of shortest paths through each vertex.

The recursive algorithm starts by computing the out-degree of each vertex per shortest path tree from a single source. This out-degree count per shortest path tree identifies whether all the children of a vertex have been processed, so that the vertex can decide if the path count collected so far is complete and should be passed up to its parent. For each source, we start traversal from all vertices in the source's shortest paths that have no children. The vertex's parent is then traversed and its child count decremented by one (Alg. 2, line 5), which signifies a completed traversal at that child. The count passed from that child represents the total number of shortest paths that go through that child, and is therefore added to the total count of shortest paths through this node (Alg. 2, line 6). The parent then checks whether its child count is zero (Alg. 2, line 7). If not, it terminates that traversal and returns (Alg. 2, line 10). Eventually, one of its children will come back to this parent and set the child count to zero, at which point the parent starts its traversal towards its own parent (Alg. 2, line 8).

6 Experimental Results

All our experiments were executed on Catalyst, a 150 TeraFLOP/s Linux HPC cluster located at Lawrence Livermore National Laboratory. The results are based on applying the experimental distributed graph analytics to one hour of CAIDA flows comprising 1.35 billion edges (~190 GB).

Figure 3 summarizes the results obtained from traversing the time dependent CAIDA graph from an example single source to record its connectivity to other vertices at various start times and waiting times. The variation in connectivity at various start times (plotted on the x-axis in Figure 3) is further analyzed for different waiting times. The globally assigned waiting time for a


Algorithm 2 Temporal Betweenness Centrality

1: visitor state: vertex ▷ vertex to be visited
2: visitor state: source ▷ source of the traversal
3: visitor state: count ▷ shortest path count
4: procedure pre_visit(vertex_data)
5:     vertex_data[source].child_count ←
           vertex_data[source].child_count - 1
6:     vertex_data[source].shortest_path_count ←
           vertex_data[source].shortest_path_count + count
7:     if vertex_data[source].child_count == 0 then
8:         return true
9:     else
10:        return false
11:    end if
12: end procedure
13: procedure visit(graph, visitor_queue)
14:    parent ← SSSP[source].parent_of(vertex)
15:    source_data ← vertex_data[source]
16:    count ← source_data.shortest_path_count
17:    new_visitor ← bc_visitor(parent, source, count)
18:    visitor_queue.push(new_visitor)
19: end procedure

traversal shows a significant effect in changing the results of various well-known graph algorithms and thus correspondingly impacts the computation time. The variation shown in Figure 3 reflects the underlying data more accurately than the time-aggregated version, which would treat the connectivity the same across all time ranges.

6.1 Temporal Betweenness Centrality. The unnormalized temporal betweenness centrality values of four example vertices with respect to the path waiting time are shown in Figure 4. On the experimented CAIDA dataset, the plotted vertices show an increasing trend in BC value with increasing waiting time at a vertex, as greater waiting time can make more neighbors traversable. Such results can be used to find vertices with the maximum increasing trend of becoming central within a desired time range.

The computation time of temporal betweenness centrality is shown in Figure 5. The performance varies greatly as a function of traversal start time and waiting time. The variance for a range of start times is shown using 64 compute nodes of the Catalyst cluster. For each waiting-time value (x-axis), 20 different start times are shown by the variance bars. As the waiting time is increased, the traversal complexity increases as

Figure 4: Plot of unnormalized exact betweenness centrality of four example vertices for varying waiting time of the shortest path traversal, for a fixed traversal start time.

Figure 5: Plot of the variance of the computation time for exact betweenness centrality executed on 64 compute nodes, for traversal start times within the first 20 minutes of the test data, with respect to waiting-time periods from half an hour to the full hour. Intuitively, more computation is required for higher waiting times at earlier start times, as more neighboring edges are available when we begin early and wait longer.


Figure 6: Plot of the variance of computation time across various start times with respect to the number of compute nodes used to process the time dependent exact betweenness centrality.

the number of eligible paths increases.

Finally, Figure 6 shows the result of testing strong scaling of the betweenness centrality executed on the Catalyst cluster. Variation in computation time with respect to differing start times was computed for an infinite waiting time in order to test the scalability of the framework. The computation time decreases with an increasing number of compute nodes, and scales near-linearly up to approximately 32 compute nodes. This strong-scaling limit is not uncommon for distributed graph analytics. In future work, we will investigate weak scaling to larger datasets.

7 Summary

We presented our ongoing work to annotate a scale-free graph in distributed memory with temporal metadata to model a time dependent graph. With the increased capability of our distributed framework for high performance temporal graph analytics, HavoqGT, we have shown implementations of multiple important graph analytics on time dependent graphs. We provide initial scalability results on a data-intensive HPC cluster for temporal algorithms using two global time parameters, start time and waiting time, to provide analysis of time dependent graphs derived from CAIDA datasets.

8 Acknowledgments

This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344 (LLNL-CONF-676816). Funding was partially provided by LDRD 13-ERD-025. Experiments were performed at the Livermore Computing facility.

References

[1] "HavoqGT." http://havoqgt.bitbucket.org/.
[2] P. Holme and J. Saramaki, "Temporal networks," Physics Reports, vol. 519, no. 3, pp. 97-125, 2012.

[3] A. Orda and R. Rom, "Shortest-path and minimum-delay algorithms in networks with time-dependent edge-length," Journal of the ACM (JACM), vol. 37, no. 3, pp. 607-625, 1990.

[4] A. K. Ziliaskopoulos and H. S. Mahmassani, "Time-dependent, shortest-path algorithm for real-time intelligent vehicle highway system applications," Transportation Research Record, pp. 94-94, 1993.

[5] J. Tang, C. Mascolo, M. Musolesi, and V. Latora, "Exploiting temporal complex network metrics in mobile malware containment," in World of Wireless, Mobile and Multimedia Networks (WoWMoM), 2011 IEEE International Symposium on, pp. 1-9, IEEE, 2011.

[6] T. Yasseri, R. Sumi, and J. Kertesz, "Circadian patterns of Wikipedia editorial activity: A demographic analysis," PLoS ONE, vol. 7, no. 1, p. e30091, 2012.

[7] B. Ding, J. X. Yu, and L. Qin, "Finding time-dependent shortest paths over large graphs," in Proceedings of the 11th International Conference on Extending Database Technology: Advances in Database Technology, pp. 205-216, ACM, 2008.

[8] A. Yoo, A. Baker, R. Pearce, and V. Henson, "A scalable eigensolver for large scale-free graphs using 2D graph partitioning," in Supercomputing, pp. 1-11, 2011.

[9] A. Buluc and K. Madduri, "Parallel breadth-first search on distributed memory systems," in Supercomputing, 2011.

[10] F. Checconi, F. Petrini, J. Willcock, A. Lumsdaine, A. R. Choudhury, and Y. Sabharwal, "Breaking the speed and scalability barriers for graph exploration on distributed-memory machines," in Supercomputing, 2012.

[11] "The Graph500 benchmark." www.graph500.org.
[12] K. Madduri, D. Ediger, K. Jiang, D. Bader, and D. Chavarria-Miranda, "A faster parallel algorithm and efficient multithreaded implementations for evaluating betweenness centrality on massive datasets," in Parallel & Distributed Processing, 2009. IPDPS 2009. IEEE International Symposium on, pp. 1-8, May 2009.

[13] A. Buluc and K. Madduri, "Parallel breadth-first search on distributed memory systems," in Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, (New York, NY, USA), pp. 65:1-65:12, ACM, 2011.

[14] J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin, "PowerGraph: Distributed graph-parallel computation on natural graphs," in Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation, OSDI '12, (Berkeley, CA, USA), pp. 17-30, USENIX Association, 2012.

[15] Y. Low, D. Bickson, J. Gonzalez, C. Guestrin, A. Kyrola, and J. M. Hellerstein, "Distributed GraphLab: A framework for machine learning and data mining in the cloud," Proc. VLDB Endow., vol. 5, pp. 716-727, Apr. 2012.

[16] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski, "Pregel: A system for large-scale graph processing," in Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, SIGMOD '10, (New York, NY, USA), pp. 135-146, ACM, 2010.

[17] R. S. Xin, J. E. Gonzalez, M. J. Franklin, and I. Stoica, "GraphX: A resilient distributed graph system on Spark," in First International Workshop on Graph Data Management Experiences and Systems, GRADES '13, (New York, NY, USA), pp. 2:1-2:6, ACM, 2013.

[18] C. Cattuto, M. Quaggiotto, A. Panisson, and A. Averbuch, "Time-varying social networks in a graph database: A Neo4j use case," in First International Workshop on Graph Data Management Experiences and Systems, GRADES '13, (New York, NY, USA), pp. 11:1-11:6, ACM, 2013.

[19] Y. Simmhan, N. Choudhury, C. Wickramaarachchi, A. Kumbhare, M. Frincu, C. Raghavendra, and V. Prasanna, "Distributed programming over time-series graphs," in Parallel and Distributed Processing Symposium (IPDPS), 2015 IEEE International, pp. 809-818, May 2015.

[20] W. Han, Y. Miao, K. Li, M. Wu, F. Yang, L. Zhou, V. Prabhakaran, W. Chen, and E. Chen, "Chronos: A graph engine for temporal graph analysis," in Proceedings of the Ninth European Conference on Computer Systems, EuroSys '14, (New York, NY, USA), pp. 1:1-1:14, ACM, 2014.

[21] R. Pearce, M. Gokhale, and N. M. Amato, "Multithreaded asynchronous graph traversal for in-memory and semi-external memory," in Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, SC '10, (Washington, DC, USA), pp. 1-11, IEEE Computer Society, 2010.

[22] R. Pearce, M. Gokhale, and N. M. Amato, "Scaling techniques for massive scale-free graphs in distributed (external) memory," in Proceedings of the 2013 IEEE 27th International Symposium on Parallel and Distributed Processing, IPDPS '13, (Washington, DC, USA), pp. 825-836, IEEE Computer Society, 2013.

[23] R. Pearce, M. Gokhale, and N. M. Amato, "Faster parallel traversal of scale free graphs at extreme scale with vertex delegates," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '14, (Piscataway, NJ, USA), pp. 549-559, IEEE Press, 2014.

[24] CAIDA, "The CAIDA UCSD Anonymized Internet Traces." http://www.caida.org/data/overview/, 2008.

[25] L. C. Freeman, “A set of measures of centrality basedon betweenness,” Sociometry, pp. 35–41, 1977.


GSK: Graph Sparsification as a Knapsack problem formulation

Hongyuan Zhan, Kamesh Madduri
The Pennsylvania State University

[email protected], [email protected]

Abstract

Graph edge sparsification, or filtering a substantial number of edges in a methodical manner, has several graph analysis-related uses. In this paper, we present GSK, a new problem formulation for graph edge sparsification based on the knapsack problem. We show that several prior methods for edge sparsification can be expressed as instances of the proposed GSK problem. We implement a fast linear-time approximation scheme to solve the GSK optimization problem. We also present a preliminary empirical evaluation of GSK-based sparsification strategies, comparing them to prior methods on a collection of test graphs.

Keywords: sparsification, knapsack, edge centrality.

1 Introduction

We present a simple formulation for graph sparsification based on the 0/1 knapsack problem. Informally, the 0/1 knapsack problem can be stated as follows: given a collection of items, each with an associated weight and value, the goal is to choose a subset of the items such that the cumulative value of the chosen items is maximized, and the total weight of the selected subset of items is less than or equal to a user-specified weight budget. The main idea of our GSK formulation is to treat each edge in the graph as an item and to associate weights and values with edges, so that sparsification strategies can be appropriately defined.

When sparsifying a graph, we would like to preserve structural or statistical properties of the original graph as much as possible, but reduce the number of edges. We assume that the vertex set V remains the same. Let G~(V, E~) denote the graph after sparsification. We would like to minimize the loss of information, modeled by an appropriate loss function L(G, G~). Since the set of vertices is unchanged, we may write min L(G, G~) as max f(E~) for some function f. User-defined edge weights or costs {c_e | e in E} provide fine-grained control over whether an edge should be included or not. A user-specified knapsack cost upper bound W can also be set appropriately, in the same units as c, to control the sparsity of the simplified network. When f is a linear function, i.e., associating a set of profits or values {p_e | e in E} with edges, graph sparsification can be written as a linear knapsack problem:

(GSK)   max_{E~ ⊆ E}  Σ_{e in E~} p_e   subject to   Σ_{e in E~} c_e ≤ W.

A natural way to use this problem formulation is to set profits to computed global or local edge centrality values. Edge centrality values attempt to quantitatively rank edges. Sparsification using the above knapsack formulation then means filtering low-centrality edges while satisfying user-specified linear constraints.
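For instance, with centrality-derived profits and user-defined costs, a standard greedy profit-to-cost ratio heuristic selects edges as follows. This is a classic knapsack approximation shown only for illustration; it is not the paper's linear-time approximation scheme.

```python
def gsk_greedy(items, profits, costs, budget):
    """Greedy profit-to-cost heuristic for the GSK knapsack.

    Takes edges in decreasing p_e / c_e order while the remaining
    budget allows, enforcing the constraint sum c_e <= W.
    """
    order = sorted(range(len(items)),
                   key=lambda i: profits[i] / costs[i], reverse=True)
    kept, total = [], 0.0
    for i in order:
        if total + costs[i] <= budget:
            kept.append(items[i])
            total += costs[i]
    return kept

# Edges with centrality-derived profits and user-defined costs.
kept = gsk_greedy(["e1", "e2", "e3"], profits=[6, 5, 1],
                  costs=[3, 5, 1], budget=4)
```

Here "e2" is skipped because it would exceed the budget W = 4, while the cheap "e3" still fits.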

This paper makes the following contributions. In Section 2, we show that a number of prior edge sparsification methods can be expressed as special cases of the problem (GSK). We implement a fast solution strategy for this problem, discussed in Section 3. In Section 4, we perform a preliminary exploration of the design space enabled by our knapsack formulation, by encoding pairs of complementary edge centrality values as profits and costs. We also empirically evaluate these sparsification strategies and the performance of our solver on several test graph instances in Section 5.

2 Transforming Prior Methods to GSK

Graph edge sparsification is performed for a variety of reasons. We briefly discuss the motivation for each of the following strategies, and then describe how they can be recast as instances of our proposed knapsack-based problem.

2.1 Backbone Extraction. A large collection of relational data sets can be modeled as weighted graphs. Backbone extraction refers to the problem of determining a reduced representation of the original network with fewer edges. The backbone highlights the structure of the network and preserves key characteristics. Serrano et al. [23] propose a method to extract backbones from weighted graphs. First, edge weights are normalized locally. Let $z_{ij}$ denote the weight of edge $\langle i, j\rangle$. The weight is changed to $w_{ij} = z_{ij} / \sum_{\{k \mid \langle i,k\rangle \in E\}} z_{ik}$.


The denominator is the sum of the weights of all edges connected to vertex $i$. In the case of undirected graphs, each edge is considered twice, and weights are normalized according to vertices $i$ and $j$ separately. In the next step, a statistical test against a null model of a normalized edge weight distribution is used to determine edges that are statistically significant. The null model assumes that normalized edge weights of a degree-$k$ vertex are proportional to random assignments of $k$ sub-intervals from $[0, 1]$. Under this null model, the probability of observing an edge with normalized weight of at least $w_{ij}$ is

$$\alpha_{ij} \overset{\text{def}}{=} 1 - (k-1) \int_0^{w_{ij}} (1-x)^{k-2}\,dx.$$

A local significance level $\alpha$ is selected, and if $\alpha_{ij} < \alpha$, the observed edge is considered to be statistically significant and retained. Other edges are filtered.
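The integral in the disparity-filter test has the closed form $\alpha_{ij} = (1 - w_{ij})^{k-1}$, which makes the test trivial to compute. Below is a hedged Python sketch for a single vertex; the raw edge weights are made up for illustration.

```python
# Disparity filter of Serrano et al. [23], sketched locally at one vertex.
# Uses the closed form alpha_ij = (1 - w_ij)^(k - 1) of the integral above.

def backbone_edges(weights, alpha=0.05):
    """Return indices of statistically significant edges at a degree-k vertex.

    weights: raw weights z_ik of the k edges incident on vertex i.
    """
    k = len(weights)
    total = sum(weights)
    keep = []
    for idx, z in enumerate(weights):
        w = z / total                   # locally normalized weight w_ij
        p_value = (1.0 - w) ** (k - 1)  # P(normalized weight >= w) under the null
        if p_value < alpha:
            keep.append(idx)
    return keep

# One dominant edge among five: only it passes the significance test.
print(backbone_edges([50, 1, 1, 1, 1], alpha=0.05))  # [0]
```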

Foti et al. [9] consider a related approach using a non-parametric statistical test. The same weight normalization step is performed, but the statistical test is replaced by

$$\alpha_{ij} = 1 - \frac{1}{d_i} \sum_{\{k \mid \langle i,k\rangle \in E\}} \mathbf{1}\{w_{ik} \le w_{ij}\}.$$

Here, $\mathbf{1}\{\cdot\}$ denotes the indicator function, which equals 1 if the expression in the bracket is true and 0 otherwise, and $d_i$ denotes the degree of vertex $i$. $\alpha_{ij}$ is considered to be the probability of empirically observing an edge with normalized weight at least $w_{ij}$ locally at vertex $i$. These two backbone extraction methods can be viewed as a special case of the knapsack problem:

$$\max_{\hat{E} \subseteq E} \sum_{i \in V} \sum_{\langle i,j\rangle \in \hat{E}} (\alpha - \alpha_{ij}).$$

The above problem can be considered a Lagrangian relaxation of problem (GSK), with no constraints. It can be easily solved by selecting the edges with positive profits, i.e., $\alpha - \alpha_{ij} > 0$.

2.2 Similarity Filters for Clustering. Satuluri et al. [22] use edge sparsification as a preprocessing step to enhance the scalability of graph vertex clustering algorithms. The goal is not to sacrifice the quality of clustering results with the sparsified graph. Each edge $\langle i, j\rangle$ is evaluated based on the similarity of vertices $i$ and $j$. Similarity is defined using the Jaccard score:

$$\mathrm{Jac}_{ij} = \frac{|N(i) \cap N(j)|}{|N(i) \cup N(j)|},$$

where $N(i)$ refers to the set of neighboring vertices of $i$. The authors propose two sparsification schemes based on the Jaccard score. The global similarity method first ranks each edge in the network using $\mathrm{Jac}$; then the top $s\%$ of edges ranked by $\mathrm{Jac}$ are retained in the sparsified network. Therefore, the global similarity method can be viewed as problem (GSK) with unit edge weights, i.e., an edge cardinality constraint:

$$\max_{\hat{E} \subseteq E} \sum_{e \in \hat{E}} \mathrm{Jac}_e \quad \text{subject to} \quad |\hat{E}| \le W.$$

In addition to filtering edges globally, the authors also consider ranking edges locally at each vertex, similar to [23] and [9]. Starting with a network of the same vertex set and an empty edge set, for each vertex $i \in V$, the top $\lceil d_i^{\alpha} \rceil$ (exponent $\alpha \in (0, 1)$) of the incident edges in the original network, ranked by similarity score $\mathrm{Jac}_{ij}$, are added back. The intuition is that similar vertices are more likely to reside in the same cluster, and so intra-cluster edges are retained. This problem can be viewed as an unconstrained knapsack problem by setting the profit values of edges that are to be filtered to zero, and the profit values of the edges to be retained to their Jaccard scores. We could impose additional filtering constraints to complement this local similarity filter, as we will discuss in Section 4.
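The local similarity filter just described can be sketched in a few lines of Python. The adjacency-set representation and the tiny example graph are our own choices for illustration; an edge survives if either endpoint ranks it within its local quota $\lceil d_i^{\alpha} \rceil$.

```python
import math

# Local Similarity filter of Satuluri et al. [22], sketched: keep the top
# ceil(d_i^alpha) incident edges at each vertex, ranked by Jaccard score.

def jaccard(adj, i, j):
    return len(adj[i] & adj[j]) / len(adj[i] | adj[j])

def local_similarity_filter(adj, alpha=0.5):
    kept = set()
    for i in adj:
        ranked = sorted(adj[i], key=lambda j: jaccard(adj, i, j), reverse=True)
        budget = math.ceil(len(adj[i]) ** alpha)
        for j in ranked[:budget]:
            kept.add(frozenset((i, j)))  # undirected: kept if either side ranks it
    return kept

# Triangle {0,1,2} plus a pendant vertex 3: the pendant edge (0,3) has Jaccard
# score 0 at vertex 0, yet survives via vertex 3's own local quota.
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
print(len(local_similarity_filter(adj, alpha=0.5)))  # 4: all edges survive here
```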

2.3 Local Degree Filter. Lindner et al. [19] recently evaluated a degree-based sparsification strategy. Similar to the Local Similarity filter method, each edge is first rated using a score function, and then a filtering threshold is applied locally at each vertex. For each edge $\langle i, j\rangle$ incident on vertex $i$, the score function $\mathrm{Deg}_{ij}$ chosen here is the degree of vertex $j$. For each vertex $i$, the top-ranked $\lceil d_i^{\alpha} \rceil$ of attached edges are retained. The motivation behind using vertex degree for filtering is to preserve key graph vertices, or hub vertices, after sparsification. Analogous to the Local Similarity method, this method can be written as a special unconstrained case of problem (GSK).

2.4 Other Related Sparsification Methods. Graph sparsification has been studied in several different contexts. Cut sparsifiers [3, 10] aim at creating a graph with the same set of vertices, but with fewer edges, such that every cut of the graph is preserved up to a multiplicative factor. Spectral sparsifiers, introduced by Spielman and Teng [28], are another well-known line of work; the goal there is to preserve the spectral properties of the graph Laplacian. Bonchi et al. [5] define the problem of activity-preserving graph simplification. Their approach requires additional information on activity traces in the network. The goal is to preserve observed activity traces while simplifying the graph.

A recent paper by Wilder and Sukthankar [29] formulates a sparsification method for social networks which aims to preserve the stationary distribution of random walks. Their method can be expressed as a nonlinear optimization problem with linear knapsack constraints [6, 24]:

$$\text{(GSK-NL)} \qquad \max_{\hat{E} \subseteq E} f(\hat{E}) \quad \text{subject to} \quad \sum_{e \in \hat{E}} c_e \le W.$$

Optimization problems of the form (GSK-NL) have been studied and applied to other problems related to network analysis. Leskovec et al. [18] studied the problem of selecting a subset of vertices in a network for sensor placement under budget constraints. This application can also be viewed as a vertex-filtering-based graph sparsification problem.

3 A Fast Approximation Scheme

In this section, we discuss our implementation of a fast approximation algorithm for the 0/1 knapsack problem. Note that the sparsification methods discussed in the prior section (backbone extraction [23, 25], the Local Similarity Filter [22], and the Local Degree Filter [19]) are straightforward to implement and do not need to be solved under the knapsack formulation. We want to create a general sparsification framework that permits global or local edge centrality-based filtering, with the flexibility of introducing user-defined linear constraints.

Given integer profits and weights, the problem (GSK) can be solved exactly in pseudopolynomial time using a dynamic programming-based scheme [2, 7, 13]. However, this approach is prohibitively expensive for large graphs, and in our sparsification scenarios, it is not guaranteed that profits and weights are integers. Approximation algorithms for the 0/1 knapsack problem and its variants have been studied extensively [1, 8, 11, 12, 14, 16, 20, 21]. There are two main kinds of approximation algorithms for knapsack: fully polynomial-time approximation schemes (FPTAS) that can achieve a result that is $1 - \varepsilon$ of the optimal value for any input error tolerance $\varepsilon$ [11, 12, 16, 21], and linear-time greedy approximation algorithms that guarantee approximation quality as a fixed percentage of the optimal value [1, 8, 20]. The complexity of FPTAS algorithms is independent of the knapsack cost bound $W$, but polynomial in the number of items. In this section, we apply a greedy approximation algorithm based on [1, 14, 18] and explain how it can be implemented in $O(|E|)$ time. We assume that the profits $\{p_e \mid e \in E\}$ and the costs $\{c_e \mid e \in E\}$ are precomputed.

The commonly used greedy approximation method for the knapsack problem [1, 8, 14] starts with an empty set $E^F_0$, and expands $E^F_k$ to $E^F_{k+1} = E^F_k \cup \{e^F_k\}$ by adding the edge $e^F_k$ that gives the maximal "efficiency",

$$e^F_k = \arg\max_{e \in E \setminus E^F_k} \frac{p_e}{c_e},$$

until reaching the index $k^F$ such that $\sum_{i=0}^{k^F-1} c_{e^F_i} \le W < \sum_{i=0}^{k^F} c_{e^F_i}$. Selecting either $\bigcup_{i=0}^{k^F-1} \{e^F_i\}$ or $\{e^F_{k^F}\}$, whichever corresponds to $\max\{\sum_{i=0}^{k^F-1} p_{e^F_i},\ p_{e^F_{k^F}}\}$, gives a 2-approximation to the knapsack problem [1, 8, 14]. However, in the context of the graph simplification formulation (GSK), if $p_{e^F_{k^F}} > \sum_{i=0}^{k^F-1} p_{e^F_i}$, constructing a simplified graph $\hat{G} = (V, \{e^F_{k^F}\})$ with just a single edge may not be really usable. However, since the linear objective $f(\hat{E}) = \sum_{e \in \hat{E}} p_e$ is a submodular function with $f(E_A \cup \{e\}) - f(E_A) = f(E_B \cup \{e\}) - f(E_B) = p_e$, the greedy algorithm for submodular function maximization under knapsack constraints [15, 18] can be applied here. Analogous to expanding the edge set based on the greedy efficiency rule, the greedy profit rule initializes $E^P_0 = \emptyset$ and selects, in the $k$th iteration, the edge $e^P_k$ that gives the largest profit:

$$e^P_k = \arg\max_{e \in E \setminus E^P_k} p_e,$$

until the iteration $k^P$ at which adding $e^P_{k^P}$ would exceed the knapsack weight constraint in problem (GSK). Leskovec et al. [18] show that using $\max\{\sum_{i=0}^{k^F-1} p_{e^F_i},\ \sum_{i=0}^{k^P-1} p_{e^P_i}\}$ results in a solution that is at least $\frac{1}{2}(1 - 1/e)$ of the optimal solution for general non-decreasing submodular functions. Hence, this result is applicable to problem (GSK) as well. The case of a single edge being retained is unlikely to occur for the graphs we consider and when using edge centrality measures such as the Jaccard score.
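For intuition, the classic efficiency-greedy 2-approximation can be sketched as follows. The items, profits, and costs are toy values chosen so that the "single heavy item wins" case, discussed above as problematic for sparsification, actually occurs.

```python
# 2-approximation for 0/1 knapsack: fill greedily by efficiency p_e/c_e,
# then return the better of the greedy prefix and the first item that
# did not fit (the "break" item).

def knapsack_2approx(items, profit, cost, W):
    ranked = sorted(items, key=lambda e: profit[e] / cost[e], reverse=True)
    prefix, spent = [], 0.0
    breaker = None
    for e in ranked:
        if spent + cost[e] <= W:
            prefix.append(e)
            spent += cost[e]
        else:
            breaker = e  # first item exceeding the remaining budget
            break
    if breaker is not None and profit[breaker] > sum(profit[e] for e in prefix):
        return [breaker]
    return prefix

items = ["x", "y", "z"]
profit = {"x": 2.0, "y": 1.8, "z": 10.0}
cost = {"x": 1.0, "y": 1.0, "z": 9.0}
print(knapsack_2approx(items, profit, cost, W=9))  # ['z']: single item beats prefix
```

With $W = 9$ the greedy prefix $\{x, y\}$ has profit 3.8, but the break item $z$ alone has profit 10, so the approximation returns a single item, which is exactly the degenerate outcome the profit rule above is designed to hedge against.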

The edge sets $\{e^F_i\}_{i=0}^{k^F-1}$ and $\{e^P_i\}_{i=0}^{k^P-1}$ can be determined by sorting $\{p_e\}_{e \in E}$ and $\{p_e/c_e\}_{e \in E}$. A faster method that avoids sorting is to use weighted medians. Given a list of profits $\{p_e \mid e \in E\}$ and nonnegative costs $\{c_e \mid e \in E\}$, define the weighted median profit to be the profit $p^* \in \{p_e \mid e \in E\}$ such that

$$\sum_{e : p_e > p^*} c_e < W \le \sum_{e : p_e \ge p^*} c_e.$$

Similarly, let $f_e = p_e/c_e$, and define the weighted median efficiency to be the $f^* \in \{f_e \mid e \in E\}$ satisfying

$$\sum_{e : f_e > f^*} c_e < W \le \sum_{e : f_e \ge f^*} c_e.$$

Using a slight modification of the linear-time selection algorithm, we can find the weighted medians $p^*$ and $f^*$ in $O(|E|)$ time [4, 7, 14]. After obtaining the weighted medians $p^*$ and $f^*$, the sets $\{e^F_i\}_{i=0}^{k^F-1}$ and $\{e^P_i\}_{i=0}^{k^P-1}$ of selected edges can be decided in two passes over the set of all edges. We describe this procedure for the greedy profit strategy. Let $E^P$ denote the set of edges selected by the greedy profit rule. In one pass over $E$, we partition the edges into $\{e \mid p_e > p^*\}$, $\{e \mid p_e = p^*\}$, and $\{e \mid p_e < p^*\}$. All edges in $\{e \mid p_e > p^*\}$ can be added safely to $E^P$, and all edges in $\{e \mid p_e < p^*\}$ can be disregarded by the greedy rule. During the first pass, the remainder weight $R = W - \sum_{e : p_e > p^*} c_e$ is computed. The second pass over $\{e \mid p_e = p^*\}$ can be performed in arbitrary order until the remainder weight $R$ is exhausted. Therefore, the running time is $O(|E|)$. The greedy efficiency rule can be implemented in a similar manner. The pseudocode for the full solver is provided in Algorithm 1.

Algorithm 1 Pseudocode for the GSK problem solver.

Input: G(V, E), profits {p_e | e ∈ E}, costs {c_e | e ∈ E}, weight constraint W.
Output: Ê
 1: E^P ← ∅
 2: S(P) ← 0                 ▷ greedy profit rule solution value
 3: R(P) ← W                 ▷ remainder weight
 4: Set p* to the computed weighted median profit
 5: Initialize array T(P)
 6: for e ∈ E do             ▷ O(|E|)
 7:   if p_e > p* then
 8:     add e to E^P
 9:     S(P) ← S(P) + p_e
10:     R(P) ← R(P) − c_e
11:   else if p_e = p* then
12:     add e to T(P)
13: add e ∈ T(P) to E^P until R(P) is exhausted
14: for e ∈ E do
15:   compute efficiency f_e ← p_e / c_e
16: E^F ← ∅
17: S(F) ← 0                 ▷ greedy efficiency rule solution value
18: R(F) ← W                 ▷ remainder weight
19: Set f* to the computed weighted median efficiency
20: Initialize array T(F)
21: for e ∈ E do             ▷ O(|E|)
22:   if f_e > f* then
23:     add e to E^F
24:     S(F) ← S(F) + p_e
25:     R(F) ← R(F) − c_e
26:   else if f_e = f* then
27:     add e to T(F)
28: add e ∈ T(F) to E^F until R(F) is exhausted
29: if S(P) ≥ S(F) then
30:   return E^P
31: else
32:   return E^F
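A runnable Python sketch of Algorithm 1 follows. For brevity it finds the weighted medians by sorting ($O(|E| \log |E|)$) rather than by linear-time selection, and the edges are identified by index; the two-pass partitioning then mirrors the pseudocode.

```python
# Sketch of Algorithm 1: greedy profit rule and greedy efficiency rule via
# weighted medians, returning whichever rule achieves the larger total profit.

def weighted_median(keys, costs, W):
    """Key k* such that strictly larger keys cost < W, while including
    ties reaches at least W (smallest key if everything fits)."""
    order = sorted(range(len(keys)), key=lambda i: keys[i], reverse=True)
    spent = 0.0
    for i in order:
        spent += costs[i]
        if spent >= W:
            return keys[i]
    return min(keys)  # total cost below W: every item fits

def greedy_rule(keys, profits, costs, W):
    k_star = weighted_median(keys, costs, W)
    chosen = [i for i in range(len(keys)) if keys[i] > k_star]   # first pass
    remainder = W - sum(costs[i] for i in chosen)
    for i in range(len(keys)):                                   # second pass: ties
        if keys[i] == k_star and costs[i] <= remainder:
            chosen.append(i)
            remainder -= costs[i]
    return chosen

def gsk_solver(profits, costs, W):
    eff = [p / c for p, c in zip(profits, costs)]
    e_p = greedy_rule(profits, profits, costs, W)  # greedy profit rule
    e_f = greedy_rule(eff, profits, costs, W)      # greedy efficiency rule
    s_p = sum(profits[i] for i in e_p)
    s_f = sum(profits[i] for i in e_f)
    return e_p if s_p >= s_f else e_f

print(gsk_solver([5.0, 4.0, 3.0], [4.0, 1.0, 1.0], 2.0))  # [1, 2]
```

In the example, the highest-profit edge is too costly, so the profit rule stalls while the efficiency rule fills the budget; the solver returns the efficiency-rule solution.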

4 Constructing New Sparsification Schemes

We now discuss approaches to set the edge profits $\{p_e \mid e \in E\}$ and costs $\{c_e \mid e \in E\}$ in the GSK formulation. We have two goals in mind: First, the profits should attempt to preserve a structural property.

Algorithm 2 DistributeCentrality

Input: network G(V, E), centrality {s(u) | u ∈ V}
Output: edge scores {s(u, v) | u ∈ V, ⟨u, v⟩ ∈ E}
1: if G is undirected then
2:   for e ∈ E do
3:     u, v ← end points of e
4:     create directional edges ⟨u, v⟩ and ⟨v, u⟩
5: for u ∈ V do
6:   for each edge ⟨u, v⟩ attached to u do
7:     s(u, v) ← ( d(v) / Σ_{⟨u,z⟩∈E} d(z) ) · s(u)

Second, we want to reuse precomputed vertex/edge centrality scores or other information available a priori in order to set the edge costs. For example, in the case of online social network analysis, suppose the PageRank scores for each vertex have already been computed. How can we use this information to sparsify the graph and perform additional network analysis tasks? Ideally, we might wish that after sparsification, the set of top-ranked vertices using PageRank remains the same, and additionally that other properties such as the clustering coefficient distribution or community structure are also preserved. More generally, given a set of precomputed vertex centrality scores $\{s(u) \mid u \in V\}$, we devise a heuristic to construct a set of edge scores $\{s(u, v) \mid u \in V, \langle u, v\rangle \in E\}$. If our primary interest is to preserve the centrality rankings, then the edge scores could be used directly as profits. Otherwise, the edge scores could be transformed into costs, which would discriminate the edges based on their contribution to the vertex centralities in the original network.

We propose a simple method for constructing edge scores that works on both directed and undirected graphs. For directed graphs, at each vertex $u \in V$, the centrality score $s(u)$ is distributed among its outgoing (or incoming) edges $\{\langle u, v\rangle \in E\}$. Each outgoing edge $e = \langle u, v\rangle$ obtains a fraction of $s(u)$ proportional to $d(v) / \sum_{\langle u,z\rangle \in E} d(z)$. The intuition for distributing the centrality scores over outgoing edges is that the influence of a vertex is more likely to propagate along its high-degree neighbors. For undirected graphs, we duplicate each edge $e \in E$ to create two directed links $\langle u, v\rangle$ and $\langle v, u\rangle$. This operation enlarges the number of knapsack decision variables from $|E|$ to $2|E|$. After solving problem (GSK), an edge $e$ is retained in the simplified network if at least one of $\langle u, v\rangle$ and $\langle v, u\rangle$ is selected in the knapsack solution. We summarize this process of assigning edge scores in Algorithm 2.

Let $\mathrm{DPR}_{ij}$ denote the score on edge $\langle i, j\rangle$ when Algorithm 2 is applied to generate edge scores based on PageRank. Since $\mathrm{DPR}_{ij} \in (0, 1]$, $1 - \mathrm{DPR}_{ij}$ can be used as an edge cost, so that an edge that is weak in the sense of its contribution to PageRank has a large cost in problem (GSK). We extend the Jaccard score-based Local Similarity and Global Similarity methods of Satuluri et al. [22] to use $(1 - \mathrm{DPR}_{ij})$ as edge costs, enforcing the constraint

$$\sum_{i \in V} \sum_{\langle i,j\rangle \in \hat{E}} (1 - \mathrm{DPR}_{ij}) \le W,$$

where $W$ is a user-supplied parameter. $W$ is set to be a percentage of $2|E|$, since $\sum_{i \in V} \sum_{\langle i,j\rangle \in E} (1 - \mathrm{DPR}_{ij}) = 2|E| - 1$. We also devise a new method that uses $\mathrm{DPR}_{ij}$ as the profit for edge $\langle i, j\rangle$, and enforces the unit-cost constraint $\sum_{i \in V} \sum_{\langle i,j\rangle \in \hat{E}} 1 \le W$. The motivation of this method is to preserve the vertices with top PageRank scores after sparsification.
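Assembling such a combined instance is mechanical once the Jaccard and DPR scores exist. The sketch below assumes both are precomputed dictionaries over directed edges (the names `jac` and `dpr` and all values are illustrative).

```python
# Sketch of a JGPR-style (GSK) instance: Jaccard scores as profits,
# (1 - DPR_ij) as costs, and W as a fraction of 2|E| (directed edge count).

def build_gsk_instance(directed_edges, jac, dpr, budget_fraction):
    profits = {e: jac[e] for e in directed_edges}
    costs = {e: 1.0 - dpr[e] for e in directed_edges}
    W = budget_fraction * len(directed_edges)  # len == 2|E| after duplication
    return profits, costs, W

edges = [(0, 1), (1, 0)]                 # one undirected edge, duplicated
jac = {(0, 1): 0.5, (1, 0): 0.5}
dpr = {(0, 1): 0.25, (1, 0): 0.5}
profits, costs, W = build_gsk_instance(edges, jac, dpr, budget_fraction=0.5)
print(costs[(0, 1)], W)  # 0.75 1.0
```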

Table 1: Graphs used in empirical evaluation.

Name                 | |V| (×10^6) | |E| (×10^6)
---------------------|-------------|------------
com-Amazon [30]      | 0.335       | 0.9
com-DBLP [30]        | 0.32        | 1.05
com-Youtube [30]     | 1.13        | 3
roadNet-CA [17]      | 1.97        | 5.53
as-Skitter [17]      | 1.7         | 11
com-LiveJournal [30] | 4           | 34.68
com-Orkut [30]       | 3.07        | 117

5 Empirical Evaluation

In this section, we empirically evaluate six sparsification methods that can be expressed in the form of either problem (GSK) or its simpler unconstrained version. We used seven sparse graphs from the Stanford large network dataset collection (SNAP) [26], listed in Table 1. We implemented the approximate knapsack solution scheme in C++ and verified correctness on several test instances. The six sparsification methods considered are summarized in Tables 2 and 3. All experiments were run on a single server of Cyberstar, a Penn State compute cluster. The server we ran our programs on is a dual-socket quad-core Intel Nehalem system (Intel Xeon X5550 processor) with 32 GB main memory.

5.1 Solver Performance. We evaluate the performance of the solver for the problem (GSK). Unlike the exact pseudopolynomial dynamic programming approach, the running time of Algorithm 1 is independent of the user-supplied weight parameter $W$ and the actual settings of the edge profits $\{p_e \mid e \in E\}$ and costs $\{c_e \mid e \in E\}$. We computed the execution time of the solver for the seven graphs in Table 1. The graph sizes differ by nearly two orders of magnitude, going from com-Amazon to com-Orkut. Figure 1 gives the running times. As the number of graph edges increases, the execution time appears to increase linearly. This empirical evaluation is one way of demonstrating the practical efficacy of the solver. Note that this figure gives only the solver execution time, and does not include the time taken to compute profits and costs. Our current implementations of PageRank and Jaccard scores are straightforward and untuned. For com-DBLP, the running times of JLPR for PageRank computation (costs), Jaccard score computation (profits), and the greedy approximation are 0.21, 2.28, and 2.1 seconds, respectively.

[Figure 1: Solver running time on various graphs (x-axis: number of edges in millions; y-axis: running time in seconds).]

5.2 Comparing Sparsification Methods. We now apply the solver to the six problem formulations in Tables 2 and 3. In addition, we sparsify the graphs using uniform random edge (RE) sampling as a baseline. The abbreviations used in the figures correspond to the methods in Tables 2 and 3. The knapsack bound $W$ is user-supplied. We set the problem parameters to achieve various sparsification ratios $|\hat{E}|/|E|$. We use three structural properties to evaluate the sparsification schemes: the vertex degree distribution, the top-ranked PageRank vertices, and the average clustering coefficient. We use Spearman's rank correlation coefficient [27] to compute the correlation between vertex degree rankings in the sparsified graph and in the original graph. This value will be close to 1 if the degree rankings are highly correlated. For PageRank, we evaluate the methods based on the proportion of the top 10% PageRank-ordered vertices in the original graph that are preserved after sparsification. We report experimental results on com-Amazon, com-DBLP, com-Youtube, and com-LiveJournal.
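The degree-ranking criterion can be sketched with a hand-rolled Spearman correlation (no tie correction; in practice one might use scipy.stats.spearmanr instead). The degree sequences below are made up for illustration.

```python
# Spearman's rank correlation [27] between two degree sequences, via the
# classic formula rho = 1 - 6*sum(d^2) / (n*(n^2 - 1)); no tie handling.

def spearman(x, y):
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

orig_deg = [5, 3, 2, 1]      # degrees in G (hypothetical)
sparse_deg = [4, 3, 2, 1]    # degrees after sparsification (hypothetical)
print(spearman(orig_deg, sparse_deg))  # 1.0: the degree ranking is fully preserved
```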


Table 2: Problem (GSK)-based graph sparsification strategies used in empirical evaluation.

Label | p(i, j)                                   | Constraint                              | Motivation
------|-------------------------------------------|-----------------------------------------|---------------------------
JLPR  | (1/d_i) Σ_{k:⟨i,k⟩∈E} 1{Jac_ik ≤ Jac_ij}  | Σ_{i∈V} Σ_{⟨i,j⟩∈Ê} (1 − DPR_ij) ≤ W    | local similarity + PageRank
JGPR  | Jac_ij                                    | Σ_{i∈V} Σ_{⟨i,j⟩∈Ê} (1 − DPR_ij) ≤ W    | similarity + PageRank
PR    | DPR_ij                                    | Σ_{i∈V} Σ_{⟨i,j⟩∈Ê} 1 ≤ W               | PageRank
JG    | Jac_ij                                    | Σ_{i∈V} Σ_{⟨i,j⟩∈Ê} 1 ≤ W               | similarity [22]

Table 3: Unconstrained sparsification strategies used in empirical evaluation.

Label | p(i, j)                                          | Motivation
------|--------------------------------------------------|---------------------
DL    | retain top ⌈d_i^α⌉ edges ranked by Deg_ij locally | hub vertices [19]
JL    | retain top ⌈d_i^α⌉ edges ranked by Jac_ij locally | local similarity [22]

[Figure 2: Spearman's rank correlation coefficient of degree between the original graph and the sparsified graph, at various sparsification thresholds (50%–100% of retained edges), for com-Amazon, com-DBLP, com-LiveJournal, and com-Youtube; methods RE, PR, DL, JLPR, JGPR.]

Figure 2 plots the Spearman's correlation of vertex degree rankings at various sparsification ratios. There is no single method that consistently outperforms the others. In [19], the authors perform a similar experiment, for vertex degree ranking using Spearman's rank correlation coefficient, over a large collection of social networks. Their results show that DL and RE consistently outperform JL. In our experiments, RE, PR, DL, and JLPR all perform qualitatively similarly on the social network com-LiveJournal. However, the performance diverges on the other graphs. PR's performance is also noteworthy: it performs significantly worse than RE for com-Amazon and com-DBLP, but better than all other methods for com-Youtube.

[Figure 3: The fraction of vertices that overlap in the top 10% PageRank-ordered vertex sets in the original and the sparsified graph, at various sparsification thresholds, for com-Amazon, com-DBLP, com-LiveJournal, and com-Youtube; methods RE, PR, DL, JLPR, JGPR.]

[Figure 4: Comparing Jaccard similarity-based sparsifier performance (JL, JLPR, JG, JGPR) on the com-LiveJournal graph.]

Results showing the overlap ratio of the top 10% PageRank-ordered vertices after applying a sparsification method are shown in Figure 3. As expected, PR consistently outperforms the other methods, since in PR, the edge profit DPR is based on the vertex PageRank scores. DL and JLPR give similar results for the overlap ratio in all four graphs, and slightly underperform PR. JGPR sparsification significantly distorts the top 10% PageRank-ordered vertex set in all experiments.
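The overlap metric used in Figure 3 is straightforward to compute once per-vertex scores are available. In the sketch below the score dictionaries are stand-in values, not actual PageRank output.

```python
# Fraction of the original graph's top-10% (by score) vertices that remain
# in the sparsified graph's top-10%, i.e., the overlap ratio of Figure 3.

def topk_overlap(orig_scores, sparse_scores, fraction=0.1):
    k = max(1, int(len(orig_scores) * fraction))
    top = lambda s: set(sorted(s, key=s.get, reverse=True)[:k])
    return len(top(orig_scores) & top(sparse_scores)) / k

orig = {v: 1.0 / (v + 1) for v in range(20)}  # vertices 0 and 1 ranked highest
sparse = dict(orig)
sparse[0] = 0.0                               # vertex 0 drops after sparsification
print(topk_overlap(orig, sparse, fraction=0.1))  # 0.5: one of the top two survives
```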

The comparison of JL, JLPR, JG, and JGPR on the Spearman's correlation of degrees and the overlap ratio of the top 10% PageRank vertices is displayed in Figure 4. For all the graphs, the difference between JL and JLPR, and the difference between JG and JGPR, is very small. Hence we do not show JL and JG results in Figures 2 and 3.

We report the deviation of the average clustering coefficient (Ave-CC) from the original graph after sparsification on com-DBLP and com-Amazon in Figure 5. The deviation is computed as $\text{Ave-CC}(\hat{G}) - \text{Ave-CC}(G)$. The average clustering coefficients of com-DBLP and com-Amazon are 0.6324 and 0.3967, respectively. These two graphs have the largest average clustering coefficients among the ones listed in Table 1. Surprisingly, PR causes the least deviation of the average clustering coefficient on com-DBLP. JLPR and JGPR, which aim to preserve intra-cluster edges, produce positive deviations at some sparsification thresholds on com-Amazon. RE reduces the average clustering coefficient linearly with a decreasing number of retained edges, an observation also pointed out in [19].
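The deviation criterion can be sketched with a hand-rolled average clustering coefficient (vertices of degree below 2 contribute 0, the usual convention; networkx.average_clustering would serve the same purpose on larger graphs). The two tiny graphs are illustrative.

```python
# Average clustering coefficient on adjacency sets, and its deviation
# Ave-CC(G_hat) - Ave-CC(G) after an edge is removed.

def avg_cc(adj):
    total = 0.0
    for u, nbrs in adj.items():
        d = len(nbrs)
        if d < 2:
            continue  # degree-0/1 vertices contribute 0
        # count edges among u's neighbors (each unordered pair once)
        links = sum(1 for v in nbrs for w in nbrs if v < w and w in adj[v])
        total += 2.0 * links / (d * (d - 1))
    return total / len(adj)

triangle = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
path = {0: {1}, 1: {0, 2}, 2: {1}}           # triangle with edge (0, 2) removed
print(avg_cc(path) - avg_cc(triangle))       # -1.0: one removal destroys all closure
```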

6 Conclusions and Future Work

In this work, we explore using the knapsack problem for graph edge sparsification. Sparsifying large graphs aids in graph visualization and serves as a speedup technique for graph computations such as community detection and centrality analysis. Our proposed knapsack-based sparsification permits both global and local centrality-based edge filtering, and additionally lets us specify fine-grained linear constraints. We implement a greedy linear-time approximation solution scheme for this problem. Our preliminary empirical evaluation looks at two new formulations and compares them to known edge sparsification strategies using three criteria. We plan to perform a detailed empirical evaluation in future work, by expanding the collection of input graphs, the evaluation criteria (studying sparsification impact on other topological properties; studying the impact of the greedy solver on sparsification quality), and the methods chosen (including advanced sampling-based methods, as well as other new knapsack problem combinations). We will also attempt to identify cases where a general nonlinear submodular function can be approximated as a piecewise linear function. For cases where such approximations are possible, our knapsack solver can be readily used. We also hope to demonstrate, in future work, the use of our sparsification schemes for scalable graph visualization and as a general speedup strategy.

Acknowledgments

We thank the reviewers for their thoughtful comments and suggestions to improve the paper. This work was supported in part through instrumentation funded by the National Science Foundation through grant OCI-0821527. This work is also supported by National Science Foundation grants ACI-1253881 and CCF-1439057.

References

[1] E. Balas and E. Zemel. An algorithm for large zero-one knapsack problems. Operations Research, 28(5):1130–1154, 1980.

[2] R. Bellman. Notes on the theory of dynamic programming IV: Maximization over discrete sets. Naval Research Logistics Quarterly, 3(1-2):67–70, 1956.

[3] A. A. Benczur and D. R. Karger. Randomized approximation schemes for cuts and flows in capacitated graphs. SIAM Journal on Computing, 44(2):290–319, 2015.

[4] M. Blum, R. W. Floyd, V. Pratt, R. L. Rivest, and R. E. Tarjan. Time bounds for selection. Journal of Computer and System Sciences, 7(4):448–461, 1973.

[5] F. Bonchi, G. D. F. Morales, A. Gionis, and A. Ukkonen. Activity preserving graph simplification. Data Mining and Knowledge Discovery, 27(3):321–343, 2013.

[6] K. M. Bretthauer and B. Shetty. The nonlinear knapsack problem – algorithms and applications. European Journal of Operational Research, 138(3):459–472, 2002.

[7] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms (3rd ed.). MIT Press, 2009.

[8] D. Fayard and G. Plateau. An algorithm for the solution of the 0–1 knapsack problem. Computing, 28(3):269–287, 1982.

[9] N. J. Foti, J. M. Hughes, and D. N. Rockmore. Nonparametric sparsification of complex multiscale networks. PLoS ONE, 6(2):e16431, 2011.

[Figure 5: Deviation of average clustering coefficient at various sparsification thresholds, on com-DBLP and com-Amazon; methods RE, PR, DL, JLPR, JGPR.]

[10] W. S. Fung, R. Hariharan, N. J. Harvey, and D. Panigrahi. A general framework for graph sparsification. In Proc. 43rd Annual ACM Symp. on Theory of Computing (STOC), pages 71–80. ACM, 2011.

[11] O. H. Ibarra and C. E. Kim. Fast approximation algorithms for the knapsack and sum of subset problems. Journal of the ACM, 22(4):463–468, 1975.

[12] H. Kellerer and U. Pferschy. A new fully polynomial time approximation scheme for the knapsack problem. Journal of Combinatorial Optimization, 3(1):59–71, 1999.

[13] J. M. Kleinberg and É. Tardos. Algorithm Design. First edition, 2006.

[14] B. Korte and J. Vygen. Combinatorial Optimization: Theory and Algorithms. Fourth edition, 2008.

[15] A. Krause and D. Golovin. Submodular function maximization. Tractability: Practical Approaches to Hard Problems, 3:19, 2012.

[16] E. L. Lawler. Fast approximation algorithms for knapsack problems. Mathematics of Operations Research, 4(4):339–356, 1979.

[17] J. Leskovec, J. Kleinberg, and C. Faloutsos. Graphs over time: densification laws, shrinking diameters and possible explanations. In Proc. 11th ACM SIGKDD Int'l. Conf. on Knowledge Discovery and Data Mining (KDD), pages 177–187. ACM, 2005.

[18] J. Leskovec, A. Krause, C. Guestrin, C. Faloutsos, J. VanBriesen, and N. Glance. Cost-effective outbreak detection in networks. In Proc. 13th ACM SIGKDD Int'l. Conf. on Knowledge Discovery and Data Mining (KDD), pages 420–429. ACM, 2007.

[19] G. Lindner, C. L. Staudt, M. Hamann, H. Meyerhenke, and D. Wagner. Structure-preserving sparsification of social networks. In Proc. IEEE/ACM Int'l. Conf. on Advances in Social Networks Analysis and Mining (ASONAM), pages 448–454. IEEE/ACM, 2015.

[20] D. Pisinger. Linear time algorithms for knapsack problems with bounded weights. Journal of Algorithms, 33(1):1–14, 1999.

[21] S. Sahni. Approximate algorithms for the 0/1 knapsack problem. Journal of the ACM, 22(1):115–124, 1975.

[22] V. Satuluri, S. Parthasarathy, and Y. Ruan. Local graph sparsification for scalable clustering. In Proc. ACM SIGMOD Int'l. Conf. on Management of Data (SIGMOD), pages 721–732. ACM, 2011.

[23] M. A. Serrano, M. Boguna, and A. Vespignani. Extracting the multiscale backbone of complex weighted networks. Proceedings of the National Academy of Sciences (PNAS), 106(16):6483–6488, 2009.

[24] T. C. Sharkey, H. E. Romeijn, and J. Geunes. A class of nonlinear nonseparable continuous knapsack and multiple-choice knapsack problems. Mathematical Programming, 126(1):69–96, 2011.

[25] P. B. Slater. Multiscale network reduction methodologies: Bistochastic and disparity filtering of human migration flows between 3,000+ US counties. arXiv preprint arXiv:0907.2393, 2009.

[26] SNAP: Stanford large network dataset collection. http://snap.stanford.edu/data/index.html, last accessed Jan 2016.

[27] C. Spearman. The proof and measurement of association between two things. The American Journal of Psychology, 15(1):72–101, 1904.

[28] D. A. Spielman and S.-H. Teng. Spectral sparsification of graphs. SIAM Journal on Computing, 40(4):981–1025, 2011.

[29] B. Wilder and G. Sukthankar. Sparsification of social networks using random walks. In Proc. 8th ASE Int'l. Conf. on Social Computation (SocialCom). ASE, 2015.

[30] J. Yang and J. Leskovec. Defining and evaluating network communities based on ground-truth. Knowledge and Information Systems, 42(1):181–213, 2015.


Trust from the past: Bayesian Personalized Ranking based Link Prediction in Knowledge Graphs

Baichuan Zhang* Sutanay Choudhury† Mohammad Al Hasan‡ Xia Ning‡ Khushbu Agarwal† Sumit Purohit† Paola Pesantez Cabrera§

Abstract

Link prediction, or predicting the likelihood of a link in a knowledge graph based on its existing state, is a key research task. It differs from a traditional link prediction task in that the links in a knowledge graph are categorized into different predicates, and the link prediction performance of different predicates in a knowledge graph generally varies widely. In this work, we propose a latent feature embedding based link prediction model which considers the prediction task for each predicate disjointly. To learn the model parameters, it utilizes a Bayesian personalized ranking based optimization technique. Experimental results on large-scale knowledge bases such as YAGO2 show that our link prediction approach achieves substantially higher performance than several state-of-the-art approaches. We also show that, for a given predicate, the topological properties of the knowledge graph induced by the given predicate edges are key indicators of the link prediction performance of that predicate in the knowledge graph.

1 Introduction

A knowledge graph is a repository of information about entities, where entities can be any thing of interest such as people, locations, organizations, or even scientific topics, concepts, etc. An entity is frequently characterized by its association with other entities. As an example, capturing the knowledge about a company involves listing its products, location, and key individuals. Similarly, knowledge about a person involves her name, date and place of birth, affiliation with organizations, etc.

*Department of Computer and Information Science, Indiana University - Purdue University Indianapolis, USA, [email protected]. The work was conducted during the author's internship at Pacific Northwest National Laboratory.

†Pacific Northwest National Laboratory, Richland, WA, USA, {Sutanay.Choudhury,Khushbu.Agarwal,Sumit.Purohit}@pnnl.gov

‡Department of Computer and Information Science, Indiana University - Purdue University Indianapolis, USA, {alhasan,xning}@cs.iupui.edu

§Department of Computer Science, Washington State University, Pullman, WA, USA, [email protected]

Resource Description Framework (RDF) is a frequent choice for capturing the interactions between two entities. An RDF dataset is equivalent to a heterogeneous graph, where each vertex and edge can belong to different classes. The class information captures taxonomic hierarchies between the types of various entities and relations. As an example, a knowledge graph may identify Kobe Bryant as a basketball player, while its ontology will indicate that a basketball player is a particular type of athlete. Thus, one will be able to query for famous athletes in the United States and find Kobe Bryant.

The past few years have seen a surge in research on knowledge representations and algorithms for building knowledge graphs. For example, Google Knowledge Vault [6] and IBM Watson [9] are comprehensive knowledge bases which are built in order to answer questions from the general population. As evident from these works, building a domain-specific knowledge graph requires a multitude of efforts: triple extraction from natural language text, entity and relationship mapping [25], and link prediction [21]. Specifically, triples extracted from text data sources using state-of-the-art techniques such as OpenIE [8] and semantic role labeling [5] are extremely noisy, and simply adding noisy triple facts into a knowledge graph destroys its purpose. So computational methods must be devised for deciding which of the extracted triples are worthy of insertion into a knowledge graph. There are several considerations for this decision making: (1) trustworthiness of the data sources; (2) a belief value reported by a natural language processing engine expressing its confidence in the correctness of parsing; and (3) prior knowledge of subjects and objects. This particular work is motivated by the third factor.

Link prediction in a knowledge graph is simply a machine learning approach for utilizing prior knowledge of subjects and objects, as available in the knowledge graph, for estimating the confidence of a candidate triple. Consider the following example: given a social media post "I wish Tom Cruise was the president of United States", a natural language processing engine will extract a triple


(“Tom Cruise”, “president of”, “United States”). On the other hand, a web crawler may find the fact that “Tom Cruise is president of Downtown Medical”, resulting in the triple (“Tom Cruise”, “president of”, “Downtown Medical”). Although we generally do not have any information about the trustworthiness of the sources, our prior knowledge of the entities mentioned in these triples will enable us to decide that the first of the above triples is possibly wrong. Link prediction provides a principled approach for such decision-making. Also note that, once we decide to add a triple to the knowledge graph, it is important to have a confidence value associated with it.

As we use a machine learning approach to compute the confidence of triple facts, it is important that we quantitatively understand the degree of accuracy of our prediction [28]. It is important because, for the same knowledge graph, the prediction accuracy level varies from predicate to predicate. As an example, predicting one's school or workplace can be a much harder task than predicting one's liking for a local restaurant. Therefore, given two predicates “worksAt” and “likes”, we expect to see widely varying accuracy levels. Also, the average accuracy levels vary widely from one knowledge graph to another. The desire to obtain a quantitative grasp on prediction accuracy is complicated by a number of reasons: (1) knowledge graphs constructed from web text or using machine reading approaches can have a very large number of predicates, which makes manual verification difficult [6]; (2) the creation of predicates, and the resultant graph structure, is strongly shaped by the ontology and by the conversion process used to generate RDF statements from a logical record in the data, so the same data source can be represented in very different models, and this leads to different accuracy levels for the same predicate; (3) the effectiveness of knowledge graphs has inspired their construction from every imaginable data source: product inventories (at retailers such as Wal-mart), online social networks (such as Facebook), and web pages (Google's Knowledge Vault). As we move from one data source to another, it is critical to understand what accuracy levels we can expect from a given predicate.

In this paper, we use a link prediction¹ approach for computing the confidence of a triple from the prior knowledge about its subject and object. Many works exist for link prediction [11] in social network analysis [4], but they differ from link prediction in a knowledge graph; for the former, all the links are semantically similar, but for the latter, based on the predicates, the semantics

¹We use link prediction and link recommendation interchangeably.

of the links differ widely. So, existing link prediction methods are not very suitable for this task. We build our link prediction method by borrowing solutions from recommender system research, which accept a user-item matrix and, for a given user-item pair, return a score indicating the likelihood of the user purchasing the item. Likewise, for a given predicate, we consider the set of subjects and objects as a user-item matrix and produce a real-valued score to measure the confidence of the given triple. For training the model we use a Bayesian personalized ranking (BPR) based embedding model [23], which has been a major work in recommender systems. In addition, we also study the performance of our proposed link prediction algorithm in terms of topological properties of the knowledge graph, and present a linear regression model to reason about its expected level of accuracy for each predicate.

Our contributions in this work are outlined below:

1. We implement a link prediction approach for estimating confidence for triples in a knowledge graph. Specifically, we borrow from successful approaches in the recommender systems domain, adapt the algorithms for knowledge graphs, and perform a thorough evaluation on a prominent benchmark dataset.

2. We propose a latent feature embedding based link recommendation model for the prediction task and utilize a Bayesian personalized ranking based optimization technique for learning models for each predicate (Section 4). Our experiments on the well-known YAGO2 knowledge graph (Section 5) show that the BPR approach outperforms other competing approaches for a significant set of predicates (Figure 1).

3. We apply a linear regression model to quantitatively analyze the correlation between the prediction accuracy for each predicate and the topological structure of the induced subgraph of the original knowledge graph. Our studies show that metrics such as the clustering coefficient or the average degree can be used to reason about the expected level of prediction accuracy (Section 5.3, Figure 2).

2 Related Work

There is a large body of work on link prediction in knowledge graphs. In terms of methodology, factorization based and related latent variable models [3, 7, 13, 22, 25], graphical models [14], and graph feature based methods [17, 18] have been considered.

There exists a large number of works which focus on factorization based models. The common thread


among the factorization methods is that they explain the triples via latent features of entities. [2] presents a tensor based model that decomposes each entity and predicate in a knowledge graph as a low-dimensional vector. However, such a method fails to consider the symmetry property of the tensor. In order to solve this issue, [22] proposes a relational latent feature model, RESCAL, an efficient approach which uses a tensor factorization model that takes the inherent structure of relational data into account. By leveraging relational domain knowledge about entity type information, [3] proposes a tensor decomposition approach for relation extraction in knowledge bases which is highly efficient in terms of time complexity. In addition, various other latent variable models, such as neural network based methods [6, 27], have been explored for the link prediction task. However, the major drawback of neural network based models is their complexity and computational cost in model training and parameter tuning. Many of these models require tuning a large number of parameters, so finding the right combination of these parameters is often considered more of an art than a science.

Recently, graphical models, such as Probabilistic Relational Models [10], Relational Markov Networks [29], and Markov Logic Networks [14, 24], have also been used for link prediction in knowledge graphs. For instance, [24] proposes a Markov Logic Network (MLN) based approach, which is a template language for defining potential functions on a knowledge graph by logical formulas. Despite their utility for modeling knowledge graphs, issues such as rule learning difficulty, tractability problems, and parameter estimation pose implementation challenges for MLNs.

Graph feature based approaches assume that the existence of an edge can be predicted by extracting features from the observed edges in the graph. Lao and Cohen [17, 18] propose the Path Ranking Algorithm (PRA), which performs random walks on the graph and computes the probability of each path. The main idea of PRA is to use these path probabilities as supervised features for each entity pair, and use any favorable classification model, such as logistic regression or SVM, to predict the probability of a missing edge between an entity pair in a knowledge graph.

It has been demonstrated [1] that no single approach emerges as a clear winner. Instead, the merits of factorization models and graph feature models are often complementary with each other. Thus, combining the advantages of different approaches for learning knowledge graphs is a promising option. For instance, [20] proposes to use an additive model, which is a linear combination of RESCAL and PRA. The combination not only decreases the training

time but also increases the accuracy. [15] combines a latent feature model with an additive term to learn from latent and neighborhood-based information on multi-relational data. [6] fuses the outputs of PRA and a neural network model as features for training a binary classifier. Our work strongly aligns with this combination approach. In this work, we build matrix factorization based techniques that have proved successful for recommender systems, and plan to incorporate graph based features in future work.

3 Background and Problem Statement

Definition 3.1. We define the knowledge graph as a collection of triple facts G = (S, P, O), where s ∈ S and o ∈ O are the set of subject and object entities and p ∈ P is the set of predicates or relations between them. G(s, p, o) = 1 if there is a direct link of type p from s to o, and G(s, p, o) = 0 otherwise.

Each triple fact in a knowledge graph is a statement interpreted as "A relationship p holds between entities s and o". For instance, the statement "Kobe Bryant is a player of LA Lakers" can be expressed by the following triple fact: ("Kobe Bryant", "playsFor", "LA Lakers").

Definition 3.2. For each relation p ∈ P, we define G_p(S_p, O_p) as a bipartite subgraph of G, where the corresponding sets of entities s_p ∈ S_p, o_p ∈ O_p are connected by relation p, namely G_p(s_p, o_p) = 1.
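As an illustration of Definition 3.2 (our own sketch, not from the paper's released code; the triple list below is hypothetical), the per-predicate bipartite subgraph G_p can be extracted from a triple store in a few lines:

```python
def bipartite_subgraph(triples, predicate):
    """Collect the subject/object pairs connected by `predicate`.

    `triples` is an iterable of (subject, predicate, object) tuples;
    returns the edge set plus the subject and object vertex sets of G_p.
    """
    edges, subjects, objects = set(), set(), set()
    for s, p, o in triples:
        if p == predicate:
            edges.add((s, o))
            subjects.add(s)
            objects.add(o)
    return edges, subjects, objects

# Hypothetical toy triple store.
triples = [
    ("Kobe Bryant", "playsFor", "LA Lakers"),
    ("Kobe Bryant", "bornIn", "Philadelphia"),
    ("Pau Gasol", "playsFor", "LA Lakers"),
]
edges, S_p, O_p = bipartite_subgraph(triples, "playsFor")
# edges == {("Kobe Bryant", "LA Lakers"), ("Pau Gasol", "LA Lakers")}
```

Each predicate thus yields its own edge set, which is the only input the per-predicate model M_p sees.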

Problem Statement: For every predicate p ∈ P and a given entity pair (s, o) in G_p, our goal is to learn a link recommendation model M_p such that x_{s,o} = M_p(s, o) is a real-valued score.

Because the produced real-valued score is not normalized, we compute the probability Pr(y^p_{s,o} = 1), where y^p_{s,o} is a binary random variable that is true iff G_p(s, o) = 1. We estimate this probability using the logistic function as follows:

(3.1) Pr(y^p_{s,o} = 1) = 1 / (1 + exp(−x_{s,o}))

Thus we interpret Pr(y^p_{s,o} = 1) as the probability that a vertex (or subject) s in the knowledge graph G is in a relationship of given type p with another vertex (or object) o.

4 Methods

In this section, we describe our model, namely a latent feature embedding model with a Bayesian Personalized Ranking (BPR) based optimization technique, which we propose for the task of link prediction in a knowledge graph. In our link prediction setting, for a given predicate p, we first construct its bipartite subgraph


G_p(S_p, O_p). Then we learn the optimal low-dimensional embeddings for its corresponding subject and object entities s_p ∈ S_p, o_p ∈ O_p by maximizing a ranking based distance function. The learning process relies on Stochastic Gradient Descent (SGD). The SGD based optimization technique iteratively updates the low-dimensional representations of s_p and o_p until convergence. Then the learned model is used for ranking the unobserved triple facts in descending order, such that triple facts with higher score values have a higher probability of being correct.

4.1 Latent Feature Based Embedding Model

For each predicate p, the model maps its corresponding subject and object entities s_p and o_p into low-dimensional continuous vector spaces, say U^p_s ∈ R^{1×K} and V^p_o ∈ R^{1×K} respectively. We measure the compatibility between subject s_p and object o_p as the dot product of their latent vectors, which is given below:

(4.2) x_{s_p,o_p} = U^p_s (V^p_o)^T + b^p_o

where U^p ∈ R^{|S|×K}, V^p ∈ R^{|O|×K}, and b^p ∈ R^{|O|×1}. |S| and |O| denote the numbers of subject and object entities associated with predicate p respectively, K is the number of latent dimensions, and b^p_o ∈ R is a bias term associated with object o. Given predicate p, the higher the score x_{s_p,o_p}, the more similar the entities s_p and o_p are in the embedded low-dimensional space, and the higher the confidence for including this triple fact in the knowledge base.
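As a concrete sketch (our own illustration with hypothetical array names, not the paper's code), Equation 4.2 and the logistic transform of Equation 3.1 amount to a dot product plus an object bias, squashed through a sigmoid:

```python
import numpy as np

def score(U, V, b, s, o):
    """x_{s,o} = U[s] . V[o] + b[o]  (Equation 4.2)."""
    return float(U[s] @ V[o] + b[o])

def link_probability(U, V, b, s, o):
    """Pr(y = 1) = 1 / (1 + exp(-x_{s,o}))  (Equation 3.1)."""
    return 1.0 / (1.0 + np.exp(-score(U, V, b, s, o)))

# Randomly initialized factors, as in the paper (mean 0, std 0.1).
rng = np.random.default_rng(0)
K, n_subjects, n_objects = 50, 100, 80
U = rng.normal(0.0, 0.1, size=(n_subjects, K))
V = rng.normal(0.0, 0.1, size=(n_objects, K))
b = rng.normal(0.0, 0.1, size=n_objects)
p = link_probability(U, V, b, s=3, o=7)  # a value in (0, 1)
```

The sigmoid only normalizes the score; rankings over candidate objects are identical whether one sorts by x or by Pr(y = 1).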

4.2 Bayesian Personalized Ranking

In collaborative filtering, positive-only data is known as implicit (or binary) feedback. For example, on an eCommerce platform, some users only buy items but do not rate them. Motivated by [23], we employ a Bayesian Personalized Ranking (BPR) based approach for model learning. Specifically, in the recommender system domain, given a user-item matrix, a BPR based approach assigns a user's preference for a purchased item a higher score than for an un-purchased item. Likewise, in our context, we assign observed triple facts a higher score than unobserved triple facts in the knowledge base. We assume that unobserved facts are not necessarily negative; rather, they are "less preferable" than the observed ones.

For our task, for each predicate p, we denote an observed subject/object entity pair as (s_p, o⁺_p) and an unobserved one as (s_p, o⁻_p). The observed facts in our case are the existing links between s_p and o_p in G_p, and the unobserved ones are the missing links between them. Given this, BPR maximizes the following ranking-based objective function:

(4.3) BPR = max_{Θ_p} Σ_{(s_p, o⁺_p, o⁻_p) ∈ D_p} ln σ(x_{s_p,o⁺_p} − x_{s_p,o⁻_p}) − λ_{Θ_p} ||Θ_p||²

where D_p is a set of samples generated from the training data for predicate p, with G_p(s_p, o⁺_p) = 1 and G_p(s_p, o⁻_p) = 0, and x_{s_p,o⁺_p} and x_{s_p,o⁻_p} are the predicted scores of subject s_p on objects o⁺_p and o⁻_p respectively. We use the proposed latent feature based embedding model shown in Equation 4.2 to compute x_{s_p,o⁺_p} and x_{s_p,o⁻_p}. The last term in Equation 4.3 is an l2-norm regularization term on the model parameters Θ_p = {U^p, V^p, b^p} to avoid overfitting in the learning process. In addition, the logistic function σ(·) in Equation 4.3 is defined as σ(x) = 1 / (1 + e^{−x}).

Notice that Equation 4.3 is differentiable, thus we employ the widely used SGD to maximize the objective. In particular, at each iteration, for a given predicate p, we sample one observed entity pair (s_p, o⁺_p) and one unobserved pair (s_p, o⁻_p) using a uniform sampling technique. Then we iteratively update the model parameters Θ_p based on the sampled pairs. Specifically, for each training instance, we compute the derivative and update the corresponding parameters Θ_p by walking along the ascending gradient direction.

For each predicate p, given a training triple (s_p, o⁺_p, o⁻_p), the gradient of the BPR objective in Equation 4.3 with respect to U^p_s, V^p_{o⁺}, V^p_{o⁻}, b^p_{o⁺}, b^p_{o⁻} can be computed as follows:

(4.4)
∂BPR/∂U^p_s = ∂ ln σ(x_{s_p,o⁺_p} − x_{s_p,o⁻_p}) / ∂U^p_s − 2λ^p_s U^p_s

  = [∂ ln σ(x_{s_p,o⁺_p} − x_{s_p,o⁻_p}) / ∂σ(x_{s_p,o⁺_p} − x_{s_p,o⁻_p})] × [∂σ(x_{s_p,o⁺_p} − x_{s_p,o⁻_p}) / ∂(x_{s_p,o⁺_p} − x_{s_p,o⁻_p})] × [∂(x_{s_p,o⁺_p} − x_{s_p,o⁻_p}) / ∂U^p_s] − 2λ^p_s U^p_s

  = [1 / σ(x_{s_p,o⁺_p} − x_{s_p,o⁻_p})] × σ(x_{s_p,o⁺_p} − x_{s_p,o⁻_p}) (1 − σ(x_{s_p,o⁺_p} − x_{s_p,o⁻_p})) × (V^p_{o⁺} − V^p_{o⁻}) − 2λ^p_s U^p_s

  = (1 − σ(x_{s_p,o⁺_p} − x_{s_p,o⁻_p})) (V^p_{o⁺} − V^p_{o⁻}) − 2λ^p_s U^p_s

We obtain the following using a similar chain rule derivation:

(4.5) ∂BPR/∂V^p_{o⁺} = (1 − σ(x_{s_p,o⁺_p} − x_{s_p,o⁻_p})) × U^p_s − 2λ^p_{o⁺} V^p_{o⁺}

(4.6) ∂BPR/∂V^p_{o⁻} = (1 − σ(x_{s_p,o⁺_p} − x_{s_p,o⁻_p})) × (−U^p_s) − 2λ^p_{o⁻} V^p_{o⁻}


(4.7) ∂BPR/∂b^p_{o⁺} = (1 − σ(x_{s_p,o⁺_p} − x_{s_p,o⁻_p})) × 1 − 2λ^p_{o⁺} b^p_{o⁺}

(4.8) ∂BPR/∂b^p_{o⁻} = (1 − σ(x_{s_p,o⁺_p} − x_{s_p,o⁻_p})) × (−1) − 2λ^p_{o⁻} b^p_{o⁻}

Next, the parameters are updated as follows:

(4.9) U^p_s = U^p_s + α × ∂BPR/∂U^p_s

(4.10) V^p_{o⁺} = V^p_{o⁺} + α × ∂BPR/∂V^p_{o⁺}

(4.11) V^p_{o⁻} = V^p_{o⁻} + α × ∂BPR/∂V^p_{o⁻}

(4.12) b^p_{o⁺} = b^p_{o⁺} + α × ∂BPR/∂b^p_{o⁺}

(4.13) b^p_{o⁻} = b^p_{o⁻} + α × ∂BPR/∂b^p_{o⁻}

where α is the learning rate.

4.3 Pseudo-code and Complexity Analysis

The pseudo-code of our proposed link prediction model is described in Algorithm 1. It takes the knowledge graph G and a specific target predicate p as input and generates the low-dimensional latent matrices U^p, V^p, b^p as output. Line 1 constructs the bipartite subgraph G_p of predicate p given the entire knowledge graph G. Lines 2-3 compute the numbers of subject and object entities, m and n, in the resultant bipartite subgraph G_p. Line 4 generates a collection of triple samples using a uniform sampling technique. Lines 5-7 initialize the matrices U^p, V^p, b^p using a Gaussian distribution with mean 0 and standard deviation 0.1, assuming all the entries in U^p, V^p, and b^p are independent. Lines 8-14 update the corresponding rows of the matrices U^p, V^p, b^p based on the sampled instance (s_p, o⁺_p, o⁻_p) in each iteration. As the sample generation step in line 4 happens before the model parameter learning, the convergence criterion of Algorithm 1 is to iterate over all the sampled triples in D_p.

Given the constructed G_p as input, the time complexity of the update rules shown in Equations 4.9-4.13 is O(cK), where K is the number of latent features. The total computational complexity of Algorithm 1 is then O(|D_p| · cK), where |D_p| is the total number of pre-sampled triples generated in line 4 of Algorithm 1.

Algorithm 1 Bayesian Personalized Ranking Based Latent Feature Embedding Model

Input: latent dimension K, G, target predicate p
Output: U^p, V^p, b^p

1: Given target predicate p and the entire knowledge graph G, construct its bipartite subgraph G_p
2: m = number of subject entities in G_p
3: n = number of object entities in G_p
4: Generate a set of training samples D_p = {(s_p, o⁺_p, o⁻_p)} using a uniform sampling technique
5: Initialize U^p as an m × K matrix with mean 0 and standard deviation 0.1
6: Initialize V^p as an n × K matrix with mean 0 and standard deviation 0.1
7: Initialize b^p as an n × 1 column vector with mean 0 and standard deviation 0.1
8: for all (s_p, o⁺_p, o⁻_p) ∈ D_p do
9:     Update U^p_s based on Equation 4.9
10:    Update V^p_{o⁺} based on Equation 4.10
11:    Update V^p_{o⁻} based on Equation 4.11
12:    Update b^p_{o⁺} based on Equation 4.12
13:    Update b^p_{o⁻} based on Equation 4.13
14: end for
15: return U^p, V^p, b^p
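Algorithm 1 can be sketched in NumPy as follows. This is our own illustration, not the paper's released code: the sampling helper is an assumption, while the hyperparameter values (α = 0.2, K = 50, λ = 0.005, Gaussian init with std 0.1) follow Sections 4.3 and 5.1.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_bpr(edges, m, n, K=50, alpha=0.2, lam=0.005, n_samples=10000, seed=0):
    """BPR-based latent feature embedding for one predicate (Algorithm 1).

    `edges` is the set of observed (subject, object) index pairs of G_p;
    m and n are the numbers of subject and object entities.
    """
    rng = np.random.default_rng(seed)
    # Lines 5-7: Gaussian initialization, mean 0, std 0.1.
    U = rng.normal(0.0, 0.1, size=(m, K))
    V = rng.normal(0.0, 0.1, size=(n, K))
    b = rng.normal(0.0, 0.1, size=n)
    subjects = [s for s, _ in edges]
    pos = {}  # subject -> list of its observed objects
    for s, o in edges:
        pos.setdefault(s, []).append(o)
    for _ in range(n_samples):  # line 4 sampling folded into lines 8-14
        s = subjects[rng.integers(len(subjects))]
        op = pos[s][rng.integers(len(pos[s]))]  # observed object o+
        on = int(rng.integers(n))               # uniform negative o-
        while (s, on) in edges:
            on = int(rng.integers(n))
        x_pos = U[s] @ V[op] + b[op]
        x_neg = U[s] @ V[on] + b[on]
        g = 1.0 - sigmoid(x_pos - x_neg)  # common factor in Eqs. 4.4-4.8
        # Equations 4.9-4.13: ascend the gradient of the BPR objective.
        U[s]  += alpha * (g * (V[op] - V[on]) - 2 * lam * U[s])
        V[op] += alpha * (g * U[s] - 2 * lam * V[op])
        V[on] += alpha * (g * (-U[s]) - 2 * lam * V[on])
        b[op] += alpha * (g - 2 * lam * b[op])
        b[on] += alpha * (-g - 2 * lam * b[on])
    return U, V, b
```

After training, an unobserved pair (s, o) is scored as `U[s] @ V[o] + b[o]` and candidate objects are ranked in descending score order. Note the negative-resampling loop assumes no subject is connected to every object.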

5 Experiments and Results

This section presents our experimental analysis of Algorithm 1 for thirteen unique predicates in the well-known YAGO2 knowledge graph [12]. We construct a model for each predicate and describe our evaluation strategies, including performance metrics and the selection of state-of-the-art methods for benchmarking, in Section 5.1. We aim to answer two questions through our experiments:

1. How does our approach compare with related work for link recommendation in knowledge graphs?

2. For a predicate p, can we reason about the performance of the link prediction model M_p in terms of the structural metrics of the bipartite graph G_p?

Table 1 shows the statistics of the various YAGO2 relations used in our experiments. # Subjects and # Objects represent the numbers of subject and object entities associated with the corresponding predicate. The last column of Table 1 shows the number of facts for each relation in YAGO2. We run all the experiments on a 2.1 GHz machine with 4 GB memory running the Linux operating system. The algorithms are implemented in Python along with the NumPy and SciPy libraries for linear algebra operations. The


[Figure 1: Link Recommendation Comparison on YAGO2 Relations. Panels: (a) HR, (b) ARHR, and (c) AUC comparison among different link recommendation methods.]

[Figure 2: Quantitative Analysis Between Graph Topology and Link Recommendation Model Performance. Panels plot HR, ARHR, and AUC (columns) against graph density (a)-(c), graph average degree (d)-(f), and clustering coefficient (g)-(i).]


Relation               # Subjects   # Objects   # of Facts in YAGO2
Import                        142          62           391
Export                        140         176           579
isInterestedIn                358         213           464
hasOfficialLanguage           583         214           964
dealsWith                     131         124           945
happenedIn                   7121        5526         12500
participatedIn               2330        7043         16809
isConnectedTo                2835        4391         33581
hasChild                    10758       12800         17320
influence                    8056        9153         25819
wroteMusicFor                5109       21487         24271
edited                        549        5673          5946
owns                         8330       24422         26536

Table 1: Statistics of Various Relations in the YAGO2 Dataset

software is available online for download.²

5.1 Experimental Setting

For our experiments, in order to demonstrate the performance of our proposed link prediction model, we use the YAGO2 dataset and several evaluation metrics for all compared algorithms. In particular, for each relation, we split the data into a training part, used for model training, and a test part, used for model evaluation. We apply a 5-time leave-one-out evaluation strategy, where for each subject we randomly remove one fact (one subject-object pair) and place it into the test set S_test, leaving the remainder in the training set S_train. For every subject, the trained model generates a size-N ranked list of recommended objects. The evaluation is conducted by comparing the recommendation list of each subject with the object entity of that subject in the test set. Grid search is applied to find the regularization parameters, and we set the parameters used in Section 4.2 to λ_s = λ_{o⁺} = λ_{o⁻} = 0.005. For the other model parameters, we fix the learning rate α = 0.2 and the number of latent factors K = 50. For the evaluation parameter, we set N = 10.

In order to illustrate the merit of our proposed approach, we compare our model with the following methods for link prediction in a knowledge graph. Since the problem we solve in this paper is similar to one-class item recommendation [23] in the recommender system domain, we consider the following state-of-the-art one-class recommendation methods as baseline approaches for comparison.

1. Random (Rand): For each relation, this method randomly selects a subject-object entity pair for the link recommendation task.

²https://sites.google.com/site/baichuanzhangpurdue/sample-code-and-dataset

2. Most Popular (MP): For each predicate in the knowledge base, this method presents a non-personalized ranked object list based on how often object entities are connected among all subject entities.

3. MF: The matrix factorization method proposed by [16], which uses a point-wise strategy for solving the one-class item recommendation problem.

During the model evaluation stage, we use three popular metrics, namely Hit Rate (HR), Average Reciprocal Hit-Rank (ARHR), and Area Under Curve (AUC), to measure the link recommendation quality of our proposed approach in comparison to the baseline methods. HR is defined as follows:

(5.14) HR = #hits / #subjects

where #subjects is the total number of subject entities in the test set, and #hits is the number of subjects whose object entity in the test set is recommended in the size-N recommendation list. The second evaluation metric, ARHR, which considers the ranking of the recommended object for each subject entity in the knowledge graph, is defined as:

(5.15) ARHR = (1 / #subjects) Σ_{i=1}^{#hits} (1 / p_i)

where, if an object of a subject is recommended for connection in the knowledge graph (which we call a hit in this scenario), p_i is the position of the object in the ranked recommendation list. As we can see, ARHR is a weighted version of HR, and it captures the position of the recommended object in the recommendation list.

The last metric, AUC, is defined as follows:

(5.16) AUC = (1 / #subjects) Σ_{s ∈ subjects} (1 / |E(s)|) Σ_{(o⁺,o⁻) ∈ E(s)} δ(x_{s,o⁺} > x_{s,o⁻})

where E(s) = {(o⁺, o⁻) | (s, o⁺) ∈ S_test ∧ (s, o⁻) ∉ (S_test ∪ S_train)}, and δ(·) is the indicator function.

For all three metrics, higher values indicate better model performance. Specifically, the trivial AUC of a random predictor is 0.5 and the best possible AUC is 1.
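To make the metric definitions concrete, here is a small sketch (our own illustration; the ranked lists and held-out objects below are hypothetical) computing HR (Equation 5.14) and ARHR (Equation 5.15) from per-subject top-N recommendation lists:

```python
def hit_rate_and_arhr(recommendations, test_object, N=10):
    """HR (Eq. 5.14) and ARHR (Eq. 5.15) over per-subject top-N lists.

    `recommendations` maps each subject to its ranked object list;
    `test_object` maps each subject to its held-out test object.
    """
    hits, reciprocal_ranks = 0, 0.0
    for s, ranked in recommendations.items():
        top_n = ranked[:N]
        if test_object[s] in top_n:
            hits += 1
            # p_i: 1-based position of the hit in the ranked list.
            reciprocal_ranks += 1.0 / (top_n.index(test_object[s]) + 1)
    n_subjects = len(recommendations)
    return hits / n_subjects, reciprocal_ranks / n_subjects

recs = {"s1": ["a", "b", "c"], "s2": ["d", "e", "f"], "s3": ["g", "h", "i"]}
held_out = {"s1": "b", "s2": "d", "s3": "z"}
hr, arhr = hit_rate_and_arhr(recs, held_out, N=3)
# hr = 2/3; arhr = (1/2 + 1/1) / 3 = 0.5
```

The example shows why ARHR is the weighted variant: both hits count equally in HR, but the rank-1 hit for "s2" contributes twice as much to ARHR as the rank-2 hit for "s1".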

5.2 YAGO2 Relation Prediction Performance

Figure 1 shows the average link prediction performance for YAGO2 relations using the various methods. Our proposed latent feature embedding approach


shows overall improvement compared with the other algorithms on most of the relations in YAGO2. For instance, for all the YAGO2 predicates used in the experiment, our proposed model consistently outperforms the MF based method, which corroborates the empirical experience that pairwise ranking based methods achieve much better performance than pointwise regression based methods given implicit feedback for the link recommendation task. Compared with the popularity based recommendation method MP, our method obtains better performance for most predicates. For example, for predicates such as “participate”, “connect”, “hasChild”, and “influence”, our proposed model achieves more than 10 times better performance in terms of both HR and ARHR. However, for several predicates such as “import”, “export”, and “language”, the MP based method performs the best among all the competing methods. The good performance of MP is owing to the semantic meaning of the specific predicate. For instance, “import” represents a Country/Product relation in YAGO2, which indicates that the types of its subject and object entities are geographic region and commodity respectively. For such a predicate, the most popular object entities, such as food, cloth, and fuel, are linked to most of the countries, which helps the MP based method obtain good link recommendation performance.

5.3 Analysis and Discussion

Figure 1 shows that the link prediction model performance varies widely from predicate to predicate in the YAGO2 knowledge base. For example, the HR of predicate “dealsWith” is significantly better than that of “owns”. Thus it is critical that we quantitatively understand the model performance across the various relations in a knowledge graph. Recall from the problem statement that, given a predicate p, our model M_p only accounts for the bipartite subgraph G_p. Motivated by [19], we study the impact of the resultant graph structure of G_p

on the performance of M_p.

For each predicate p, we compute several graph

topology metrics on its bipartite subgraph G_p, such as graph density, graph average degree, and clustering coefficient. Figure 2 shows the quantitative analysis between graph structure and the link prediction model performance of each predicate. In each subfigure, the x-axis represents the computed graph topology metric value of each predicate and the y-axis denotes our proposed link prediction model performance in terms of HR, ARHR, and AUC. Each cross point shown in blue represents one specific YAGO2 predicate used in our experiments. We then developed a linear regression model to understand the correlation between link prediction model performance and each graph metric. For each linear regression curve shown in red, we also report its slope, intercept, and correlation coefficient (rvalue) to capture the association trend.

From Figure 2, both graph density and average degree show a strong positive correlation with the proposed link prediction model's performance, as demonstrated by the rvalue. Since our approach is inspired by collaborative filtering for recommender systems, which accepts a user-item matrix as input, a higher density in the resultant graph of a predicate corresponds to a higher density in the user-item matrix, which naturally leads to better recommendation performance in the recommender system domain. A similar explanation applies to the average degree. The clustering coefficient, in contrast, shows a strong negative correlation with link prediction model performance; for instance, in terms of AUC, the rvalue is around -0.69. Since the clustering coefficient (cc) is the ratio of closed triples to the total number of triples in the graph, a smaller cc indicates a lower fraction of closed triples, i.e., more open triples. By the transitivity property of social graphs, which states that the friends of one's friends are themselves likely to be friends [26, 30], the links that close such open triples are relatively easy for a link prediction model to predict (i.e., hit), which leads to better link prediction performance.
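The slope, intercept, and rvalue reported for each regression curve follow from ordinary least squares on the (metric, performance) pairs. A minimal stdlib sketch of that fit (the paper does not specify its regression implementation; this is equivalent to the three quantities returned by scipy.stats.linregress):

```python
import math

def linregress_simple(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept,
    plus the Pearson correlation coefficient (rvalue)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)          # variance of x (unnormalized)
    syy = sum((y - my) ** 2 for y in ys)          # variance of y (unnormalized)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))  # covariance
    slope = sxy / sxx
    intercept = my - slope * mx
    rvalue = sxy / math.sqrt(sxx * syy)
    return slope, intercept, rvalue
```

An rvalue near +1 or -1 indicates a strong linear association, matching the sign convention used in Figure 2 (positive for density and average degree, negative for the clustering coefficient).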

6 Conclusion and Future Work

Inspired by the success of collaborative filtering algorithms for recommender systems, we propose a latent feature based embedding model for the task of link prediction in a knowledge graph. Our proposed method provides a measure of "confidence" for adding a triple to the knowledge graph. We evaluate our implementation on the well-known YAGO2 knowledge graph. The experiments show that our Bayesian Personalized Ranking based latent feature embedding approach achieves better performance than two state-of-the-art recommender system models: Most Popular and Matrix Factorization. We also develop a linear regression model to quantitatively study the correlation between the performance of the link prediction model itself and various topological metrics of the graph from which the model is constructed. The regression analysis shows a strong correlation between link prediction performance and graph topological features such as graph density, average degree, and clustering coefficient.
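As a rough illustration of the BPR core summarized above, the following sketch performs stochastic pairwise updates over sampled (subject, positive object, negative object) triples with a plain dot-product scorer. The hyper-parameters, sampling scheme, and names here are illustrative assumptions, not the paper's configuration:

```python
import math
import random

def train_bpr(observed, n_subjects, n_objects, dim=4,
              lr=0.05, reg=0.01, iters=3000, seed=0):
    """Bayesian Personalized Ranking via SGD for one predicate.
    `observed` is a set of (subject, object) links; returns a scorer."""
    rng = random.Random(seed)
    S = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(n_subjects)]
    O = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(n_objects)]
    pos = list(observed)

    def score(s, o):
        return sum(a * b for a, b in zip(S[s], O[o]))

    for _ in range(iters):
        s, i = rng.choice(pos)           # an observed (positive) link
        j = rng.randrange(n_objects)     # uniformly sampled candidate negative
        if (s, j) in observed:
            continue
        x = score(s, i) - score(s, j)    # pairwise preference margin
        g = 1.0 / (1.0 + math.exp(x))    # derivative of ln sigmoid(x)
        for f in range(dim):
            sf, pf, nf = S[s][f], O[i][f], O[j][f]
            S[s][f] += lr * (g * (pf - nf) - reg * sf)
            O[i][f] += lr * (g * sf - reg * pf)
            O[j][f] += lr * (-g * sf - reg * nf)
    return score
```

Each update nudges the model so that observed links score higher than sampled unobserved ones, which is exactly the pairwise-ranking objective that the experiments contrast with pointwise matrix factorization.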

For a given predicate, we build link prediction models solely based on the bipartite subgraph of the original knowledge graph. However, as real-world experience suggests, the existence of a relation between two entities can also be predicted from the presence of other relations, either directly or through common neighbors. As an example, knowing where someone studies and who their friends are is useful for predicting possible workplaces. Incorporating such intuition as "social signals" into our current model is the prime candidate for immediate future work. Another direction would be to update the knowledge graph based on newer facts that become available over time from streaming data sources.

Acknowledgement

This work was supported by the Analysis in Motion Initiative at Pacific Northwest National Laboratory, which is operated by Battelle Memorial Institute, and by Mohammad Al Hasan's NSF CAREER Award (IIS-1149851).

References

[1] A. Bordes and E. Gabrilovich. Constructing and mining web-scale knowledge graphs. In SIGKDD, 2014.

[2] R. Bro. PARAFAC. Tutorial and applications. Chemometrics and Intelligent Laboratory Systems, 1997.

[3] K.-W. Chang, W.-t. Yih, B. Yang, and C. Meek. Typed tensor decomposition of knowledge bases for relation extraction. In ACL, 2014.

[4] P. Chen and A. O. Hero III. Deep community detection. IEEE Transactions on Signal Processing, 2015.

[5] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. Natural language processing (almost) from scratch. JMLR, 2011.

[6] X. Dong, E. Gabrilovich, G. Heitz, W. Horn, N. Lao, K. Murphy, T. Strohmann, S. Sun, and W. Zhang. Knowledge Vault: A web-scale approach to probabilistic knowledge fusion. In SIGKDD, 2014.

[7] L. Drumond, S. Rendle, and L. Schmidt-Thieme. Predicting RDF triples in incomplete knowledge bases with tensor factorization. In ACM SAC, 2012.

[8] O. Etzioni, M. Banko, S. Soderland, and D. S. Weld. Open information extraction from the web. Communications of the ACM, pages 68–74, 2008.

[9] D. A. Ferrucci, E. W. Brown, J. Chu-Carroll, J. Fan, D. Gondek, A. Kalyanpur, A. Lally, J. W. Murdock, E. Nyberg, J. M. Prager, N. Schlaefer, and C. A. Welty. Building Watson: An overview of the DeepQA project. AI Magazine, 2010.

[10] N. Friedman, L. Getoor, D. Koller, and A. Pfeffer. Learning probabilistic relational models. In IJCAI, 1999.

[11] M. A. Hasan, V. Chaoji, S. Salem, and M. Zaki. Link prediction using supervised learning. In Proc. of SDM 06 Workshop on Link Analysis, Counterterrorism and Security, 2006.

[12] J. Hoffart, F. M. Suchanek, K. Berberich, E. Lewis-Kelham, G. de Melo, and G. Weikum. YAGO2: Exploring and querying world knowledge in time, space, context, and many languages. In WWW, 2011.

[13] R. Jenatton, N. L. Roux, A. Bordes, and G. R. Obozinski. A latent factor model for highly multi-relational data. In NIPS, 2012.

[14] S. Jiang, D. Lowd, and D. Dou. Learning to refine an automatically extracted knowledge base using Markov logic. In ICDM, pages 912–917, 2012.

[15] X. Jiang, V. Tresp, Y. Huang, and M. Nickel. Link prediction in multi-relational graphs using additive models. In SeRSy, pages 1–12, 2012.

[16] Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. IEEE Computer, pages 30–37, 2009.

[17] N. Lao and W. W. Cohen. Relational retrieval using a combination of path-constrained random walks. Machine Learning, pages 53–67, 2010.

[18] N. Lao, T. Mitchell, and W. W. Cohen. Random walk inference and learning in a large scale knowledge base. In EMNLP, pages 529–539, 2011.

[19] D. Liben-Nowell and J. Kleinberg. The link prediction problem for social networks. In CIKM, 2003.

[20] M. Nickel, X. Jiang, and V. Tresp. Reducing the rank in relational factorization models by including observable patterns. In NIPS, pages 1179–1187, 2014.

[21] M. Nickel, K. Murphy, V. Tresp, and E. Gabrilovich. A review of relational machine learning for knowledge graphs: From multi-relational link prediction to automated knowledge graph construction. arXiv preprint arXiv:1503.00759, 2015.

[22] M. Nickel, V. Tresp, and H.-P. Kriegel. A three-way model for collective learning on multi-relational data. In ICML, pages 809–816, 2011.

[23] S. Rendle, C. Freudenthaler, Z. Gantner, and L. Schmidt-Thieme. BPR: Bayesian personalized ranking from implicit feedback. In UAI, 2009.

[24] M. Richardson and P. Domingos. Markov logic networks. Machine Learning, pages 107–136, 2006.

[25] S. Riedel, L. Yao, A. McCallum, and B. M. Marlin. Relation extraction with matrix factorization and universal schemas. In HLT-NAACL, 2013.

[26] T. K. Saha, B. Zhang, and M. Al Hasan. Name disambiguation from link data in a collaboration graph using temporal and topological features. Social Network Analysis and Mining, pages 1–14, 2015.

[27] R. Socher, D. Chen, C. D. Manning, and A. Ng. Reasoning with neural tensor networks for knowledge base completion. In NIPS, pages 926–934, 2013.

[28] C. H. Tan, E. Agichtein, P. Ipeirotis, and E. Gabrilovich. Trust, but verify: Predicting contribution quality for knowledge base construction and curation. In WSDM, pages 553–562, 2014.

[29] B. Taskar, P. Abbeel, and D. Koller. Discriminative probabilistic models for relational data. In UAI, 2002.

[30] B. Zhang, T. K. Saha, and M. Al Hasan. Name disambiguation from link data in a collaboration graph. In ASONAM, 2014.