Personalized PageRank based community detection


David F. Gleich, Purdue University

Joint work with C. Seshadhri, Joyce Jiyoung Whang, and Inderjit S. Dhillon, supported by NSF CAREER 1149756-CCF

Code: bit.ly/dgleich-codes

Today’s talk

1.  Personalized PageRank based community detection

2.  Conductance, Egonets, and Network Community Profiles

3.  Egonet seeding

4.  Improved seeding

David Gleich · Purdue 2 MLG2013

A community is a set of vertices that is denser inside than out.


[Figure: 250 node GEOP network in 2 dimensions]

We can find communities using Personalized PageRank (PPR) [Andersen et al. 2006]

PPR is a Markov chain on nodes:

1.  with probability 𝛼, follow a random edge
2.  with probability 1-𝛼, restart at a seed

aka random surfer, aka random walk with restart

It has a unique stationary distribution.
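The Markov chain described here can be made concrete with a small sketch. It assumes the graph is stored as a dictionary of neighbor sets and simply applies the two rules repeatedly until the distribution settles. The function name `ppr_power_iteration`, the value alpha=0.85, the iteration count, and the six-vertex example graph are illustrative assumptions, not taken from the talk.

```python
def ppr_power_iteration(G, seeds, alpha=0.85, iters=200):
    """Approximate the PPR stationary distribution by power iteration.

    G maps each vertex to the set of its neighbors (undirected graph,
    every vertex assumed to have at least one neighbor).
    """
    restart = {v: 0.0 for v in G}
    for s in seeds:
        restart[s] = 1.0 / len(seeds)
    x = dict(restart)
    for _ in range(iters):
        # with probability 1-alpha, restart at a seed
        nxt = {v: (1 - alpha) * restart[v] for v in G}
        # with probability alpha, follow a random edge
        for v in G:
            share = alpha * x[v] / len(G[v])
            for u in G[v]:
                nxt[u] += share
        x = nxt
    return x


# Hypothetical example graph: two triangles joined by the edge 2-3.
G = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
     3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
x = ppr_power_iteration(G, seeds=[0])
left = x[0] + x[1] + x[2]    # mass on the seed's triangle
right = x[3] + x[4] + x[5]   # mass on the other triangle
```

In this toy graph most of the probability mass stays on the seed's triangle, which is the intuition behind using PPR to find a community around a seed.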


Personalized PageRank community detection

1.  Given a seed, approximate the stationary distribution.
2.  Extract the community.

Both are local operations.


Demo!


Conductance communities

Conductance is one of the most important community scores [Schaeffer07].

The conductance of a set of vertices is the ratio of edges leaving to total edges:

$\phi(S) = \frac{\mathrm{cut}(S)}{\min(\mathrm{vol}(S), \mathrm{vol}(\bar{S}))}$, that is, (edges leaving the set) / (total edges in the set)

Equivalently, it’s the probability that a random edge leaves the set.

Small conductance ⇔ Good community

Example: cut(S) = 7, vol(S) = 33, vol($\bar{S}$) = 11, so $\phi(S)$ = 7/11.
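As a hedged illustration of the conductance definition, the sketch below computes cut(S), vol(S), and the volume of the complement for a graph stored as a dictionary of neighbor sets. The function name `conductance` and the small example graph are assumptions made for this example; exact fractions are used so the result can be compared with hand arithmetic.

```python
from fractions import Fraction


def conductance(G, S):
    """Conductance of vertex set S in an undirected graph G
    (dict of neighbor sets). Assumes both S and its complement
    contain at least one edge endpoint."""
    S = set(S)
    cut = sum(1 for v in S for u in G[v] if u not in S)
    vol_S = sum(len(G[v]) for v in S)
    vol_rest = sum(len(G[v]) for v in G if v not in S)
    return Fraction(cut, min(vol_S, vol_rest))


# Hypothetical example graph: two triangles joined by the edge 2-3.
G = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
     3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
phi_triangle = conductance(G, {0, 1, 2})

# The slide's numbers, plugged directly into the formula.
phi_slide = Fraction(7, min(33, 11))
```

For the triangle, one edge leaves a set of volume 7, so the conductance is 1/7; the slide's counts give 7/11 because the minimum of the two volumes is 11.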

Andersen-Chung-Lang personalized PageRank community theorem [Andersen et al. 2006]

Informally: suppose the seeds are in a set of good conductance. Then the personalized PageRank method will find a set with conductance that’s nearly as good.

… also, it’s really fast.

# G is graph as dictionary-of-sets; seed is a list of seed vertices
import collections
alpha = 0.99
tol = 1e-4

x = {}  # Store x, r as dictionaries
r = {}  # initialize residual
Q = collections.deque()  # initialize queue
for s in seed:
    r[s] = 1. / len(seed)
    Q.append(s)
while len(Q) > 0:
    v = Q.popleft()  # v has r[v] > tol*deg(v)
    if v not in x: x[v] = 0.
    x[v] += (1 - alpha) * r[v]
    mass = alpha * r[v] / (2 * len(G[v]))
    for u in G[v]:  # for neighbors of v
        if u not in r: r[u] = 0.
        if r[u] < len(G[u]) * tol and \
           r[u] + mass >= len(G[u]) * tol:
            Q.append(u)  # add u to queue if large
        r[u] = r[u] + mass
    r[v] = mass * len(G[v])
    # not on the slide: re-queue v if its residual is still large
    if r[v] >= len(G[v]) * tol: Q.append(v)
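The code on this slide covers the first step, approximating the PPR vector x. The slides do not show the second step, extracting the community. A common choice, assumed here rather than taken from the talk, is a sweep cut: order vertices by x[v]/deg(v) and keep the prefix with the smallest conductance. The function name `sweep_cut` and the hand-written vector `x` below are illustrative.

```python
def sweep_cut(G, x):
    """Return (best_set, best_conductance) over prefixes of the
    degree-normalized ordering of the vertices in x.
    G is an undirected graph as a dict of neighbor sets."""
    total_vol = sum(len(G[v]) for v in G)
    order = sorted(x, key=lambda v: x[v] / len(G[v]), reverse=True)
    S = set()
    vol = 0
    cut = 0
    best_set, best_cond = None, float("inf")
    for v in order:
        S.add(v)
        vol += len(G[v])
        # edges from v into S stop being cut edges; the rest become cut edges
        inside = sum(1 for u in G[v] if u in S)
        cut += len(G[v]) - 2 * inside
        denom = min(vol, total_vol - vol)
        if denom == 0:
            break  # S covers all the volume; no complement left
        cond = cut / denom
        if cond < best_cond:
            best_cond = cond
            best_set = set(S)
    return best_set, best_cond


# Hypothetical example: two triangles joined by the edge 2-3,
# with a made-up PPR-like vector concentrated near vertex 0.
G = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
     3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
x = {0: 0.4, 1: 0.2, 2: 0.25, 3: 0.1, 4: 0.03, 5: 0.02}
best_set, best_cond = sweep_cut(G, x)
```

Only vertices with a nonzero entry in x are visited, so the sweep touches the same small part of the graph as the push loop does.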


Demo 2!

Problem 1, which seeds?

Problem 2, not fast enough.

Gleich-Seshadhri, KDD 2012

Neighborhoods are good communities

Gleich-Seshadhri, KDD 2012
Egonets and Conductance

Vertex neighborhoods (egonets) are good conductance communities

… in graphs that look like social and information networks

Vertex neighborhoods or Egonets

The induced subgraph of the set containing a vertex and its neighbors.

Prior research on egonets of social networks from the “structural holes” perspective [Burt95,Kleinberg08].

Used for anomaly detection [Akoglu10], community seeds [Huang11,Schaeffer11], overlapping communities [Schaeffer07,Rees10].

Simple version of theorem

If global clustering coefficient = 1, then the graph is a disjoint union of cliques.

Vertex neighborhoods are optimal communities!

Theorem

Condition: Let graph G have clustering coefficient 𝜅 and have vertex degrees bounded by a power-law function with exponent 𝛾 less than 3.

[Figure: log probability vs. log degree; the degree distribution lies between the bounds $\alpha_1 n/d^{\gamma}$ and $\alpha_2 n/d^{\gamma}$]

Then there exists a vertex neighborhood with conductance

$\phi \le 4(1-\kappa)/(3-2\kappa)$

Confession
The theory is weak

$\phi(S) \le 4(1-\kappa)/(3-2\kappa)$

Collaboration networks: 𝜅 ~ [0.1 – 0.5]
Social networks: 𝜅 ~ [0.05 – 0.1]
Tech. networks: 𝜅 ~ [0.005 – 0.05]

Graph             Verts     Edges      Avg. Deg.  Max Deg.  𝜅      C̄
ca-AstroPh        17903     196972     22.0       504       0.318  0.633
email-Enron       33696     180811     10.7       1383      0.085  0.509
cond-mat-2005     36458     171735     9.4        278       0.243  0.657
arxiv             86376     517563     12.0       1253      0.560  0.678
dblp              226413    716460     6.3        238       0.383  0.635
hollywood-2009    1069126   56306653   105.3      11467     0.310  0.766
fb-Penn94         41536     1362220    65.6       4410      0.098  0.212
fb-A-oneyear      1138557   4404989    7.7        695       0.038  0.060
fb-A              3097165   23667394   15.3       4915      0.048  0.097
soc-LiveJournal1  4843953   42845684   17.7       20333     0.118  0.274
oregon2-010526    11461     32730      5.7        2432      0.037  0.352
p2p-Gnutella25    22663     54693      4.8        66        0.005  0.005
as-22july06       22963     48436      4.2        2390      0.011  0.230
itdk0304          190914    607610     6.4        1071      0.061  0.158

This bound is useless unless 𝜅 ≥ 1/2
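To see why the bound only helps for large 𝜅, it can be evaluated directly. Conductance never exceeds 1, so a bound of 1 or more says nothing. The sketch below is plain arithmetic on the formula from the slide; the function name `egonet_bound` is an illustrative assumption.

```python
def egonet_bound(kappa):
    """Evaluate the conductance bound 4(1 - kappa) / (3 - 2 kappa)."""
    return 4 * (1 - kappa) / (3 - 2 * kappa)


# Clustering coefficients taken from the table on this slide.
bound_astroph = egonet_bound(0.318)   # ca-AstroPh
bound_arxiv = egonet_bound(0.560)     # arxiv
bound_half = egonet_bound(0.5)        # the threshold case
```

Setting 4(1 - 𝜅)/(3 - 2𝜅) < 1 gives 1 < 2𝜅, so the bound is below 1 only when 𝜅 > 1/2. Among the graphs in the table, only arxiv (𝜅 = 0.560) clears that threshold, with a bound of about 0.94.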

We view this theory as “intuition for the truth”

Empirical Evaluation using Network Community Profiles

large global clustering coefficients and large mean clustering coefficients.

Social networks. The nodes are people again, and the edges are either explicit “friend” relationships (fb-Penn94 [36], fb-A [39], soc-LiveJournal [4]) or observed network activity over edges in a one-year span (fb-A-oneyear [39]).

Technological networks. The nodes act in a communication network either as agents (p2p-Gnutella25 [27]) or as routers (oregon2 [23], as-22july06 [29], itdk0304 [10]). The edges are observed communications between the nodes.

Web graphs. The nodes are web-pages, and the edges are symmetrized links between the pages [25].

6. EMPIRICAL NEIGHBORHOOD COMMUNITIES

6.1 Computation

We first show that we can adapt any procedure to compute all local clustering coefficients to compute the conductance scores for each neighborhood in the graph. Most of the work to compute a local clustering coefficient is performed when finding the number of triangles at the vertex. We can express the number of triangles with v as:

$\mathrm{edges}(N_1(v) \setminus \{v\})/2$

because each edge among v’s neighbors produces a triangle (recall that the edges function double-counts). Note also that $\mathrm{edges}(N_1(v) \setminus \{v\})/2 = \mathrm{edges}(N_1(v))/2 - d_v$. Then $\mathrm{cut}(N_1(v)) = \mathrm{vol}(N_1(v)) - \mathrm{edges}(N_1(v))$. And so, given the number of triangles, we can compute the cut given the volume of the neighborhood as well. This is easy to do with any graph structure that explicitly stores the degrees.
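The identities in this section can be sketched in a few lines. The version below is a minimal illustration, assuming an undirected graph stored as a dictionary of neighbor sets with no self-loops; the function name `egonet_conductances` and the example graph are assumptions, and the triangle count is done by brute force rather than by any particular fast triangle-counting routine.

```python
def egonet_conductances(G):
    """Map each vertex v to (size, conductance) of its neighborhood N1(v),
    computed from the triangle count at v and the stored degrees."""
    deg = {v: len(G[v]) for v in G}
    total_vol = sum(deg.values())
    scores = {}
    for v in G:
        # triangles at v = edges among v's neighbors (each found twice)
        tri = sum(1 for u in G[v] for w in G[u] if w in G[v]) // 2
        # edges(N1(v)) double-counts: edges among neighbors plus d_v spokes
        edges_n1 = 2 * (tri + deg[v])
        vol = deg[v] + sum(deg[u] for u in G[v])
        cut = vol - edges_n1
        denom = min(vol, total_vol - vol)
        if denom > 0:  # skip neighborhoods that cover the whole graph
            scores[v] = (deg[v] + 1, cut / denom)
    return scores


# Hypothetical example graph: two triangles joined by the edge 2-3.
G = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
     3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
scores = egonet_conductances(G)
```

For vertex 0 the neighborhood is its own triangle, with one edge leaving a set of volume 7; for vertex 2 the neighborhood also picks up vertex 3, which raises the cut to 2.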

6.2 Quality of neighborhood communities

We use Leskovec et al.’s [24] network community plot to show the information on all neighborhood communities simultaneously. These plots will help us understand if the neighborhood communities are high quality (low conductance), and how they compare to other community detection methods. Given the conductance scores from all the neighborhood communities and their size in terms of number of vertices, we first identify the best community at each size. The network community plot shows the relationship between best community conductance and community size on a log-log scale. In Leskovec et al., they found that these plots had a characteristic shape for modern information networks: an initial sharp decrease until the community size is between 100 and 1000, then a considerable rise in the conductance scores for larger communities. In our case, neighborhood communities cannot be any larger than the maximum degree plus one, and so we mark this point on the figures. We always look at the smaller side of the cut, so no community can be larger than half the vertices of the graph. We also mark this location on the plots. Each subsequent figure in this paper utilizes this size-vs-conductance plot, and we will continually layer information from new methods above results from old methods. The results are information-dense plots that need slightly more study than would be ideal; however, we point out the salient features in each plot in the text. Note also that we deliberately attempt to preserve the axes limits across figures to promote comparisons. However, some figures have different axis limits to exhibit the range of data.
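The step "identify the best community at each size" amounts to a minimum per size. A minimal sketch, assuming the input is a list of (size, conductance) pairs such as one pair per vertex neighborhood; the function name `community_profile` and the sample values are illustrative, not data from the paper.

```python
def community_profile(scores):
    """Given (size, conductance) pairs, keep the smallest conductance
    seen at each size and return the points sorted by size."""
    best = {}
    for size, cond in scores:
        if size not in best or cond < best[size]:
            best[size] = cond
    return sorted(best.items())


# Made-up scores for illustration only.
pairs = [(3, 0.5), (3, 0.2), (4, 0.4), (2, 1.0)]
profile = community_profile(pairs)
```

Plotting the resulting points on log-log axes would give the size-vs-conductance plot the text describes.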

[Figure 2: network community plots for six networks (panel titles: web-Google, itdk0304, fb-A-oneyear, arxiv, soc-LiveJournal1, ca-AstroPh). x-axis: number of vertices in cluster (log scale); y-axis: conductance (log scale); the plots mark maxdeg and verts/2.]

Figure 2: The best neighborhood community conductance at each size (black) and the Fiedler community (red). (Note the axis limits on ca-AstroPh).

First, we show these network community plots for six of the networks in Figure 2. These figures are representative of the best and worst of our results. Plots for other graphs are available on the website given in the introduction.

The three graphs on the left show cases where a neighborhood community is or is nearby the best Fiedler community (the red circle). The three graphs on the right highlight instances where the Fiedler community is much better than any neighborhood community. We find it mildly surprising that these neighborhood communities can be as good as the Fiedler community. The structure of the plot for both fb-A-oneyear and soc-LiveJournal1 is instructive. Neighborhoods of the highest degree vertices are not community-like – suggesting that these nodes are somehow exceptional. In fact, by inspection of these communities, many of them are nearly a star graph. However, a few of the large degree nodes define strikingly good communities (these are sets with a few hundred vertices with conductance scores of around $10^{-2}$). This evidence concurs with the intuition from Theorem 4.6.

6.3 Comparison to PPR communities

Note that these plots show the same shape as observed by Leskovec et al. [24]. Consequently, in the next set of figures, and in the remainder of the empirical investigation,

[Figure: network community profile. x-axis: community size; y-axis: minimum conductance for any community of the given size]

Approximate canonical shape found by Leskovec, Lang, Dasgupta, and Mahoney.

Holds for a variety of approximations to conductance.

Empirical Evaluation using Network Community Profiles

[Figure: egonet community profile. x-axis: community size (degree + 1); y-axis: minimum conductance for any neighborhood of the given size]

“Egonet community profile” shows the same shape, 3 secs to compute.

1.1M verts, 4M edges

The Fiedler community computed from the normalized Laplacian is a neighborhood!


Facebook data from Wilson et al. 2009


Not just one graph

[Figures: egonet community profiles for two more graphs]

arXiv – 86k verts, 500k edges
soc-LiveJournal – 5M verts, 42M edges

15 more graphs available: www.cs.purdue.edu/~dgleich/codes/neighborhoods

Filling in the Network Community Profile

[Figure: egonet community profile. x-axis: community size (degree + 1); y-axis: minimum conductance for any neighborhood of the given size]

We are missing a region of the NCP when we just look at neighborhoods.

Facebook Sample - 1.1M verts, 4M edges

Filling in the Network Community Profile

[Figure: network community profile filled in with PPR communities. x-axis: community size; y-axis: minimum conductance for any community of the given size]

7807 seconds to compute.

This region fills when using the PPR method (like now!)


Am I a good seed?
Locally Minimal Communities

“My conductance is the best locally.”

$\phi(N(v)) \le \phi(N(w))$ for all w adjacent to v

In Zachary’s Karate Club network, there are four locally minimal communities, the two leaders and two peripheral nodes.
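The locally minimal test is a direct comparison of each vertex's neighborhood conductance with those of its neighbors. A minimal sketch, assuming the neighborhood conductances are already available in a dictionary `phi`; the function name `locally_minimal`, the path graph, and the conductance values are made up for illustration.

```python
def locally_minimal(G, phi):
    """Vertices v whose neighborhood conductance phi[v] is no larger
    than phi[w] for every neighbor w of v."""
    return [v for v in G if all(phi[v] <= phi[w] for w in G[v])]


# Hypothetical path graph a-b-c-d with made-up neighborhood conductances.
G = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
phi = {"a": 0.3, "b": 0.2, "c": 0.5, "d": 0.4}
seeds = locally_minimal(G, phi)
```

Here `b` beats both of its neighbors and `d` beats its only neighbor `c`, so both qualify, even though `d` has a worse score than `a`. The test is purely local.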



Locally minimal communities capture extremal neighborhoods

[Figure: egonet community profile with locally minimal communities marked. x-axis: community size]

Red dots are the conductance and size of a locally minimal community. Usually about 1% of the number of vertices.

The red circles – the best local mins – find the extremes in the egonet profile.

Filling in the NCP
Growing locally minimal comm.

[Figure: network community profiles comparing the original egonet profile, the locally minimal NCP, and the full NCP. x-axis: community size]

PPR growing only locally min communities, seeded from entire egonet

Original egonet: 3 seconds
Locally min NCP: 283 seconds
Full NCP: 7807 seconds

But there’s a small problem. Most people want to cover a network with communities! We just looked at the best.



The coverage of egonet-grown communities is really bad.


[Figure: maximum conductance vs. coverage on a log-log scale; labeled points: 0.7% coverage at q=0.10, 2.2% at q=0.25, 29.8% at q=0.88]

Facebook network: with a conductance of 0.1 (not so good) we only cover 1% of the vertices in the network.
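One way to build a coverage-versus-conductance curve like the one on this slide is sketched below. It is an assumption about the construction, not a description of the authors' code: communities are taken in order of increasing conductance, and after each one the fraction of vertices covered so far is recorded next to the largest conductance admitted. The function name `coverage_curve` and the sample communities are illustrative.

```python
def coverage_curve(communities, n_vertices):
    """communities: list of (vertex_set, conductance) pairs.
    Returns (coverage_fraction, max_conductance) points, adding
    communities from best (lowest) conductance to worst."""
    covered = set()
    curve = []
    for members, cond in sorted(communities, key=lambda c: c[1]):
        covered |= set(members)
        curve.append((len(covered) / n_vertices, cond))
    return curve


# Made-up communities on a 10-vertex graph, for illustration only.
communities = [({1, 2}, 0.3), ({2, 3, 4}, 0.1), ({5}, 0.9)]
curve = coverage_curve(communities, n_vertices=10)
```

Overlapping communities only add their new vertices, which is why coverage can grow slowly even when many communities are admitted.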

Whang-Gleich-Dhillon, CIKM2013 [upcoming…]

1.  Extract part of the graph that might have overlapping communities.

2.  Compute a partitioning of the network into many pieces (think sqrt(n)) using Graclus.

3.  Find the center of these partitions.

4.  Use PPR to grow egonets of these centers.

Table 4: Returned number of clusters and graph coverage of each algorithm

Graph        Metric           random  egonet  graclus ctr.  spread hubs  demon    bigclam
HepPh        coverage (%)     97.1    72.1    100           100          88.8     62.1
             no. of clusters  97      241     109           100          5,138    100
AstroPh      coverage (%)     97.6    71.1    100           100          94.2     62.3
             no. of clusters  192     282     256           212          8,282    200
CondMat      coverage (%)     92.4    99.5    100           100          91.2     79.5
             no. of clusters  199     687     257           202          10,547   200
DBLP         coverage (%)     99.9    86.3    100           100          84.9     94.6
             no. of clusters  21,272  8,643   18,477        26,503       174,627  25,000
Amazon       coverage (%)     99.9    100     100           100          79.2     99.2
             no. of clusters  21,553  14,919  20,036        27,763       105,828  25,000
Flickr       coverage (%)     76.0    54.0    100           93.6         -        52.1
             no. of clusters  14,638  24,150  16,347        15,349       -        15,000
LiveJournal  coverage (%)     88.9    66.7    99.8          99.8         -        43.9
             no. of clusters  14,850  34,389  16,271        15,058       -        15,000
Myspace      coverage (%)     91.4    69.1    100           99.9         -        -
             no. of clusters  14,909  67,126  16,366        15,324       -        -

[Figure 2 panels: (a) AstroPh, (b) HepPh, (c) CondMat, (d) Flickr, (e) LiveJournal, (f) Myspace. x-axis: coverage (percentage); y-axis: maximum conductance; curves: egonet, graclus centers, spread hubs, random, demon, bigclam]

Figure 2: Conductance vs. graph coverage – lower curve indicates better communities. Overall, “graclus centers” outperforms other seeding strategies, including the state-of-the-art methods Demon and Bigclam.

Flickr social network, 2M vertices, 22M edges

We can cover 95% of network with communities of cond. ~0.15.

A good partitioning helps!

[Figure 3 panels: DBLP and Amazon; bars for F1 and F2; methods: demon, bigclam, graclus centers, spread hubs, random, egonet]

Figure 3: F1 and F2 measures comparing our algorithmic communities to ground truth – a higher bar indicates better communities.

[Figure 4: run time (hours) on Amazon and DBLP]

Figure 4: Runtime on Amazon and DBLP – The seed set expansion algorithm is faster than Demon and Bigclam.

5.4 Comparison of running times

Finally, we compare the algorithms by runtime. Figure 4 and Table 5 show the runtime of each algorithm. We run the single thread version of Bigclam for HepPh, AstroPh, CondMat, DBLP, and Amazon networks, and use the multi-threaded version with 20 threads for Flickr, Myspace, and LiveJournal networks.

As can be seen in Figure 4, the seed set expansion methods are much faster than Demon and Bigclam on DBLP and Amazon networks. On small networks (HepPh, AstroPh, CondMat), our algorithm with “spread hubs” is faster than Demon and Bigclam. On large networks (Flickr, LiveJournal, Myspace), our seed set expansion methods are much faster than Bigclam even though we compare a single-threaded implementation of our method with 20 threads for Bigclam.

6. DISCUSSION AND CONCLUSION

We now discuss the results from our experimental investigations. First, we note that our seed set expansion method was the only method that worked on all of the problems. Also, our method is faster than both Bigclam and Demon.

Our seed set expansion algorithm is also easy to parallelize because each seed can be expanded independently. This property indicates that the runtime of the seed set expansion method could be further reduced in a multi-threaded version. Also, we can use any other high quality partitioning scheme instead of Graclus including those with parallel and distributed implementations [25]. Perhaps surprisingly, the major difference in cost between using Graclus centers for the seeds and the other seed choices does not result from the expense of running Graclus. Rather, it arises because the personalized PageRank expansion technique takes longer for the seeds chosen by Graclus and spread hubs. When the PageRank expansion method has a larger input set, it tends to take longer, and the input sets we provide for the spread hubs and Graclus seeding strategies are the neighborhood sets of high degree vertices.

Another finding that emerges from our results is that using random seeds outperforms both Bigclam and Demon. We believe there are two reasons for this finding. First, random seeds are likely to be in some set of reasonable conductance as also discussed by Andersen and Lang [5]. Second, and importantly, a recent study by Abrahao [2] showed that personalized PageRank clusters are topologically similar to real-world clusters. Any method that uses this technique will find clusters that look real.

Finally, we wish to address the relationship between our results and some prior observations on overlapping communities. The authors of Bigclam found that the dense regions of a graph reflect areas of overlap between overlapping communities. By using a conductance measure, we ought to find only these dense regions – however, our method produces much larger communities that cover the entire graph. The reason for this difference is that we use the entire vertex neighborhood as the restart for the personalized PageRank expansion routine. We avoid seeding exclusively inside a dense region by using an entire vertex neighborhood as a seed, which grows the set beyond the dense region. Thus, the communities we find likely capture a combination of communities given by the egonet of the original seed node. To expand on this point, in experiments we omit due to space, we found that seeding solely on the node itself – rather than us-

Using datasets from Yang and Leskovec (WSDM 2013) with known overlapping community structure

Our methods outperform current state-of-the-art overlapping community detection methods.

Even randomly seeded!

And helps to find real-world overlapping communities too.

Conclusion & Discussion

PPR community detection is fast [Andersen et al. FOCS06]

PPR communities look real [Abrahao et al. KDD2012; Zhu et al. ICML2013]

Egonet analysis reveals basis of NCP

Partitioning for seeding yields high coverage & real communities.

“Caveman” communities?

References

Gleich & Seshadhri KDD2012
Whang, Gleich & Dhillon CIKM2013

PPR Sample: bit.ly/18khzO5
Egonet seeding: bit.ly/dgleich-code

Best conductance cut at intersection of communities?

Proof Sketch

1) Large clustering coefficient ⇒ many wedges are closed
2) Heavy tailed degree dist ⇒ a few vertices have a very large degree
3) Large degree ⇒ O(d^2) wedges ⇒ “most” of wedges

Thus, there must exist a vertex with a high edge density ⇒ “good” conductance

Use the probabilistic method to formalize

[Figure: CDF of number of wedges vs. degree (log-scale x-axis)]

