My talk at MLG 2013 on using Personalized PageRank to find communities.
Personalized PageRank based Community Detection
David F. Gleich, Purdue University
Joint work with C. Seshadhri, Joyce Jiyoung Whang, and Inderjit S. Dhillon, supported by NSF CAREER 1149756-CCF
Code: bit.ly/dgleich-codes
Today’s talk
1. Personalized PageRank based community detection
2. Conductance, Egonets, and Network Community Profiles
3. Egonet seeding
4. Improved seeding
David Gleich · Purdue 2 MLG2013
A community is a set of vertices that is denser inside than out.
[Figure: 250 node GEOP network in 2 dimensions]
We can find communities using Personalized PageRank (PPR) [Andersen et al. 2006]
PPR is a Markov chain on nodes
1. with probability 𝛼, follow a random edge
2. with probability 1-𝛼, restart at a seed
aka random surfer
aka random walk with restart
unique stationary distribution
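As a rough illustration of the chain just described, its stationary distribution can be approximated by repeating the update "move along a random edge with probability 𝛼, restart at a seed with probability 1-𝛼". This is a minimal sketch and not the local method used later in the talk; the function name `ppr_power_iteration`, the value of `alpha`, the iteration count, and the toy graph are all assumptions made for illustration.

```python
def ppr_power_iteration(G, seeds, alpha=0.85, iters=200):
    """Approximate personalized PageRank on an undirected graph.

    G maps each vertex to the set of its neighbors (no isolated vertices).
    """
    # Restart distribution: uniform over the seed vertices.
    restart = {v: 0.0 for v in G}
    for s in seeds:
        restart[s] = 1.0 / len(seeds)
    x = dict(restart)
    for _ in range(iters):
        # With probability 1 - alpha the walk restarts at a seed.
        nxt = {v: (1.0 - alpha) * restart[v] for v in G}
        # With probability alpha it follows a uniformly random edge.
        for v in G:
            share = alpha * x[v] / len(G[v])
            for u in G[v]:
                nxt[u] += share
        x = nxt
    return x

# Hypothetical toy graph: two triangles joined by the edge (2, 3).
G = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
x = ppr_power_iteration(G, seeds=[0])
```

The resulting vector sums to one and puts more mass on the seed's triangle than on the far triangle, which is the signal the community extraction step relies on.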
Personalized PageRank community detection
1. Given a seed, approximate the stationary distribution.
2. Extract the community.
Both are local operations.
Demo!
Conductance communities
Conductance is one of the most important community scores [Schaeffer07]
The conductance of a set of vertices is the ratio of edges leaving to total edges:
$\phi(S) = \frac{\text{cut}(S)}{\min(\text{vol}(S), \text{vol}(\bar{S}))}$, that is, (edges leaving the set) / (total edges in the set)
Equivalently, it’s the probability that a random edge leaves the set.
Small conductance ⇔ Good community
Example: cut(S) = 7, vol(S) = 33, vol(S̄) = 11, so $\phi(S) = 7/11$
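The conductance definition can be sketched directly in code. The snippet below assumes the same dictionary-of-sets graph representation used elsewhere in the talk, with no self-loops; the function name `conductance` and the toy graph are illustrative assumptions. Note that the denominator uses the smaller of the two volumes, which is why the slide's example gives 7/11 rather than 7/33.

```python
def conductance(G, S):
    """Conductance of vertex set S in an undirected graph G (dict of sets).

    Assumes S is neither empty nor the whole graph, so the denominator
    is nonzero.
    """
    S = set(S)
    # cut(S): number of edges with exactly one endpoint in S.
    cut = sum(1 for v in S for u in G[v] if u not in S)
    # vol(S): sum of the degrees of the vertices in S.
    vol_S = sum(len(G[v]) for v in S)
    vol_rest = sum(len(G[v]) for v in G) - vol_S
    return cut / min(vol_S, vol_rest)

# Hypothetical toy graph: two triangles joined by the edge (2, 3).
G = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
phi_triangle = conductance(G, {0, 1, 2})  # one cut edge, volume 7 on each side
phi_single = conductance(G, {0})          # both edges of vertex 0 leave the set
```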
Andersen-Chung-Lang personalized PageRank community theorem [Andersen et al. 2006]
Informally: Suppose the seeds are in a set of good conductance, then the personalized PageRank method will find a set with conductance that’s nearly as good.
… also, it’s really fast.
import collections

# G is the graph as a dictionary-of-sets; seed is a list of seed vertices.
# Both are assumed to be defined before this point.
alpha = 0.99
tol = 1e-4

x = {}  # store x, r as dictionaries
r = {}  # initialize residual
Q = collections.deque()  # initialize queue
for s in seed:
    r[s] = 1.0 / len(seed)
    Q.append(s)
while len(Q) > 0:
    v = Q.popleft()  # v has r[v] > tol*deg(v)
    if v not in x: x[v] = 0.
    x[v] += (1 - alpha) * r[v]
    mass = alpha * r[v] / (2 * len(G[v]))
    for u in G[v]:  # for neighbors of v
        if u not in r: r[u] = 0.
        if r[u] < len(G[u]) * tol and \
           r[u] + mass >= len(G[u]) * tol:
            Q.append(u)  # add u to queue if large
        r[u] = r[u] + mass
    r[v] = mass * len(G[v])
    # Not on the slide: re-queue v if its remaining residual is still large,
    # so that every vertex above the threshold stays in the queue.
    if r[v] >= len(G[v]) * tol: Q.append(v)
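The code above covers only the first step, approximating the stationary distribution. The second step, extracting the community, is commonly done with a sweep cut: sort vertices by their degree-normalized PPR value and keep the prefix with the smallest conductance. The sketch below assumes that procedure; the function name `sweep_cut`, the toy graph, and the hand-picked vector `x` are hypothetical, and the graph is assumed to have no self-loops.

```python
def sweep_cut(G, x):
    """Return (best_set, best_conductance) over prefixes of the sweep order."""
    total_vol = sum(len(G[v]) for v in G)
    # Sweep order: decreasing PPR value divided by degree.
    order = sorted(x, key=lambda v: x[v] / len(G[v]), reverse=True)
    S = set()
    vol = 0
    cut = 0
    best_phi, best_size = float("inf"), 0
    for i, v in enumerate(order):
        # Edges from v into S stop being cut edges; the rest become cut edges.
        inside = sum(1 for u in G[v] if u in S)
        vol += len(G[v])
        cut += len(G[v]) - 2 * inside
        S.add(v)
        denom = min(vol, total_vol - vol)
        if denom == 0:
            break  # the prefix now covers all of the volume
        phi = cut / denom
        if phi < best_phi:
            best_phi, best_size = phi, i + 1
    return set(order[:best_size]), best_phi

# Hypothetical toy graph (two triangles joined by one edge) and PPR-like vector.
G = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
x = {0: 0.4, 1: 0.2, 2: 0.25, 3: 0.1, 4: 0.03, 5: 0.02}
community, phi = sweep_cut(G, x)
```

Because only vertices with nonzero entries in `x` are swept, the work stays proportional to the size of the region the PPR computation touched, apart from the one-time total volume, which a real implementation would precompute.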
Demo 2!
Problem 1, which seeds?
Problem 2, not fast enough.
Gleich-Seshadhri, KDD 2012
Neighborhoods are good communities
Gleich-Seshadhri, KDD 2012: Egonets and Conductance
Vertex neighborhoods are good conductance communities
Egonets? … in graphs that look like social and information networks
Vertex neighborhoods or Egonets
The induced subgraph of the set: a vertex and its neighbors.
Prior research on egonets of social networks from the “structural holes” perspective [Burt95, Kleinberg08].
Used for anomaly detection [Akoglu10], community seeds [Huang11, Schaeffer11], overlapping communities [Schaeffer07, Rees10].
Simple version of theorem
If global clustering coefficient = 1, then the graph is a disjoint union of cliques.
Vertex neighborhoods are optimal communities!
Theorem
Condition: Let graph G have clustering coefficient 𝜅 and have vertex degrees bounded by a power-law function with exponent 𝛾 less than 3.
Theorem: Then there exists a vertex neighborhood with conductance
$\phi(S) \le \frac{4(1 - \kappa)}{3 - 2\kappa}$
[Figure: log degree vs. log probability, with the degree distribution between the bounds $\alpha_1 n / d^{\gamma}$ and $\alpha_2 n / d^{\gamma}$]
Confession: The theory is weak
$\phi(S) \le \frac{4(1 - \kappa)}{3 - 2\kappa}$
Collaboration networks: 𝜅 ~ [0.1 – 0.5]
Social networks: 𝜅 ~ [0.05 – 0.1]
Graph             Verts    Edges     Avg. Deg.  Max Deg.  𝜅      C̄
ca-AstroPh        17903    196972    22.0       504       0.318  0.633
email-Enron       33696    180811    10.7       1383      0.085  0.509
cond-mat-2005     36458    171735    9.4        278       0.243  0.657
arxiv             86376    517563    12.0       1253      0.560  0.678
dblp              226413   716460    6.3        238       0.383  0.635
hollywood-2009    1069126  56306653  105.3      11467     0.310  0.766
fb-Penn94         41536    1362220   65.6       4410      0.098  0.212
fb-A-oneyear      1138557  4404989   7.7        695       0.038  0.060
fb-A              3097165  23667394  15.3       4915      0.048  0.097
soc-LiveJournal1  4843953  42845684  17.7       20333     0.118  0.274
oregon2-010526    11461    32730     5.7        2432      0.037  0.352
p2p-Gnutella25    22663    54693     4.8        66        0.005  0.005
as-22july06       22963    48436     4.2        2390      0.011  0.230
itdk0304          190914   607610    6.4        1071      0.061  0.158
Tech. networks: 𝜅 ~ [0.005 – 0.05]
This bound is useless unless 𝜅 ≥ 1/2
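To see why the bound says little for these graphs, it helps to evaluate 4(1 - 𝜅)/(3 - 2𝜅) at a few of the clustering coefficients listed in the table. The sketch below does that; the function name `neighborhood_bound` is an assumption, and the 𝜅 values are copied from the table above. A conductance is never larger than 1, so any value of the bound at or above 1 carries no information.

```python
def neighborhood_bound(kappa):
    """Upper bound 4(1 - kappa) / (3 - 2 kappa) on the best neighborhood conductance."""
    return 4.0 * (1.0 - kappa) / (3.0 - 2.0 * kappa)

# Clustering coefficients taken from the table on this slide.
examples = [("ca-AstroPh", 0.318), ("arxiv", 0.560), ("fb-A", 0.048)]
for name, kappa in examples:
    bound = neighborhood_bound(kappa)
    status = "informative" if bound < 1.0 else "vacuous"
    print(f"{name}: kappa={kappa}, bound={bound:.3f} ({status})")
```

The bound equals exactly 1 at 𝜅 = 1/2 and drops to 0 at 𝜅 = 1, which matches the earlier slide about disjoint unions of cliques.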
We view this theory as “intuition for the truth”
Empirical Evaluation using Network Community Profiles
large global clustering coefficients and large mean clustering coefficients.
Social networks. The nodes are people again, and the edges are either explicit “friend” relationships (fb-Penn94 [36], fb-A [39], soc-LiveJournal [4]) or observed network activity over edges in a one-year span (fb-A-oneyear [39]).
Technological networks. The nodes act in a communication network either as agents (p2p-Gnutella25 [27]) or as routers (oregon2 [23], as-22july06 [29], itdk0304 [10]). The edges are observed communications between the nodes.
Web graphs. The nodes are web-pages, and the edges are symmetrized links between the pages [25].
6. EMPIRICAL NEIGHBORHOOD COMMUNITIES
6.1 Computation
We first show that we can adapt any procedure to compute all local clustering coefficients to compute the conductance scores for each neighborhood in the graph. Most of the work to compute a local clustering coefficient is performed when finding the number of triangles at the vertex. We can express the number of triangles with v as:
$\text{edges}(N_1(v) \setminus \{v\})/2$
because each edge among v’s neighbors produces a triangle (recall that the edges function double-counts). Note also that $\text{edges}(N_1(v) \setminus \{v\})/2 = \text{edges}(N_1(v))/2 - d_v$. Then $\text{cut}(N_1(v)) = \text{vol}(N_1(v)) - \text{edges}(N_1(v))$. And so, given the number of triangles, we can compute the cut given the volume of the neighborhood as well. This is easy to do with any graph structure that explicitly stores the degrees.
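The identities in this subsection can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the function name `neighborhood_conductances`, the dictionary-of-sets graph, and the convention of scoring a neighborhood that covers the whole graph as 1.0 are assumptions. Triangles are counted by intersecting neighbor sets, and the cut follows from cut = vol - edges with the double-counted edge total 2·triangles + 2·d_v.

```python
def neighborhood_conductances(G):
    """Map each vertex v to (size, conductance) of its neighborhood N1(v)."""
    total_vol = sum(len(G[v]) for v in G)
    scores = {}
    for v in G:
        d_v = len(G[v])
        # Each edge among v's neighbors closes one triangle at v; the sum
        # below sees every such edge from both endpoints, hence the // 2.
        triangles = sum(len(G[u] & G[v]) for u in G[v]) // 2
        # Double-counted edges inside N1(v): neighbor-neighbor edges plus
        # the d_v edges incident to v itself.
        edges_in = 2 * triangles + 2 * d_v
        vol = d_v + sum(len(G[u]) for u in G[v])
        cut = vol - edges_in
        denom = min(vol, total_vol - vol)
        # Assumed convention: a neighborhood covering the whole graph gets 1.0.
        phi = cut / denom if denom > 0 else 1.0
        scores[v] = (d_v + 1, phi)
    return scores

# Hypothetical toy graph: two triangles joined by the edge (2, 3).
G = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
scores = neighborhood_conductances(G)
```

On this toy graph the neighborhood of vertex 0 is one of the triangles, with a single cut edge and volume 7, while the neighborhood of vertex 2 also absorbs vertex 3 and scores worse.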
6.2 Quality of neighborhood communities
We use Leskovec et al.’s [24] network community plot to show the information on all neighborhood communities simultaneously. These plots will help us understand if the neighborhood communities are high quality (low conductance), and how they compare to other community detection methods. Given the conductance scores from all the neighborhood communities and their size in terms of number of vertices, we first identify the best community at each size. The network community plot shows the relationship between best community conductance and community size on a log-log scale. In Leskovec et al., they found that these plots had a characteristic shape for modern information networks: an initial sharp decrease until the community size is between 100 and 1000, then a considerable rise in the conductance scores for larger communities. In our case, neighborhood communities cannot be any larger than the maximum degree plus one, and so we mark this point on the figures. We always look at the smaller side of the cut, so no community can be larger than half the vertices of the graph. We also mark this location on the plots. Each subsequent figure in this paper utilizes this size-vs-conductance plot, and we will continually layer information from new methods above results from old methods. The results are information-dense plots that need slightly more study than would be ideal; however, we point out the salient features in each plot in the text. Note also that we deliberately attempt to preserve the axis limits across figures to promote comparisons. However, some figures have different axis limits to exhibit the range of data.
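The step "identify the best community at each size" amounts to a minimum over groups. A minimal sketch follows, assuming the input is a list of (size, conductance) pairs; the function name `community_profile` and the sample pairs are hypothetical, and the plotting on a log-log scale is left out.

```python
def community_profile(sized_scores):
    """Return sorted (size, best conductance) pairs from (size, conductance) pairs."""
    best = {}
    for size, phi in sized_scores:
        # Keep the smallest conductance seen for each community size.
        if size not in best or phi < best[size]:
            best[size] = phi
    return sorted(best.items())

# Hypothetical scores for five candidate communities.
pairs = [(3, 0.5), (3, 0.2), (4, 0.7), (2, 1.0), (4, 0.9)]
profile = community_profile(pairs)
```

The returned list is what would be drawn as the black curve: one point per community size, with size on the horizontal axis and the minimum conductance on the vertical axis.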
[Figure 2: six network community plots, with panels web-Google, itdk0304, fb-A-oneyear, arxiv, soc-LiveJournal1, and ca-AstroPh. Horizontal axis: number of vertices in cluster. The points maxdeg and verts/2 are marked on the plots.]
Figure 2: The best neighborhood community conductance at each size (black) and the Fiedler community (red). (Note the axis limits on ca-AstroPh).
First, we show these network community plots for six of the networks in Figure 2. These figures are representative of the best and worst of our results. Plots for other graphs are available on the website given in the introduction.
The three graphs on the left show cases where a neighborhood community is or is near the best Fiedler community (the red circle). The three graphs on the right highlight instances where the Fiedler community is much better than any neighborhood community. We find it mildly surprising that these neighborhood communities can be as good as the Fiedler community. The structure of the plot for both fb-A-oneyear and soc-LiveJournal1 is instructive. Neighborhoods of the highest degree vertices are not community-like, suggesting that these nodes are somehow exceptional. In fact, by inspection of these communities, many of them are nearly a star graph. However, a few of the large degree nodes define strikingly good communities (these are sets with a few hundred vertices with conductance scores of around $10^{-2}$). This evidence concurs with the intuition from Theorem 4.6.
6.3 Comparison to PPR communities
Note that these plots show the same shape as observed by Leskovec et al. [24]. Consequently, in the next set of figures, and in the remainder of the empirical investigation,
Community Size
Minimum conductance for any community of the given size
Approximate canonical shape found by Leskovec, Lang, Dasgupta, and Mahoney.
Holds for a variety of approximations to conductance.
Empirical Evaluation using Network Community Profiles
Community Size (Degree + 1)
Minimum conductance for any neighborhood of the given size
“Egonet community profile” shows the same shape, 3 secs to compute.
1.1M verts, 4M edges
The Fiedler community computed from the normalized Laplacian is a neighborhood!
Facebook data from Wilson et al. 2009
Not just one graph
arXiv – 86k verts, 500k edges
soc-LiveJournal – 5M verts, 42M edges
15 more graphs available: www.cs.purdue.edu/~dgleich/codes/neighborhoods
Filling in the Network Community Profile
large global clustering coe�cients and large mean clusteringcoe�cients.
Social networks. The nodes are people again, and theedges are either explicit “friend” relationships (fb-Penn94 [36],fb-A [39], soc-LiveJournal [4]) or observed network activityover edges in a one-year span (fb-A-oneyear [39]).
Technological networks. The nodes act in a commu-nication network either as agents (p2p-Gnutella25 [27]) or asrouters (oregon2 [23], as-22july06 [29], itdk0304 [10]). Theedges are observed communications between the nodes.
Web graphs. The nodes are web-pages, and the edgesare symmetrized links between the pages [25].
6. EMPIRICAL NEIGHBORHOODCOMMUNITIES
6.1 ComputationWe first show that we can adapt any procedure to compute
all local clustering coe�cients to compute the conductancescores for each neighborhood in the graph. Most of the workto compute a local clustering coe�cient is performed whenfinding the number of triangles at the vertex. We can expressthe number of triangles with v as:
edges(N1
(v) \ {v})/2
because each edge among v’s neighbors produces a triangle(recall that the edges function double-counts). Note alsothat edges(N
1
(v) \ {v})/2 = edges(N1
(v))/2 � dv. Thencut(N
1
(v)) = vol(N1
(v)) � edges(N1
(v)). And so, giventhe number of triangles, we can compute the cut given thevolume of the neighborhood as well. This is easy to do withany graph structure that explicitly stores the degrees.
6.2 Quality of neighborhood communitiesWe use Leskovec et al.’s [24] network community plot to
show the information on all neighborhood communities si-multaneously. These plots will help us understand if theneighborhood communities are high quality (low conduc-tance), and how they compare to other community detectionmethods. Given the conductance scores from all the neigh-borhood communities and their size in terms of number ofvertices, we first identify the best community at each size.The network community plot shows the relationship betweenbest community conductance and community size on a log-log scale. In Leskovec et al., they found that these plots hada characteristic shape for modern information networks: aninitial sharp decrease until the community size is between100 and 1000, then a considerable rise in the conductancescores for larger communities. In our case, neighborhoodcommunities cannot be any larger than the maximum degreeplus one, and so we mark this point on the figures. We alwayslook at the smaller side of the cut, so no community canbe larger than half the vertices of the graph. We also markthis location on the plots. Each subsequent figure in thispaper utilizes this size-vs-conductance plot, and we will con-tinually layer information from new methods above resultsfrom old methods. The result are information-dense plotsthat need slightly more study than would be ideal, however,we point out the salient features in each plot in the text.Note also that we deliberately attempt to preserve the axeslimits across figures to promote comparisons. However, somefigures have di↵erent axis limits to exhibit the range of data.
web-Google itdk0304
100
101
102
103
104
105
10!4
10!3
10!2
10!1
100
maxdeg
100
101
102
103
104
105
10!4
10!3
10!2
10!1
100
maxdeg
100
101
102
103
104
105
10!4
10!3
10!2
10!1
100
maxdeg
ver t s2
100
101
102
103
104
105
10!4
10!3
10!2
10!1
100
maxdeg
ver t s2
fb-A-oneyear arxiv
100
101
102
103
104
105
10!4
10!3
10!2
10!1
100
maxdeg
100
101
102
103
104
105
10!4
10!3
10!2
10!1
100
maxdeg
100
101
102
103
104
105
10!4
10!3
10!2
10!1
100
maxdeg
ver t s2
100
101
102
103
104
105
10!4
10!3
10!2
10!1
100
maxdeg
ver t s2
soc-LiveJournal1 ca-AstroPh
100
101
102
103
104
105
10!4
10!3
10!2
10!1
100
maxdeg
100
101
102
103
104
105
10!4
10!3
10!2
10!1
100
maxdeg
100
101
102
103
104
105
10!2
10!1
100
maxdeg
ver t s2
100
101
102
103
104
105
10!2
10!1
100
maxdeg
ver t s2
Number of vertices in cluster Number of vertices in cluster
Figure 2: The best neighborhood community con-
ductance at each size (black) and the Fiedler com-
munity (red). (Note the axis limits on ca-AstroPh).
First, we show these network community plots for six of the networks in Figure 2. These figures are representative of the best and worst of our results. Plots for other graphs are available on the website given in the introduction.
The three graphs on the left show cases where a neighborhood community is, or is close to, the best Fiedler community (the red circle). The three graphs on the right highlight instances where the Fiedler community is much better than any neighborhood community. We find it mildly surprising that these neighborhood communities can be as good as the Fiedler community. The structure of the plot for both fb-A-oneyear and soc-LiveJournal1 is instructive. Neighborhoods of the highest degree vertices are not community-like, suggesting that these nodes are somehow exceptional. In fact, by inspection of these communities, many of them are nearly star graphs. However, a few of the large degree nodes define strikingly good communities (these are sets with a few hundred vertices with conductance scores of around $10^{-2}$). This evidence concurs with the intuition from Theorem 4.6.
6.3 Comparison to PPR communities
Note that these plots show the same shape as observed by Leskovec et al. [24]. Consequently, in the next set of figures, and in the remainder of the empirical investigation,
Minimum conductance for any community neighborhood of the given size
Community Size (Degree + 1)
We are missing a region of the NCP when we just look at neighborhoods.
Facebook Sample - 1.1M verts, 4M edges
[Plot omitted: minimum conductance ($10^{-4}$ to $10^0$) vs. community size ($10^0$ to $10^5$), log-log axes, with maxdeg marked.]
Filling in the Network Community Profile
large global clustering coefficients and large mean clustering coefficients.
Social networks. The nodes are people again, and the edges are either explicit “friend” relationships (fb-Penn94 [36], fb-A [39], soc-LiveJournal [4]) or observed network activity over edges in a one-year span (fb-A-oneyear [39]).
Technological networks. The nodes act in a communication network either as agents (p2p-Gnutella25 [27]) or as routers (oregon2 [23], as-22july06 [29], itdk0304 [10]). The edges are observed communications between the nodes.
Web graphs. The nodes are web pages, and the edges are symmetrized links between the pages [25].
6. EMPIRICAL NEIGHBORHOOD COMMUNITIES
6.1 Computation
We first show that we can adapt any procedure to compute all local clustering coefficients to compute the conductance scores for each neighborhood in the graph. Most of the work to compute a local clustering coefficient is performed when finding the number of triangles at the vertex. We can express the number of triangles with v as:
$\mathrm{edges}(N_1(v) \setminus \{v\})/2$
because each edge among v's neighbors produces a triangle (recall that the edges function double-counts). Note also that $\mathrm{edges}(N_1(v) \setminus \{v\})/2 = \mathrm{edges}(N_1(v))/2 - d_v$. Then $\mathrm{cut}(N_1(v)) = \mathrm{vol}(N_1(v)) - \mathrm{edges}(N_1(v))$. And so, given the number of triangles, we can compute the cut given the volume of the neighborhood as well. This is easy to do with any graph structure that explicitly stores the degrees.
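The identities above can be turned into a short routine. The following is a minimal sketch, assuming an undirected simple graph stored as a dict of adjacency lists; the function name and the convention of returning infinity when the neighborhood spans the whole graph are illustrative choices, not the paper's implementation.

```python
def neighborhood_conductances(adj):
    """Conductance of N_1(v) (v plus its neighbors) for every vertex v.

    `adj` maps each vertex to a list of neighbors (undirected, no
    self-loops); this representation is an assumption for illustration.
    """
    deg = {v: len(nbrs) for v, nbrs in adj.items()}
    total_vol = sum(deg.values())
    scores = {}
    for v, nbrs in adj.items():
        nbr_set = set(nbrs)
        # Edges among v's neighbors; each one closes a triangle at v.
        # The double loop sees every such edge twice, hence the // 2.
        triangles = sum(1 for u in nbrs for w in adj[u] if w in nbr_set) // 2
        vol = deg[v] + sum(deg[u] for u in nbrs)
        # edges(N_1(v)) double-counts: 2 * (triangles + d_v).
        edges_double = 2 * (triangles + deg[v])
        cut = vol - edges_double
        denom = min(vol, total_vol - vol)
        scores[v] = cut / denom if denom > 0 else float("inf")
    return scores


# Two triangles joined by the single edge 2-3.
example = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
scores = neighborhood_conductances(example)
```

On this toy graph the neighborhood of vertex 0 is one of the triangles, with cut 1 and volume 7, while the neighborhood of vertex 2 straddles the bridge and scores worse.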
Minimum conductance for any community of the given size
Community Size
This region fills when using the PPR method (like now!)
7807 seconds
Facebook Sample - 1.1M verts, 4M edges
Am I a good seed? Locally Minimal Communities
“My conductance is the best locally.”
$\phi(N(v)) \le \phi(N(w))$ for all w adjacent to v
In Zachary’s Karate Club network, there are four locally minimal communities: the two leaders and two peripheral nodes.
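The locally minimal test can be sketched directly from the inequality on this slide. This is a minimal illustration, assuming the graph is a dict of adjacency lists and that N(v) means v together with its neighbors; the helper names are hypothetical, and the example graph is a toy (two triangles joined by an edge), not the Karate Club network.

```python
def conductance(adj, members):
    """cut(S) / min(vol(S), vol(complement of S)) for a vertex set S."""
    S = set(members)
    vol = sum(len(adj[u]) for u in S)
    total_vol = sum(len(nbrs) for nbrs in adj.values())
    cut = sum(1 for u in S for w in adj[u] if w not in S)
    denom = min(vol, total_vol - vol)
    return cut / denom if denom > 0 else float("inf")


def locally_minimal_seeds(adj):
    """Vertices whose egonet conductance is <= that of every neighbor's egonet."""
    phi = {v: conductance(adj, [v, *adj[v]]) for v in adj}
    return [v for v in adj if all(phi[v] <= phi[w] for w in adj[v])]


toy = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
seeds = locally_minimal_seeds(toy)
```

In the toy graph the two bridge endpoints (2 and 3) are not locally minimal, because their egonets cut through a triangle, while the remaining vertices are.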
Locally minimal communities capture extremal neighborhoods
Red dots are the conductance and size of a locally minimal community.
Usually about 1% of the number of vertices.
The red circles, the best local minima, find the extremes in the egonet profile.
Community Size
Facebook Sample - 1.1M verts, 4M edges
Filling in the NCP: Growing locally minimal communities
PPR growing only locally minimal communities, seeded from the entire egonet
Original Egonet: 3 seconds
Locally min NCP: 283 seconds
Full NCP: 7807 seconds
Community Size
But there’s a small problem. Most people want to cover a network with communities! We just looked at the best.
The coverage of egonet-grown communities is really bad.
[Plot omitted, log-log scale: maximum conductance (vertical axis, $10^{-2}$ to $10^0$) vs. coverage (horizontal axis, $10^{-3}$ to $10^0$). Annotated points: 0.7% at q = 0.10, 2.2% at q = 0.25, 29.8% at q = 0.88.]
Facebook network: with a conductance of 0.1 (not so good) we cover only about 1% of the vertices in the network.
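A coverage curve of this kind can be sketched as follows. This is an assumption about how such a plot might be built, not the authors' code: communities are sorted by conductance, and after each one is added we record the fraction of vertices covered so far against the largest conductance admitted. The function name and the `(conductance, members)` input layout are hypothetical.

```python
def coverage_curve(communities, n_vertices):
    """Return (coverage fraction, max conductance) pairs.

    `communities` is a list of (conductance, set of vertices) pairs;
    this layout is an assumption made for illustration.
    """
    covered = set()
    curve = []
    # Admit communities from best (lowest conductance) to worst.
    for phi, members in sorted(communities, key=lambda c: c[0]):
        covered |= set(members)
        curve.append((len(covered) / n_vertices, phi))
    return curve


curve = coverage_curve(
    [(0.5, {3, 4, 5, 6}), (0.1, {0, 1}), (0.25, {1, 2})], n_vertices=10
)
```

Reading the curve at a conductance threshold then answers the question on the slide: how much of the network is covered by communities at least that good.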
Whang-Gleich-Dhillon, CIKM2013 [upcoming…]
1. Extract part of the graph that might have overlapping communities.
2. Compute a partitioning of the network into many pieces (think sqrt(n)) using Graclus.
3. Find the center of these partitions.
4. Use PPR to grow egonets of these centers.
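Step 4 can be sketched with a push-style approximate PPR followed by a sweep cut, in the spirit of the Andersen et al. approach cited earlier in the talk. This is a hedged illustration, not the code behind the slides: the function names, the adjacency-dict representation, and the parameter values are assumptions. It follows the talk's convention that alpha is the probability of following an edge and 1 - alpha the probability of restarting at the seeds.

```python
from collections import deque


def ppr_push(adj, seeds, alpha=0.85, eps=1e-6):
    """Approximate PPR restarting uniformly on `seeds` (push procedure)."""
    x = {}                                      # approximate PPR values
    r = {s: 1.0 / len(seeds) for s in seeds}    # residual mass
    queue = deque(seeds)
    while queue:
        u = queue.popleft()
        ru, du = r.get(u, 0.0), len(adj[u])
        if ru < eps * du:
            continue
        x[u] = x.get(u, 0.0) + (1 - alpha) * ru  # settle the restart share
        r[u] = 0.0
        share = alpha * ru / du                  # spread the rest to neighbors
        for v in adj[u]:
            old = r.get(v, 0.0)
            r[v] = old + share
            # Enqueue v only when it first crosses its threshold.
            if old < eps * len(adj[v]) <= r[v]:
                queue.append(v)
    return x


def sweep_cut(adj, x):
    """Best-conductance prefix of vertices ordered by x[u] / degree(u)."""
    total_vol = sum(len(nbrs) for nbrs in adj.values())
    order = sorted(x, key=lambda u: x[u] / len(adj[u]), reverse=True)
    in_set, vol, cut = set(), 0, 0
    best_phi, best_set = float("inf"), []
    for i, u in enumerate(order):
        inside = sum(1 for v in adj[u] if v in in_set)
        vol += len(adj[u])
        cut += len(adj[u]) - 2 * inside
        in_set.add(u)
        denom = min(vol, total_vol - vol)
        if denom > 0 and cut / denom < best_phi:
            best_phi, best_set = cut / denom, order[: i + 1]
    return best_phi, best_set


graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
egonet = [0, *graph[0]]                          # seed with the whole egonet
phi, community = sweep_cut(graph, ppr_push(graph, egonet))
```

On this toy graph (two triangles joined by one edge), seeding with the egonet of vertex 0 recovers the triangle containing it. Each seed is expanded independently, so many centers can be processed in any order.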
Table 4: Returned number of clusters and graph coverage of each algorithm
Graph        Measure          random   egonet   graclus ctr.  spread hubs  demon     bigclam
HepPh        coverage (%)     97.1     72.1     100           100          88.8      62.1
             no. of clusters  97       241      109           100          5,138     100
AstroPh      coverage (%)     97.6     71.1     100           100          94.2      62.3
             no. of clusters  192      282      256           212          8,282     200
CondMat      coverage (%)     92.4     99.5     100           100          91.2      79.5
             no. of clusters  199      687      257           202          10,547    200
DBLP         coverage (%)     99.9     86.3     100           100          84.9      94.6
             no. of clusters  21,272   8,643    18,477        26,503       174,627   25,000
Amazon       coverage (%)     99.9     100      100           100          79.2      99.2
             no. of clusters  21,553   14,919   20,036        27,763       105,828   25,000
Flickr       coverage (%)     76.0     54.0     100           93.6         -         52.1
             no. of clusters  14,638   24,150   16,347        15,349       -         15,000
LiveJournal  coverage (%)     88.9     66.7     99.8          99.8         -         43.9
             no. of clusters  14,850   34,389   16,271        15,058       -         15,000
Myspace      coverage (%)     91.4     69.1     100           99.9         -         -
             no. of clusters  14,909   67,126   16,366        15,324       -         -
[Figure 2 plots omitted. Panels: (a) AstroPh, (b) HepPh, (c) CondMat, (d) Flickr, (e) LiveJournal, (f) Myspace. Horizontal axis: coverage (percentage), 0 to 100; vertical axis: maximum conductance. Legend: egonet, graclus centers, spread hubs, random, demon, bigclam (no demon curve in panels (d) and (e); neither demon nor bigclam in panel (f)).]
Figure 2: Conductance vs. graph coverage. A lower curve indicates better communities. Overall, “graclus centers” outperforms other seeding strategies, including the state-of-the-art methods Demon and Bigclam.
A good partitioning helps!
Flickr social network: 2M vertices, 22M edges
We can cover 95% of the network with communities of conductance ~0.15.
flickr sample - 2M verts, 22M edges
[Figure 3 plots omitted. Panels: DBLP (vertical axis 0.1 to 0.24) and Amazon (vertical axis 0.15 to 0.6), with bars for F1 and F2. Legend: demon, bigclam, graclus centers, spread hubs, random, egonet.]
Figure 3: F1 and F2 measures comparing our algorithmic communities to ground truth. A higher bar indicates better communities.
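For readers unfamiliar with the F1 and F2 measures, the standard F-beta score between a detected community and a ground-truth community can be sketched as below. The per-pair formula is the usual one; the aggregation in `mean_best_match` (average over ground-truth sets of the best-matching detected set) is only an assumption for illustration, since the excerpt does not spell out how the scores in Figure 3 were averaged. Both function names are hypothetical.

```python
def f_beta(detected, truth, beta=1.0):
    """F-beta score between two vertex sets (beta=1 gives F1, beta=2 gives F2)."""
    detected, truth = set(detected), set(truth)
    overlap = len(detected & truth)
    if overlap == 0:
        return 0.0
    precision = overlap / len(detected)
    recall = overlap / len(truth)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def mean_best_match(detected_sets, truth_sets, beta=1.0):
    """Average, over ground-truth sets, of the best F-beta among detected sets.

    This aggregation is an assumed convention, not taken from the source.
    """
    best = [max(f_beta(d, t, beta) for d in detected_sets) for t in truth_sets]
    return sum(best) / len(best)


f1 = f_beta({1, 2, 3, 4}, {3, 4, 5}, beta=1.0)
f2 = f_beta({1, 2, 3, 4}, {3, 4, 5}, beta=2.0)
```

F2 weights recall more heavily than precision, so it is more forgiving of a detected community that is larger than the ground-truth one.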
[Figure 4 plot omitted. Run time in hours (0 to 8) on Amazon and DBLP. Legend: demon, bigclam, graclus centers, spread hubs, random, egonet.]
Figure 4: Runtime on Amazon and DBLP. The seed set expansion algorithm is faster than Demon and Bigclam.
5.4 Comparison of running times
Finally, we compare the algorithms by runtime. Figure 4 and Table 5 show the runtime of each algorithm. We run the single-threaded version of Bigclam for the HepPh, AstroPh, CondMat, DBLP, and Amazon networks, and use the multi-threaded version with 20 threads for the Flickr, Myspace, and LiveJournal networks.
As can be seen in Figure 4, the seed set expansion methods are much faster than Demon and Bigclam on the DBLP and Amazon networks. On small networks (HepPh, AstroPh, CondMat), our algorithm with “spread hubs” is faster than Demon and Bigclam. On large networks (Flickr, LiveJournal, Myspace), our seed set expansion methods are much faster than Bigclam even though we compare a single-threaded implementation of our method with 20 threads for Bigclam.
6. DISCUSSION AND CONCLUSION
We now discuss the results from our experimental investigations. First, we note that our seed set expansion method was the only method that worked on all of the problems. Also, our method is faster than both Bigclam and Demon.
Our seed set expansion algorithm is also easy to parallelize because each seed can be expanded independently. This property indicates that the runtime of the seed set expansion method could be further reduced in a multi-threaded version. Also, we can use any other high quality partitioning scheme instead of Graclus, including those with parallel and distributed implementations [25]. Perhaps surprisingly, the major difference in cost between using Graclus centers for the seeds and the other seed choices does not result from the expense of running Graclus. Rather, it arises because the personalized PageRank expansion technique takes longer for the seeds chosen by Graclus and spread hubs. When the PageRank expansion method has a larger input set, it tends to take longer, and the input sets we provide for the spread hubs and Graclus seeding strategies are the neighborhood sets of high degree vertices.
Another finding that emerges from our results is that using random seeds outperforms both Bigclam and Demon. We believe there are two reasons for this finding. First, random seeds are likely to be in some set of reasonable conductance, as also discussed by Andersen and Lang [5]. Second, and importantly, a recent study by Abrahao et al. [2] showed that personalized PageRank clusters are topologically similar to real-world clusters. Any method that uses this technique will find clusters that look real.
Finally, we wish to address the relationship between our results and some prior observations on overlapping communities. The authors of Bigclam found that the dense regions of a graph reflect areas of overlap between overlapping communities. By using a conductance measure, we ought to find only these dense regions; however, our method produces much larger communities that cover the entire graph. The reason for this difference is that we use the entire vertex neighborhood as the restart for the personalized PageRank expansion routine. We avoid seeding exclusively inside a dense region by using an entire vertex neighborhood as a seed, which grows the set beyond the dense region. Thus, the communities we find likely capture a combination of communities given by the egonet of the original seed node. To expand on this point, in experiments we omit due to space, we found that seeding solely on the node itself – rather than us-
Using datasets from Yang and Leskovec (WSDM 2013) with known overlapping community structure
Our method outperforms current state-of-the-art overlapping community detection methods.
Even randomly seeded!
And helps to find real-world overlapping communities too.
Conclusion & Discussion
PPR community detection is fast [Andersen et al. FOCS06]
PPR communities look real [Abrahao et al. KDD2012; Zhu et al. ICML2013]
Egonet analysis reveals basis of NCP
Partitioning for seeding yields high coverage & real communities.
“Caveman” communities?
References
Gleich & Seshadhri KDD2012
Whang, Gleich & Dhillon CIKM2013
PPR Sample: bit.ly/18khzO5
Egonet seeding: bit.ly/dgleich-code
Best conductance cut at intersection of communities?
Proof Sketch
1) Large clustering coefficient ⇒ many wedges are closed
2) Heavy-tailed degree distribution ⇒ a few vertices have a very large degree
3) Large degree ⇒ $O(d^2)$ wedges ⇒ “most” of the wedges
Thus, there must exist a vertex with a high edge density ⇒ “good” conductance.
Use the probabilistic method to formalize.
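Step 3 of the sketch rests on the fact that a vertex of degree d is the center of d(d-1)/2 wedges, so a few high-degree vertices can account for most wedges. A minimal illustration, assuming only a list of vertex degrees as input (the function name and the toy degree sequence are hypothetical):

```python
from collections import Counter


def wedge_cdf_by_degree(degrees):
    """Cumulative fraction of wedges centered at vertices of degree <= d."""
    wedges = Counter()
    for d in degrees:
        wedges[d] += d * (d - 1) // 2   # wedges centered at a degree-d vertex
    total = sum(wedges.values())
    cdf, running = [], 0
    for d in sorted(wedges):
        running += wedges[d]
        cdf.append((d, running / total if total else 0.0))
    return cdf


# Four low-degree vertices and one hub of degree 10.
cdf = wedge_cdf_by_degree([1, 1, 2, 2, 10])
```

In this toy sequence the single hub holds 45 of the 47 wedges, which is the kind of concentration the CDF plot on this slide is meant to convey.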
[Plot omitted: CDF of number of wedges (0 to 1) vs. degree ($10^0$ to $10^4$, log scale).]