Detecting Community Structure in Network


Seung Woo Son

KAIST

2004 summer intensive studies on complex networks

2004. 8. 11.

http://cnrl.snu.ac.kr/index.php?display=ExhaustiveListOfNetworkPapers/SummerStudy2004plan



http://cnrl.snu.ac.kr/

Clustering of data

- Partitional clustering methods
- Important technique in data analysis
- Divide the data according to natural classes
- Pattern recognition, learning, astrophysics, and network analysis

$N$ multivariable data points $x_i$, $i = 1, 2, \ldots, N$, in a $D$-dimensional vector space with a metric.

On network

- $N$ vertices (nodes)
- No prior information
- Only know the edge (link) connectivity: structural information

How can we divide the network into several parts? = How can we find the “community” structure?

Applications: web pages on the same topic, hidden social relationships, distributing processes to processors in a parallel computer, etc.

Community, cluster

- Functional modules in cellular and genetic networks
  P. Holme, M. Huss, and H. Jeong, Bioinformatics 19, 532 (2003).
  D. Wilkinson and B. A. Huberman, Proc. Natl. Acad. Sci. USA 10.1073/pnas.0307740100 (2004).
  A. Vespignani, Nature Genetics 35, 118 (2003).
- Cultural society or important source of a person’s identity in social networks
  J. Scott, Social Network Analysis: A Handbook, Sage Publications, 2nd ed. (2000).
- A bundle of web pages on common topics, etc.
- Community, module, (cohesive) subgroup, cluster, clique, etc.
- Computer science, mathematics, sociology, biology, and physics all meet in this community-finding problem.

Structural definition of community

Groups of vertices within which connections are dense, but between which connections are sparser. The definition is purely structural because we don’t have any prior information about the network.

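Modularity

$$Q = \sum_i \Big[ e_{ii} - \sum_{jk} e_{ij} e_{ki} \Big] = \mathrm{Tr}\, e - \lVert e^2 \rVert$$

$e_{ij}$: the fraction of the edges in the original network that connect vertices in group $i$ to those in group $j$. $\lVert x \rVert$ denotes the sum of all elements of the matrix $x$.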
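A minimal sketch of how the modularity can be evaluated for a given division, assuming an undirected, unweighted graph given as a list of edges. It uses the equivalent form $Q = \sum_i (e_{ii} - a_i^2)$ with $a_i = \sum_j e_{ij}$, and splits each between-group edge half and half between $e_{ij}$ and $e_{ji}$. The function name `modularity` and the small test graph are illustrative assumptions, not part of the original slides.

```python
from collections import defaultdict

def modularity(edges, community):
    """Q = sum_i (e_ii - a_i^2) for an undirected, unweighted graph.

    edges: list of (u, v) pairs, each undirected edge listed once.
    community: dict mapping vertex -> group label.
    """
    m = len(edges)
    e_within = defaultdict(float)  # e_ii: fraction of edges inside group i
    a = defaultdict(float)         # a_i: fraction of edge ends attached to group i
    for u, v in edges:
        cu, cv = community[u], community[v]
        if cu == cv:
            e_within[cu] += 1.0 / m
        a[cu] += 0.5 / m
        a[cv] += 0.5 / m
    return sum(e_within[c] - a[c] ** 2 for c in a)

# Hypothetical example: a small tree with edges 1-3, 2-3, 3-5, 4-5.
edges = [(1, 3), (2, 3), (3, 5), (4, 5)]
q_split = modularity(edges, {1: "A", 2: "A", 3: "A", 4: "B", 5: "B"})
q_single = modularity(edges, {v: "A" for v in (1, 2, 3, 4, 5)})
print(q_split, q_single)
```

Putting every vertex in one group gives Q = 0, so a positive value indicates more within-group edges than expected at random.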

Key points (Highlight)

- What property or measure of the network is used in this algorithm or method? Eigenvalues and eigenvectors, spectrum of the adjacency matrix; edge betweenness, information centrality; distance, dissimilarity index, edge clustering coefficient, etc.
- Agglomerative or divisive?
- What prior information is required? Whether there is community structure or not; how many modules there are.
- Performance of the partitioning results and computational complexity.

We will review about 11 different methods that have been studied recently. If you get bored, ask me a question.

Physical meaning?

1. Spectral bisection (old one)

M. Fiedler, Czech. Math. J. 23, 298 (1973)
A. Pothen, H. Simon, and K.-P. Liou, SIAM J. Matrix Anal. Appl. 11, 430 (1990)
F. R. K. Chung, Spectral Graph Theory, Amer. Math. Soc. (1997)
http://www.cs.berkeley.edu/~demmel/cs267/lecture20/lecture20.html

$L = D - A$: Laplacian $L$ of an $n$-vertex undirected graph $G$

- D is the diagonal matrix of vertex degree k.

- A is the adjacency matrix.

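Example 1: five vertices with edges 1-3, 2-3, and 4-5 (two disconnected components).

$$L = \begin{pmatrix} 1 & 0 & -1 & 0 & 0 \\ 0 & 1 & -1 & 0 & 0 \\ -1 & -1 & 2 & 0 & 0 \\ 0 & 0 & 0 & 1 & -1 \\ 0 & 0 & 0 & -1 & 1 \end{pmatrix}$$

Eigenvectors (columns of $V$, in order of increasing eigenvalue; each column is defined up to an overall sign):

$$V = \begin{pmatrix} \frac{1}{\sqrt{3}} & 0 & \frac{1}{\sqrt{2}} & 0 & \frac{1}{\sqrt{6}} \\ \frac{1}{\sqrt{3}} & 0 & -\frac{1}{\sqrt{2}} & 0 & \frac{1}{\sqrt{6}} \\ \frac{1}{\sqrt{3}} & 0 & 0 & 0 & -\frac{2}{\sqrt{6}} \\ 0 & \frac{1}{\sqrt{2}} & 0 & \frac{1}{\sqrt{2}} & 0 \\ 0 & \frac{1}{\sqrt{2}} & 0 & -\frac{1}{\sqrt{2}} & 0 \end{pmatrix}$$

$$E = \{0, 0, 1, 2, 3\}$$

$\mathbf{1} = (1, 1, 1, 1, 1)$ is always an eigenvector with eigenvalue 0.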

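Example 2: the same five vertices with the edge 3-5 added (edges 1-3, 2-3, 3-5, 4-5), so the graph is connected.

$$L = \begin{pmatrix} 1 & 0 & -1 & 0 & 0 \\ 0 & 1 & -1 & 0 & 0 \\ -1 & -1 & 3 & 0 & -1 \\ 0 & 0 & 0 & 1 & -1 \\ 0 & 0 & -1 & -1 & 2 \end{pmatrix}$$

Eigenvectors (columns of $V$, in order of increasing eigenvalue; each column is defined up to an overall sign):

$$V = \begin{pmatrix} 0.45 & 0.42 & 0.71 & 0.24 & 0.26 \\ 0.45 & 0.42 & -0.71 & 0.24 & 0.26 \\ 0.45 & 0.20 & 0 & -0.32 & -0.81 \\ 0.45 & -0.70 & 0 & 0.54 & -0.14 \\ 0.45 & -0.34 & 0 & -0.70 & 0.44 \end{pmatrix}$$

$$E = \{0.00, 0.52, 1.00, 2.31, 4.17\}$$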

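The eigenvector corresponding to the lowest non-zero eigenvalue must have both positive and negative elements, because it is orthogonal to $\mathbf{1}$.

Algebraic connectivity (the second-lowest eigenvalue $\lambda_2$): measures how good the split is, with smaller values corresponding to better splits.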

Bisect !
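A minimal sketch of the bisection step, assuming a connected graph with vertices labelled 0..n-1. To stay dependency-free it uses plain power iteration on the shifted matrix $cI - L$ with the constant vector projected out, which is a simple stand-in for a proper sparse eigensolver (it is not the Lanczos method). The function name `fiedler_vector` and the iteration count are illustrative assumptions.

```python
def fiedler_vector(n, edges, iters=2000):
    """Approximate the eigenvector of L = D - A for the second-lowest eigenvalue.

    Power iteration on (c*I - L) restricted to vectors orthogonal to (1,...,1).
    Assumes a connected graph with vertices 0..n-1.
    """
    deg = [0] * n
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    c = 2 * max(deg)                    # upper bound on the largest eigenvalue of L
    x = [float(i) for i in range(n)]    # arbitrary non-constant start vector
    for _ in range(iters):
        mean = sum(x) / n
        x = [xi - mean for xi in x]     # remove the eigenvalue-0 component
        lx = [deg[i] * x[i] for i in range(n)]
        for u, v in edges:              # lx = L x, using L = D - A
            lx[u] -= x[v]
            lx[v] -= x[u]
        y = [c * x[i] - lx[i] for i in range(n)]
        norm = sum(yi * yi for yi in y) ** 0.5
        x = [yi / norm for yi in y]
    return x

# Connected five-vertex example (edges 1-3, 2-3, 3-5, 4-5), relabelled 0..4.
edges = [(0, 2), (1, 2), (2, 4), (3, 4)]
x = fiedler_vector(5, edges)
group_a = sorted(i for i in range(5) if x[i] > 0)
group_b = sorted(i for i in range(5) if x[i] <= 0)
print(group_a, group_b)
```

The two groups are read off from the signs of the vector elements; which group ends up positive depends only on the arbitrary overall sign of the eigenvector.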

The spectral bisection method is reasonably fast.

For a general $n \times n$ matrix, the time complexity is $O(n^3)$. However, for a sparse matrix, the Lanczos method reduces it to approximately $m / (\lambda_3 - \lambda_2)$, where $m$ is the number of edges.

G. H. Golub and C. F. Van Loan, Matrix Computations, Johns Hopkins University Press, Baltimore, MD (1989)

2. The Kernighan-Lin (KL) algorithm

B. W. Kernighan and S. Lin, Bell System Technical Journal 49, 291 (1970)
http://www.cs.berkeley.edu/~demmel/cs267/lecture18/lecture18.html

Benefit function Q

The number of edges that lie within the two groups minus the number that lie between them.

1. Specify the sizes of the two groups A and B: N(A), N(B).
2. Calculate ∆Q for all possible exchange pairs from A and B.
3. Choose the pair that maximizes the change of Q. (greedy algorithm)
4. Repeat 2 & 3 until all vertices have been swapped once. (Any vertex that has been swapped is never swapped again.)
5. Go back over the sequence of swaps and find the state with the highest Q.

Bisect !

- This algorithm requires the sizes of the groups to be known a priori.

- It runs moderately quickly, in worst-case time $O(n^2)$. However, if we don’t know the sizes, the cost increases to $O(n^3)$.

- If the sizes are left free, the best values of Q are always achieved for very asymmetric, trivial divisions.
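The swap procedure above can be sketched as follows, under simplifying assumptions: the graph is a plain edge list, the initial division is given as a dict with labels "A" and "B", and Q is recomputed from scratch for every candidate swap. That keeps the code short but makes it slower than the complexity quoted above. The names `benefit` and `kernighan_lin_pass` are illustrative.

```python
def benefit(edges, side):
    """Q = (# edges inside the two groups) - (# edges between them)."""
    return sum(1 if side[u] == side[v] else -1 for u, v in edges)

def kernighan_lin_pass(edges, side):
    """One KL pass: greedy swaps, each vertex swapped at most once, best state kept."""
    side = dict(side)
    best_q, best_side = benefit(edges, side), dict(side)
    free_a = [v for v in side if side[v] == "A"]
    free_b = [v for v in side if side[v] == "B"]
    while free_a and free_b:
        # Evaluate Q after every still-allowed exchange of one A vertex with one B vertex.
        candidates = []
        for a in free_a:
            for b in free_b:
                side[a], side[b] = "B", "A"
                candidates.append((benefit(edges, side), a, b))
                side[a], side[b] = "A", "B"
        q, a, b = max(candidates, key=lambda t: t[0])
        side[a], side[b] = "B", "A"     # commit the best swap, even if Q decreases
        free_a.remove(a)
        free_b.remove(b)
        if q > best_q:
            best_q, best_side = q, dict(side)
    return best_q, best_side

# Hypothetical example: two triangles joined by the edge 2-3, with a poor starting split.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
start = {0: "A", 1: "A", 3: "A", 2: "B", 4: "B", 5: "B"}
best_q, best_side = kernighan_lin_pass(edges, start)
print(best_q, best_side)
```

Group sizes never change during the pass, which is why the sizes must be fixed in advance.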

3. Newman fast algorithm

M. E. J. Newman, cond-mat/0309508 (PRE in press)

Modularity

$$Q = \sum_i \Big[ e_{ii} - \sum_{jk} e_{ij} e_{ki} \Big] = \mathrm{Tr}\, e - \lVert e^2 \rVert$$

Maximize Q by a greedy algorithm!

Generally, the number of ways to divide $n$ vertices into $g$ non-empty groups is given by the Stirling number of the second kind $S(n,g)$, and hence the number of distinct community divisions is $\sum_{g=1}^{n} S(n,g)$, which grows at least as fast as $2^n$.

Agglomerative hierarchical clustering method!

1. Start with each of the $n$ vertices as its own community.

2. Calculate the increase of Q for all possible community pairs.

3. Choose the merge that gives the greatest increase in Q.

4. Repeat 2 & 3 until the modularity Q reaches its maximal value.

Time complexity: $O(mn)$, or $O(n^2)$ on a sparse graph.
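A minimal sketch of the greedy agglomeration, assuming vertices labelled 0..n-1 and a dense $n \times n$ matrix for $e_{ij}$. It uses the fact that merging groups $i$ and $j$ changes the modularity by $\Delta Q = 2(e_{ij} - a_i a_j)$, carries out all merges, and returns the division with the highest Q seen. The dense matrix and the function name `greedy_modularity` are simplifications chosen here for readability, so the running time is worse than the bound quoted above.

```python
def greedy_modularity(n, edges):
    """Greedy agglomeration: always merge the community pair with the largest dQ."""
    m = len(edges)
    e = [[0.0] * n for _ in range(n)]   # e[i][j]: fraction of edge ends joining i and j
    for u, v in edges:
        e[u][v] += 0.5 / m
        e[v][u] += 0.5 / m
    a = [sum(row) for row in e]
    members = {i: [i] for i in range(n)}   # every vertex starts as its own community
    q = sum(e[i][i] - a[i] ** 2 for i in range(n))
    best_q, best = q, [list(g) for g in members.values()]
    while len(members) > 1:
        # dQ for merging communities i and j is 2 * (e_ij - a_i * a_j).
        dq, i, j = max((2 * (e[i][j] - a[i] * a[j]), i, j)
                       for i in members for j in members if i < j)
        for k in range(n):              # fold row j into row i
            e[i][k] += e[j][k]
            e[j][k] = 0.0
        for k in range(n):              # fold column j into column i
            e[k][i] += e[k][j]
            e[k][j] = 0.0
        a[i] += a[j]
        a[j] = 0.0
        members[i] += members.pop(j)
        q += dq
        if q > best_q:
            best_q, best = q, [sorted(g) for g in members.values()]
    return best_q, best

# Hypothetical example: two triangles joined by the edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
best_q, best = greedy_modularity(6, edges)
print(best_q, best)
```

The sequence of merges is exactly the dendrogram of the method; cutting it where Q peaks gives the reported division.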

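4. q-state Potts method or RB method (Reichardt-Bornholdt method)

J. Reichardt and S. Bornholdt, cond-mat/0402349 (2004)

q-state Potts model on network

Hamiltonian:

$$H = -J \sum_{(i,j) \in E} \delta_{\sigma_i \sigma_j} + \gamma \sum_{s=1}^{q} \frac{n_s (n_s - 1)}{2}$$

$\sigma_i$: spin state of vertex $i$; $n_s$: number of vertices in spin state $s$.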

Nearest-neighbor ferromagnetic interaction of the Potts model: homogeneous distribution of spins.

Diversity: global anti-ferromagnetic interaction.
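A minimal sketch of how the two competing terms can be evaluated for a given spin assignment: the ferromagnetic term counts edges whose endpoints share a spin, and the diversity term counts same-spin vertex pairs. The function name `potts_energy`, the default couplings, and the choice of `gamma` equal to the average connection probability in the example are assumptions made for illustration.

```python
from collections import Counter

def potts_energy(edges, spins, J=1.0, gamma=1.0):
    """H = -J * (# edges with equal spins) + gamma * sum_s n_s (n_s - 1) / 2."""
    aligned = sum(1 for u, v in edges if spins[u] == spins[v])
    occupation = Counter(spins.values())            # n_s for each occupied spin state
    diversity = sum(n_s * (n_s - 1) / 2 for n_s in occupation.values())
    return -J * aligned + gamma * diversity

# Hypothetical example: two triangles joined by the edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
n, m = 6, len(edges)
p = 2.0 * m / (n * (n - 1))                         # average connection probability
h_communities = potts_energy(edges, {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}, gamma=p)
h_uniform = potts_energy(edges, {v: 0 for v in range(n)}, gamma=p)
h_all_different = potts_energy(edges, {v: v for v in range(n)}, gamma=p)
print(h_communities, h_uniform, h_all_different)
```

With this choice of `gamma`, both the fully aligned and the fully diverse configurations have zero energy, while the two-community assignment lies lower, which is the state a Monte Carlo search would try to reach.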

$q = N/5$ is reasonable for applications.

$$\gamma_c = p = \frac{2m}{N(N-1)}$$

Monte-Carlo heat-bath algorithm and simulated annealing

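Magnetization:

$$m = \frac{q\, n_{\max}/N - 1}{q - 1} \quad \text{with} \quad n_{\max} = \max\{n_1, n_2, \ldots, n_q\}$$

Modularity:

$$Q = \sum_{s=1}^{q} \left( e_{ss} - a_s^2 \right) \quad \text{with} \quad a_s = \sum_{i=1}^{q} e_{is}$$

Susceptibility:

$$\chi = \frac{N}{T} \left( \langle m^2 \rangle - \langle m \rangle^2 \right)$$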

N

Test network: a 128-node computer-generated network (proposed by Newman) with 4 groups of 32 nodes each and an average of 16 links per node ($z_{in} + z_{out} = 16$).

5. Hierarchical clustering

Dendrogram

1. Measure of similarity $x_{ij}$ between pairs $(i,j)$ of vertices (metric).
2. Single linkage, complete linkage, or average linkage.

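Structural equivalence: two vertices are said to be structurally equivalent if they have the same set of neighbours. The measures below quantify how many friends two vertices have in common.

Euclidean distance:

$$x_{ij} = \sqrt{\sum_{k \neq i,j} (A_{ik} - A_{jk})^2}$$

Pearson correlation:

$$x_{ij} = \frac{\frac{1}{n} \sum_{k} (A_{ik} - \mu_i)(A_{jk} - \mu_j)}{\sigma_i \sigma_j}$$

where $\mu_i$ and $\sigma_i$ are the mean and standard deviation of row $i$ of the adjacency matrix $A$.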
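A minimal sketch of the two structural-equivalence measures, assuming the graph is given as a dense 0/1 adjacency matrix stored as a list of lists. The function names are illustrative; the linkage step that builds the dendrogram from these values is not shown.

```python
def euclidean_distance(A, i, j):
    """sqrt(sum over k != i, j of (A_ik - A_jk)^2); 0 for structurally equivalent vertices."""
    return sum((A[i][k] - A[j][k]) ** 2
               for k in range(len(A)) if k not in (i, j)) ** 0.5

def pearson_correlation(A, i, j):
    """Correlation between rows i and j of A; assumes neither row is constant."""
    n = len(A)
    mu_i, mu_j = sum(A[i]) / n, sum(A[j]) / n
    sd_i = (sum((x - mu_i) ** 2 for x in A[i]) / n) ** 0.5
    sd_j = (sum((x - mu_j) ** 2 for x in A[j]) / n) ** 0.5
    cov = sum((A[i][k] - mu_i) * (A[j][k] - mu_j) for k in range(n)) / n
    return cov / (sd_i * sd_j)

# Hypothetical example: edges 0-2, 1-2, 2-4, 3-4; vertices 0 and 1 share their only neighbour.
A = [
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 0, 1],
    [0, 0, 1, 1, 0],
]
print(euclidean_distance(A, 0, 1), pearson_correlation(A, 0, 1))
print(euclidean_distance(A, 0, 3), pearson_correlation(A, 0, 3))
```

Small Euclidean distance or large correlation means high similarity, so the two measures are used with opposite orderings when merging.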

K-components: two vertices in the same community have at least k independent paths between them. The similarity is the count of edge-independent paths (max-flow) between the vertices.

Time complexity: $\max(O(mn), O(n^2 \log n))$, because of the sorting of the $n^2$ similarities.

6. Zhou dissimilarity index method

H. Zhou, Phys. Rev. E 67, 061901 (2003)
H. Zhou, Phys. Rev. E 67, 041908 (2003)

The distance dij from vertex i to vertex j is defined as the average number of steps needed for a Brownian particle on this network to move from vertex i to vertex j.

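Transfer matrix (jumping probability):

$$P_{ij} = \frac{A_{ij}}{\sum_{l=1}^{N} A_{il}} \quad \left( \sim \frac{1}{k} \right)$$

Distance:

$$d_{i,j} = \sum_{l=1}^{N} \left[ \frac{1}{I - B(j)} \right]_{il}$$

$I$ is the $N \times N$ identity matrix.

$B(j)$ is equal to $P$ except that $B_{lj}(j) = 0$ for all $l$.

Equivalently, the distances to vertex $j$ solve the linear system

$$[I - B(j)] \begin{pmatrix} d_{1,j} \\ \vdots \\ d_{N,j} \end{pmatrix} = \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix}$$

Dissimilarity index:

$$\Lambda(i,j) = \frac{\sqrt{\sum_{k \neq i,j}^{N} \left[ d_{i,k} - d_{j,k} \right]^2}}{N - 2}$$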
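A minimal sketch of the distance and dissimilarity computation, assuming a connected graph given as a dense adjacency matrix (list of lists). For each target vertex $j$ it solves the linear system $[I - B(j)]\, d = \mathbf{1}$ with a small Gauss-Jordan routine written here for self-containment; the helper names `solve`, `brownian_distances`, and `dissimilarity` are illustrative.

```python
def solve(M, b):
    """Solve M x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[r]] for r, row in enumerate(M)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(n):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [aug[r][n] / aug[r][r] for r in range(n)]

def brownian_distances(A):
    """d[i][j]: average number of steps for a random walker to go from i to j."""
    n = len(A)
    P = [[A[i][l] / sum(A[i]) for l in range(n)] for i in range(n)]
    d = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # I - B(j), where B(j) is P with column j set to zero (walks end at j).
        M = [[(1.0 if i == l else 0.0) - (0.0 if l == j else P[i][l])
              for l in range(n)] for i in range(n)]
        column = solve(M, [1.0] * n)
        for i in range(n):
            d[i][j] = column[i]
    return d

def dissimilarity(d, i, j):
    """Lambda(i, j): do i and j see the rest of the network from the same distance?"""
    n = len(d)
    total = sum((d[i][k] - d[j][k]) ** 2 for k in range(n) if k not in (i, j))
    return total ** 0.5 / (n - 2)

# Hypothetical example: the path graph 0 - 1 - 2.
A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
d = brownian_distances(A)
print(d[0][2], d[1][2], dissimilarity(d, 0, 2), dissimilarity(d, 0, 1))
```

On the path graph the two end vertices are at the same distance from the middle vertex, so their dissimilarity is zero even though they are not adjacent.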

7. Girvan-Newman (GN) algorithm

M. Girvan and M. E. J. Newman, PNAS 99, 7821 (2002)
M. E. J. Newman and M. Girvan, Phys. Rev. E 69, 026113 (2004)

The few edges that lie between communities can be thought of as forming “bottlenecks” between the communities.

Betweenness and edge betweenness:

The number of geodesic (i.e., shortest) paths between vertex pairs that run along the edge in question, summed over all vertex pairs.

Edge removal:

After calculating the betweenness of all edges in the network, remove the one with the highest betweenness.

Recalculate the betweenness after each edge removal, and repeat until the modularity Q is maximal.

Time complexity $O(m^2 n)$
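A minimal sketch of one removal step, assuming an unweighted, undirected graph stored as a dict of neighbour sets. Edge betweenness is accumulated with one breadth-first search per source vertex, splitting credit evenly among equally short paths; the function names `edge_betweenness` and `remove_top_edge` are illustrative, and the modularity-based stopping rule is left out.

```python
from collections import deque

def edge_betweenness(adj):
    """Return {frozenset({u, v}): number of shortest paths through edge u-v}."""
    score = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        dist, sigma, order = {s: 0}, {s: 1.0}, []
        queue = deque([s])
        while queue:                        # BFS: distances and shortest-path counts
            u = queue.popleft()
            order.append(u)
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    sigma[v] = 0.0
                    queue.append(v)
                if dist[v] == dist[u] + 1:
                    sigma[v] += sigma[u]
        delta = {v: 0.0 for v in order}
        for v in reversed(order):           # push path counts back towards the source
            for u in adj[v]:
                if dist.get(u) == dist[v] - 1:
                    credit = sigma[u] / sigma[v] * (1.0 + delta[v])
                    score[frozenset((u, v))] += credit
                    delta[u] += credit
    # Every vertex pair was counted once from each endpoint.
    return {edge: value / 2.0 for edge, value in score.items()}

def remove_top_edge(adj):
    """Delete the edge with the highest betweenness and return it."""
    scores = edge_betweenness(adj)
    u, v = max(scores, key=scores.get)
    adj[u].discard(v)
    adj[v].discard(u)
    return frozenset((u, v))

# Hypothetical example: two triangles joined by the edge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
scores = edge_betweenness(adj)
removed = remove_top_edge(adj)
print(scores[frozenset((2, 3))], removed)
```

All nine shortest paths between the two triangles run through the bridging edge, so it is removed first and the graph splits into the two communities.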

8. Tyler-Wilkinson-Huberman (TWH) method

J. R. Tyler, D. M. Wilkinson, and B. A. Huberman, cond-mat/0303264 (2003)

A variation of the Girvan-Newman algorithm that improves the calculation speed.

Instead of summing over all vertices, Tyler et al. suggest summing over only a subset of vertices i, giving a partial betweenness score for every edge; if a random sample is chosen, this gives a Monte Carlo estimate of the betweenness.

The number of vertices sampled is chosen so as to make the betweenness of at least one edge in the network greater than a certain threshold.

This stochastic approach reduces the time complexity from $O(m^2 n)$ to $O(m^2)$.

9. RCCLP method or Parisi method (Radicchi-Castellano-Cecconi-Loreto-Parisi method)

F. Radicchi, C. Castellano, F. Cecconi, V. Loreto, and D. Parisi, PNAS 101, 2658 (2004)

Edge clustering coefficient:

$$C^{(3)}_{i,j} = \frac{z^{(3)}_{i,j} + 1}{\min[(k_i - 1), (k_j - 1)]}$$

Definition of community in a strong sense:

$$k_i^{in}(V) > k_i^{out}(V), \quad \forall i \in V.$$

Definition of community in a weak sense:

$$\sum_{i \in V} k_i^{in}(V) > \sum_{i \in V} k_i^{out}(V).$$

$z^{(3)}_{i,j}$: the number of triangles built on that edge.
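A minimal sketch of the order-3 edge clustering coefficient, assuming the graph is stored as a dict of neighbour sets. Triangles on an edge are the common neighbours of its two endpoints. Returning infinity when one endpoint has degree 1 (so that such an edge is never the minimum) is a convention chosen here, and the function name is illustrative.

```python
def edge_clustering(adj, i, j):
    """C3(i, j) = (triangles on edge i-j + 1) / min(k_i - 1, k_j - 1)."""
    triangles = len(adj[i] & adj[j])                 # common neighbours close a triangle
    possible = min(len(adj[i]) - 1, len(adj[j]) - 1)
    if possible == 0:
        return float("inf")                          # assumption: leaf edges are never removed
    return (triangles + 1) / possible

# Hypothetical example: two triangles joined by the edge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
c_bridge = edge_clustering(adj, 2, 3)
c_inside = edge_clustering(adj, 0, 1)
print(c_bridge, c_inside)
```

The edge between the two triangles belongs to no triangle and gets the smallest coefficient, so a divisive procedure that removes the lowest-coefficient edge first would cut it.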

[Figure: example edge between vertices $i$ and $j$]

Edge clustering coefficient of order $g$:

$$C^{(g)}_{i,j} = \frac{z^{(g)}_{i,j} + 1}{s^{(g)}_{i,j}}$$

Time complexity $O(m^4/n^2)$, i.e. $\sim O(n^2)$ on a sparse graph.

The edge clustering coefficient is strongly negatively correlated with edge betweenness.

This algorithm relies on the presence of triangles in the network. Clearly, if a network has few triangles in the first place, then the edge clustering coefficient will be small for all edges, and the algorithm will fail to find the communities.

10. Information centrality method (Fortunato-Latora-Marchiori method)

S. Fortunato, V. Latora, and M. Marchiori, cond-mat/042522 (2004)

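Network efficiency $E$:

$$E[G] = \frac{1}{N(N-1)} \sum_{i \neq j} \varepsilon_{ij} = \frac{1}{N(N-1)} \sum_{i \neq j} \frac{1}{d_{ij}}$$

Information centrality $C^I$:

$$C^I_k = \frac{\Delta E}{E} = \frac{E[G] - E[G'_k]}{E[G]}$$

where $G'_k$ is the network with edge $k$ removed.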

Iterative removal of the edges with the highest information centrality

Time complexity $O(m^3 n)$
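A minimal sketch of the efficiency and information centrality of an edge, assuming an unweighted graph stored as a dict of neighbour sets, with shortest-path lengths found by breadth-first search. Unreachable pairs contribute zero efficiency. The function names are illustrative, and the iterative removal loop is not shown.

```python
from collections import deque

def efficiency(adj):
    """E[G] = 1/(N(N-1)) * sum over ordered pairs i != j of 1/d_ij."""
    n = len(adj)
    total = 0.0
    for s in adj:
        dist = {s: 0}
        queue = deque([s])
        while queue:                         # BFS shortest-path lengths from s
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(1.0 / d for d in dist.values() if d > 0)
    return total / (n * (n - 1))

def information_centrality(adj, u, v):
    """Relative drop in efficiency caused by removing the edge u-v."""
    base = efficiency(adj)
    pruned = {w: set(nbrs) for w, nbrs in adj.items()}   # copy, then delete the edge
    pruned[u].discard(v)
    pruned[v].discard(u)
    return (base - efficiency(pruned)) / base

# Hypothetical example: two triangles joined by the edge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
e_full = efficiency(adj)
c_bridge = information_centrality(adj, 2, 3)
c_inside = information_centrality(adj, 0, 1)
print(e_full, c_bridge, c_inside)
```

Removing the bridging edge disconnects the two triangles and costs far more efficiency than removing an edge inside a triangle, so it has the highest information centrality.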

Test network: a 64-node computer-generated network with 256 edges and 4 groups of 16 nodes each.

11. Flake’s max-flow method (Flake-Lawrence-Giles-Coetzee method)

G. W. Flake, S. R. Lawrence, C. L. Giles, and F. M. Coetzee, IEEE Computer 35, 66 (2002)

Web community

Starting page or seed Web sites

Find the boundary of the community using max-flow and min-cut.

Uses only link information, without the text information. Ex) PageRank, Hyperlink-Induced Topic Search (HITS)

Simple example of max-flow, min-cut
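A minimal sketch of max-flow and min-cut on a small graph, using breadth-first augmenting paths (the Edmonds-Karp variant of Ford-Fulkerson). It is a generic illustration of the max-flow/min-cut idea rather than the full web-community procedure: the capacities, the unit weights, and the function name `max_flow_min_cut` are assumptions.

```python
from collections import deque

def max_flow_min_cut(capacity, source, sink):
    """capacity: dict {(u, v): c} of directed arcs. Returns (flow value, source side of a min cut)."""
    residual = dict(capacity)
    for (u, v) in capacity:
        residual.setdefault((v, u), 0)          # reverse arcs for cancelling flow
    neighbours = {}
    for (u, v) in residual:
        neighbours.setdefault(u, []).append(v)
    flow = 0
    while True:
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:     # BFS in the residual graph
            u = queue.popleft()
            for v in neighbours.get(u, []):
                if v not in parent and residual[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            # No augmenting path left: vertices still reachable form the source side.
            return flow, set(parent)
        path, v = [], sink
        while parent[v] is not None:            # walk back from sink to source
            path.append((parent[v], v))
            v = parent[v]
        push = min(residual[arc] for arc in path)
        for u, v in path:
            residual[(u, v)] -= push
            residual[(v, u)] += push
        flow += push

# Hypothetical example: two triangles joined by the edge 2-3, unit capacity in both directions.
links = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
capacity = {}
for u, v in links:
    capacity[(u, v)] = 1
    capacity[(v, u)] = 1
flow, community = max_flow_min_cut(capacity, source=0, sink=5)
print(flow, community)
```

The maximum flow equals the capacity of the single bottleneck link, and the vertices left on the source side of the cut form the community around the seed vertex.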

- Spectral analysis: eigenvalues and eigenvectors of the Laplacian or transfer matrix
- Optimization approach: Hamiltonian, benefit function, or modularity Q
- Edge removal: betweenness, information centrality, clustering coefficient, etc.
- Hierarchical clustering: metric (Euclidean, correlation, similarity, etc.)

12. ESMS method or K. Sneppen method (Eriksen-Simonsen-Maslov-Sneppen method)

K. A. Eriksen, I. Simonsen, S. Maslov, and K. Sneppen, Phys. Rev. Lett. 90, 148701 (2003)

13. CSCC method or Capocci method (Capocci-Servedio-Caldarelli-Colaiori method)

A. Capocci, V. D. P. Servedio, G. Caldarelli, and F. Colaiori, cond-mat/0402499

14. Donetti-Muñoz (DM) method

L. Donetti and M. A. Muñoz, cond-mat/0404652

15. Wu-Huberman (WH) method

F. Wu and B. A. Huberman, cond-mat/310600

16. Costa’s hub-based flooding method

L. F. Costa, cond-mat/0405022 (2004)
