
arXiv:0808.2633v4 [physics.soc-ph] 20 Jun 2009

Towards real-time community detection in large networks

Ian X.Y. Leung,* Pan Hui,* Pietro Liò,* and Jon Crowcroft*

Computer Laboratory, University of Cambridge, Cambridge CB3 0FD, U.K.

The recent boom of large-scale Online Social Networks (OSNs) both enables and necessitates the use of parallelisable and scalable computational techniques for their analysis. We examine the problem of real-time community detection and a recently proposed linear time (O(m) on a network with m edges) label propagation or "epidemic" community detection algorithm. We identify characteristics and drawbacks of the algorithm and extend it by incorporating different heuristics to facilitate reliable and multifunctional real-time community detection. With limited computational resources, we employ the algorithm on OSN data with 1 million nodes and about 58 million directed edges. Experiments and benchmarks reveal that the extended algorithm is not only faster but its community detection accuracy compares favourably with popular modularity-gain optimization algorithms known to suffer from their resolution limits.

PACS numbers: 89.75.Hc, 87.23.Ge, 89.20.Hh, 05.10.-a

I. INTRODUCTION

Recent years have seen the flourishing of numerous Online Social Networks (OSNs). Cyber communities such as Facebook, MySpace and Orkut, where users can keep in touch with friends on the Internet, have all emerged as top 10 sites globally in terms of traffic. Tools and algorithms to understand the network structures have consequently emerged as popular research topics. By their nature, OSNs contain an immense number of person nodes which are sparsely connected. Edges are often bidirectional, since a mutual agreement is required before such friendship links are established. One of the most notable phenomena in such networks is the resemblance to the so-called six degrees of separation [1], where on average every person is related to another random person via 5 other people in the real world. This has indeed been shown in real-life communities and, much more conveniently, in online communities [23]. Networks which exhibit such small degrees of separation while being sparsely connected are famously known as Small-World Networks [2].

Well-established online communities often contain tens of millions of users connected by some billions of edges, which enable (and necessitate) the use of parallelisable and scalable computational techniques for their analysis. In this paper, we examine the problem of network community detection. Graphically, such communities are characterized by a group of nodes which are densely connected by internal edges but less so towards the outside of the communities, as depicted by the densely connected subgraphs in Fig. 1. Understanding the community structure and dynamics of networks is vital for the design of related applications and for devising business strategies, and may even have direct implications on the design of the networks themselves [3].

We empirically analyse a recently proposed community detection technique by label propagation discussed in [4], which is summarised as follows. Each node in a network is first given a unique label. In every iteration, each node is updated by choosing the label which most of its neighbours have (the maximal label). If there happen to be multiple maximal labels (which is typical in the beginning), one label is picked randomly. Previous results have shown that this algorithm is extremely efficient in uncovering accurate community structure. As an example, we apply the algorithm on a set of OSN connection data crawled by Mislove et al. [3] of 3 million nodes connected by roughly 0.2 billion directed links.

* Electronic address: [firstname.lastname]@cl.cam.ac.uk

FIG. 1: Snapshot of a subgraph of an OSN (500 nodes).

We give a survey of related work in the next Section and look further into the characteristics of the algorithm in Section III. We discuss the potential implementations, improvements and applications of the algorithm on different types of networks (Section IV). Section V gives detailed comparisons between the label propagation algorithm (LPA) and fast modularity-optimization algorithms. We conclude the paper with future directions of research in Section VI.

II. RELATED WORK

Community detection in complex networks has attracted ample attention in recent years. Apart from OSNs, researchers have engaged in community analysis in various types of networks. In the case of the Internet, examples of communities are found in autonomous systems [5] and indeed web pages of similar topic [6]. In biological networks, it is widely believed that modular structure plays a crucial role in biological functions [7]. Related literature such as [8, 9, 10] may serve as introductory reading, and also includes methodological overviews and comparative studies of different algorithms.

The detection of community structure in a network is generally intended as a procedure for mapping the network into a tree [11], known as a dendrogram. In this tree, the leaves are the nodes and the branches join them or (at a higher level) groups of them, thus identifying a hierarchy of communities. Nodes can either be agglomerated successively, starting from single nodes (agglomerative), or the whole network can be recursively partitioned (divisive). Newman and Girvan introduced a seminal divisive algorithm in which the selection of the edge to be cut is based on the value of its edge betweenness [12], the number of shortest paths between all node pairs running through it. It is clear that, when a graph is made of tightly bound clusters, each loosely interconnected, all shortest paths between nodes in different clusters have to go through the few inter-cluster connections, which therefore have a large betweenness value. Recursively removing these large-betweenness edges would partition the network into communities of different sizes.

Quantitatively, however, we need a metric to measure how well the community detection is progressing; otherwise, most algorithms would either continue until every node is split into a single community or all join together into one. Newman and Girvan proposed in [12] a measure of the goodness of communities called modularity: for the set of uncovered communities C, the modularity is defined to be

$$Q = \sum_{c \in C} \left[ \frac{I_c}{E} - \left( \frac{2 I_c + O_c}{2E} \right)^2 \right] \tag{1}$$

where $I_c$ indicates the total number of internal edges that have both ends in c, $O_c$ is the number of outgoing edges that have only one end in c, and E is the total number of edges. This measure essentially compares the number of links inside a given module with the expected value for a randomized graph of the same size and same degree sequence.
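As an illustration of Eq. (1), the following is a minimal sketch that evaluates Q for an undirected, unweighted edge list and a given assignment of nodes to communities. The function name `modularity` and the input layout (a list of node pairs and a node-to-community dictionary) are assumptions made for this example only; they are not taken from the implementation discussed in this paper.

```python
def modularity(edges, community):
    """Evaluate Eq. (1) for an undirected edge list.

    edges     -- list of (u, v) pairs, each undirected edge listed once
    community -- dict mapping every node to its community identifier
    (Both input conventions are assumptions of this sketch.)
    """
    total_edges = len(edges)              # E in Eq. (1)
    internal = {}                         # I_c: edges with both ends in c
    outgoing = {}                         # O_c: edges with exactly one end in c
    for u, v in edges:
        cu, cv = community[u], community[v]
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
        else:
            outgoing[cu] = outgoing.get(cu, 0) + 1
            outgoing[cv] = outgoing.get(cv, 0) + 1
    q = 0.0
    for c in set(community.values()):
        i_c = internal.get(c, 0)
        o_c = outgoing.get(c, 0)
        q += i_c / total_edges - ((2 * i_c + o_c) / (2 * total_edges)) ** 2
    return q


# Two triangles joined by a single bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
split = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
merged = {v: "a" for v in range(6)}
print(modularity(edges, split), modularity(edges, merged))
```

For the two-triangle example, each community has $I_c = 3$ and $O_c = 1$ with E = 7, so the split partition gives Q = 2(3/7 - 1/4) = 5/14, whereas placing all nodes in one community gives Q = 0.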

The concept of modularity has gained such popularity that it has not only been used as a measure of the community partitioning of a network but also as a key fitness indicator in various community detection algorithms. The algorithm proposed by Clauset, Newman and Moore (CNM) [13], which greedily combines nodes/communities to optimize modularity gain, is perhaps to date one of the most popular algorithms for detecting communities in relatively large-scale networks. At the time CNM was proposed, it was the only algorithm capable of community detection on networks of size 500,000 in a matter of hours. Over the years, several variations of CNM have been proposed [14, 15, 16]. Most of them concentrate on more efficient data structures as well as modularity-gain heuristics to improve the overall performance. A recent adaptation [16] that treats newly combined communities as a single node after each iteration is able to identify community structure on a network containing 1 billion edges in a matter of hours.

It is vital, however, to understand that modularity is not a scale-invariant measure and hence, by blindly relying on its maximization, detection of communities smaller than a certain size is impossible. This is famously known as the resolution limit [17] of modularity-based algorithms. Since LPA does not involve modularity optimization, its community detection capability is scale-independent and therefore not affected by the resolution limit, as will be shown in Section V.

III. DISCUSSION

Here we give a brief discussion on the characteristics of the algorithm as well as some preliminary results from applying the algorithm on the OSN described above.

A. A "near linear time" algorithm

One can consider the label spreading as a simplified but specific case of epidemic spreading, where all individuals are considered infectious with their own unique disease. Each person is infected by a disease that is prevalent in his or her neighbourhood. Fig. 2 depicts the labelling convergence seen in a 4-clique. The number of clusters monotonically decreases each iteration as certain labels become extinct due to domination by other labels. Apart from certain rare and exceptional cases, the labelling self-organises to an unsupervised equilibrium efficiently.
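The label spreading described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used for the experiments in this paper: the function name `label_propagation`, the adjacency-dictionary input, the stopping rule (a full sweep with no label change) and the tie rule (a node keeps its current label whenever that label is among the maximal ones, otherwise one maximal label is drawn at random) are all choices made for this example.

```python
import random
from collections import Counter


def label_propagation(adj, seed=0, max_sweeps=100):
    """Asynchronous label propagation on an undirected graph.

    adj -- dict mapping each node to an iterable of its neighbours.
    Returns (labels, sweeps), where sweeps counts the passes made,
    including the final pass in which no label changed.
    """
    rng = random.Random(seed)
    labels = {v: v for v in adj}          # every node starts with a unique label
    order = list(adj)
    for sweep in range(1, max_sweeps + 1):
        rng.shuffle(order)                # randomised update order each sweep
        changed = False
        for v in order:
            neighbours = list(adj[v])
            if not neighbours:
                continue                  # isolated nodes keep their label
            counts = Counter(labels[u] for u in neighbours)
            top = max(counts.values())
            maximal = [lab for lab, c in counts.items() if c == top]
            if labels[v] in maximal:
                continue                  # assumed tie rule: keep a maximal label
            labels[v] = rng.choice(maximal)
            changed = True
        if not changed:
            return labels, sweep
    return labels, max_sweeps


clique4 = {v: [u for u in range(4) if u != v] for v in range(4)}
labels, sweeps = label_propagation(clique4)
print(len(set(labels.values())), sweeps)
```

Under this tie rule, a 4-clique reaches a single label within the first sweep regardless of the update order, and the second sweep merely confirms that nothing changes.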

As suggested in [4], certain properties may prevent the equilibrium from occurring. For instance, a network with a bipartite structure might cause the system to oscillate if the algorithm is run synchronously, i.e., all nodes are updated together only after they have selected their maximal labels. Running the algorithm asynchronously in a randomized order every iteration, as suggested in the paper, may result in less definitive results but solves the problem. It was also suggested that a node that has two equally maximal labels to choose from may fail to converge, and that an extra stopping criterion to prevent the switching of labels would have to be in place. It is, however, noted in our implementation that including the concerned label itself in the maximal label consideration effectively avoids all the above non-convergent behaviours and the requirement for an extra stopping criterion.

FIG. 2: Each node is looked at in a certain order and a new label is selected. The above shows how nodes in a 4-clique self-organise into one single community in one iteration.
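The oscillation of the synchronous update on a bipartite structure can be reproduced on a small star graph. The sketch below is illustrative only: the function name `sync_step` and the deterministic tie-break (the smallest maximal label is chosen instead of a random one, so that the example is reproducible) are assumptions of this example.

```python
from collections import Counter


def sync_step(adj, labels):
    """One synchronous update: every node reads the OLD labels of its neighbours."""
    new_labels = {}
    for v, neighbours in adj.items():
        counts = Counter(labels[u] for u in neighbours)
        top = max(counts.values())
        # Assumed tie-break for reproducibility: smallest maximal label.
        new_labels[v] = min(lab for lab, c in counts.items() if c == top)
    return new_labels


# Star graph: centre 0 connected to leaves 1, 2, 3 (a bipartite structure).
star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
state0 = {v: v for v in star}
state1 = sync_step(star, state0)
state2 = sync_step(star, state1)
state3 = sync_step(star, state2)
print(state1, state2, state3)
```

After the first step the centre and the leaves simply exchange their two labels at every step, so the state at step 3 equals the state at step 1 and no equilibrium is reached.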

In one iteration, each node's neighbours are examined and the maximal label is chosen. The running time of this algorithm is therefore O(knd), where k is the number of iterations, n the number of nodes and d the average degree of nodes. Note that nd can also be described by m, the number of edges. The number of iterations required, k, is dependent on the stopping criterion but is not very well understood. [4] suggested that the number of iterations required is independent of the number of nodes and that after 5 iterations 95% of their nodes are already accurately clustered.

Since labels can hardly affect nodes outside their local densely connected substructures, the convergent behaviour should be dependent on these substructures rather than the whole network. This is confirmed by preliminary testing and directs us to look at substructures which can ultimately become the community. Experiments show that the average numbers of iterations required for the labelling to converge (no change in labels) in an N-clique for the asynchronous and synchronous implementations are 2.1 and 3.6 respectively, highly independent of N. To further investigate the average convergent behaviour on a substructure, we look at Fig. 3, which summarises the relationship between the number of iterations required before convergence, k, and the pairwise connectivity p that controls the edge density in a random graph of size N (where p = 1 corresponds to the N-clique).

In both implementations, we see that k remains fairly constant over both N and p until p reaches a certain threshold, beyond which we begin to see an inverse dependence between N and k. The overall averages of the asynchronous and synchronous implementations in this case are 2.8 and 5.2.

FIG. 3: The above plots show the number of iterations required before convergence, k, for both the synchronous and asynchronous implementations on a random graph of size N with probability of pairwise connection p. All values here are averaged over 100 realisations. [Axes: N (×100) from 1 to 10, p from 0.1 to 1, k from 2 to 11; series: Sync, Async.]

Let us, however, consider another simple but non-random topology. Suppose we start off with an N-clique; at each jth construction, the graph is grown by connecting the N − j most recently joined nodes to the new node (cf. Fig. 4).

FIG. 4: This substructure is constructed on an N-clique, N = 25, by attaching each new node labelled l, N < l < 2N, to existing nodes l − 1, ..., 2(l − N); it thus contains 49 (2N − 1) nodes and 600 (N(N − 1)) edges.

These structures, by construction, will converge into a single community under LPA. Without worrying about how abundant such patterns are in real-world communities, we look at the convergent behaviour shown in Fig. 5. The trend clearly reveals that k grows logarithmically with respect to N. We therefore suggest that the possible worst case of k is of the order of O(log N), where N is the size of the largest substructure with a topology similar to the above. Indeed, we anticipate real-world social networks to contain highly heterogeneous substructures which may be intricately connected and affect each other's convergence. We thus consider the understanding of the convergent behaviour in large complex networks such as OSNs as a direction for further investigation.

FIG. 5: The relationships between the number of iterations required before convergence, k, of both implementations and the size N of the aforementioned structure. All values here are averaged over 100 realisations. [Axes: N from 1 to 1000 (logarithmic), k from 0 to 25; series: Sync, Async.]

B. Community Detection in OSN

We carry out community detection on the aforementioned OSN using a desktop PC with 4 GB RAM and a 2.4 GHz quad-core processor running a 32-bit Java VM 1.6. Due to limited memory, we restrict the number of nodes to the first million. Since the order of nodes in the original data corresponds to that of a breadth-first web crawl, this way of "cutting off" the data is equivalent to extracting a snowball sample. As discussed in [3], snowball methods are known to over-sample high-degree nodes, under-sample low-degree ones and overestimate the average node degree. This is seen in the higher average degree of the subgraph, 250, compared to 106 of the original graph. Nonetheless, since the purpose of this paper is to evaluate the algorithm on large-scale networks, the sampled network satisfies our requirements. The sampled subgraph contains 1,000,000 nodes and 58,793,458 directed links. Convergent behaviours of the two different implementations are shown in Fig. 6.

A crucial point is that, in a complex network as large as this, the so-called "convergence" does not necessarily yield an optimal result in terms of modularity. For example, we see that the asynchronous implementation took merely 5 iterations on average to achieve a maximum modularity but has highly volatile results in different runs, as depicted by the shaded area in the figure. On the other hand, the synchronous implementation achieved maximum modularity much more slowly than the asynchronous version, but its performance on average is much more stable (its performance range is thus omitted). The performances of these two different implementations are equally important to understand and utilise. Further discussions on the implications of these implementations and their utilization are given in Section IV.

FIG. 6: Average performances of asynchronous and synchronous LPA. Values are averaged over 5 runs. The shaded area denotes the range of the performances of the asynchronous implementation. [Plotted against iteration (1 to 20): modularity Q (0 to 1) and number of communities (0 to 50,000); series: Range (Async), Avg. Q (Async), Avg. Q (Sync), Avg. No. of Comm. (Async), Avg. No. of Comm. (Sync).]

Each single-threaded iteration finishes in a matter of tens of seconds and thus, depending on the stopping criterion, it can take as little as 8 to 10 minutes to reach peak performance. Extrapolating the time required with respect to the number of edges, the algorithm without any optimization should be able to detect communities on a graph with 1 billion edges in less than 180 minutes, a magnitude similar to that in [16].

Fig. 7 shows the distribution of community/cluster sizes collected from a specific run of the asynchronous version of the algorithm when the modularity peaked at 0.638. The size distribution of communities within the OSN follows a 2-part power-law distribution in the complementary CDF, with an estimated coefficient of 1.1. The interested reader is referred to [10, 18] for discussions on the characteristics of different networks.

IV. A MORE RELIABLE AND EFFICIENT ALGORITHM

In this section, we discuss potential modifications to the algorithm to increase its reliability, functionality and computational efficiency.

A. Hop Attenuation & Node Preference

Due to the "epidemic" nature of the algorithm, a major limitation is noted where a certain "label epidemic" manages to "plague" a large number of nodes. To be exact, in some runs, a certain community of size over 500,000 (50% of the number of nodes) is formed, as opposed to the thousand other counterparts, which are on average sized in the order of 100s, greatly contributing to the modularity drop after the peak. We conjecture that this is partially due to the asynchronous nature of the algorithm and the initial formation of communities, where certain communities do not form strong enough links to prevent a foreign "epidemic" from sweeping through. Further experiments confirm that the synchronous version of the algorithm slows down the formation of such "monster" communities but does not prevent them.

FIG. 7: The community-size distribution of communities uncovered by the algorithm, which follows a 2-part power law. [Log-log axes: community size s (1 to 10^6) against P(S > s) (0.0001 to 1).]

We propose an extension to this algorithm by adding a score associated with the label which decreases as it traverses from its origin. A node is initially given a score of 1.0 for its label. After a node i has collected from its neighbourhood $N_i$ all the respective labels and the scores, the calculation of the new maximal label $L'$ can be generalised by

$$L'_i = \operatorname*{argmax}_{L} \sum_{i' \in N_i} s_{i'}(L_{i'}) \cdot f(i')^m \cdot w_{i'i} \tag{2}$$

where $L_i$ is the label of node i, $s_i(L)$ is the hop score of label L in i, $w_{i'i}$ is the weight of the edge between $i'$ and i (we sum the weights in both directions if the graph is directed) and f(i) is any arbitrary comparable characteristic for any node i. For instance, if we define f(i) = Deg(i), then when m > 0 more preference is given to nodes with more neighbours, and when m < 0, less. The final step is to assign a new attenuated score $s'$ to the new label $L'$ of i by subtracting the hop attenuation δ, 0 < δ < 1:

$$s'_i(L'_i) = \left( \max_{i' \in N_i(L'_i)} s_{i'}(L_{i'}) \right) - \delta \tag{3}$$

where $N_i(L)$ is the set of neighbours of i that have label L. The value δ governs how far a particular label can spread as a function of the geodesic distance from its origin. This additional parameter adds extra uncertainties to the algorithm but may encourage stronger local communities to form before a large cluster starts to dominate. Ideally, the selection of δ can even be adaptive to the current number of iterations, the neighbourhood of the node concerned and perhaps some a priori network parameters. We investigate the use of a varying δ in the next section and assume here a constant value for δ. Note that this setting may induce a negative feedback loop; we therefore let δ = 0 if the selected label is equal to the current label.
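A single node update with hop attenuation and node preference might be sketched as follows. Several points are assumptions of this example rather than statements of the implementation used here: the sum in Eq. (2) is read as running, for each candidate label L, over the neighbours currently holding L; f is taken to be the node degree, as in the example given in the text; ties are broken deterministically by the smallest label; and when the selected label equals the current one, Eq. (3) is applied literally with δ = 0. The function name `hop_update` and the data layout are hypothetical.

```python
from collections import defaultdict


def hop_update(i, adj, labels, scores, delta=0.1, m=0.0):
    """Return (new_label, new_score) for node i following Eqs. (2) and (3).

    adj[i]    -- dict mapping each neighbour i' of i to the edge weight w_{i'i}
    labels[v] -- current label of node v
    scores[v] -- current hop score of node v for its own label
    f(i') is assumed to be Deg(i') = len(adj[i']).
    """
    totals = defaultdict(float)   # per-label weighted sum, Eq. (2)
    best = {}                     # per-label maximum neighbour score, Eq. (3)
    for j, w in adj[i].items():
        lab = labels[j]
        totals[lab] += scores[j] * (len(adj[j]) ** m) * w
        best[lab] = max(best.get(lab, float("-inf")), scores[j])
    if not totals:                # isolated node: nothing to adopt
        return labels[i], scores[i]
    # Assumed tie-break (not specified in the text): largest total, then smallest label.
    new_label = min(totals, key=lambda lab: (-totals[lab], lab))
    # delta is set to 0 when the selected label equals the current label.
    attenuation = 0.0 if new_label == labels[i] else delta
    return new_label, best[new_label] - attenuation


adj = {0: {1: 1.0, 2: 1.0, 3: 1.0}, 1: {0: 1.0}, 2: {0: 1.0}, 3: {0: 1.0}}
labels = {0: "z", 1: "a", 2: "a", 3: "b"}
scores = {0: 1.0, 1: 0.5, 2: 0.4, 3: 1.0}
print(hop_update(0, adj, labels, scores, delta=0.1, m=0.0))
```

In this example label "a" is held by two neighbours but with attenuated scores (0.5 + 0.4 = 0.9), so the single fresh label "b" (score 1.0) wins and node 0 receives it with score 1.0 − δ.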

As discussed, modularity has been widely used in the literature as a metric to contrast the community detection capabilities of different algorithms on real-world networks. Whilst high modularity indicates a significant modularised structure over a randomised graph of the network concerned, the correspondence between high modularity and accurately partitioned communities is not well understood due to the resolution limit of modularity. Here we attempt to contrast the behaviours of the algorithms on the OSN based on modularity but shall not draw strong conclusions on the accuracies of the community detection, for the above reasons. In Section V, a novel benchmark proposed by Lancichinetti et al. [19], capable of revealing the resolution limit of modularity-based algorithms, is used for further comparisons.

Fig. 8 depicts the average performance curves over 5 runs for both versions of the algorithm applying hop attenuation and preferential linkage. The results suggest that, in both implementations, a slight but not too high preference for high-degree nodes (m > 0) can speed up the process of achieving peak modularity on the OSN but also gives rise to a steeper drop, as shown in Fig. 8(a). We believe, however, that different magnitudes of m simply restrict the choice of nodes to different subsets, some of which may contribute to a "global pandemic" and some of which may not. Simply using the degree of a node may not be a heuristic generic enough for different networks. Further study is required to understand, if at all possible, how to deduce a generic preference on neighbourhood labels every iteration without resorting to a global metric, which is costly. Nonetheless, we show that giving preference to certain nodes over others when deciding between labels to accept can be beneficial in terms of the number of iterations needed to achieve maximum modularity.

Looking at hop attenuation, we find that the application of δ indeed deters the occurrence of the "monster clusters", as expected, thereby preventing the modularity drop after certain iterations. But it was also obvious that high hop attenuation prevented the healthy growth of the communities and restricted the increase in modularity (cf. Fig. 8(b), 8(e)). Moreover, we conjecture that hop attenuation restrains the spread of the label from an arbitrary center and thereby favours the formation of circular clusters. This suppression of non-circular clusters may lead to the suboptimal performance in terms of modularity shown in the asynchronous case (Fig. 8(e)). Finally, from Fig. 8(c) and 8(f), we see that combining both parameters on average benefits both versions of the algorithm in achieving a community partitioning of high modularity more efficiently and consistently.

FIG. 8: Average performance comparisons of the synchronous and asynchronous implementations with varying δ and m over 5 runs. [Each panel plots modularity Q (0 to 0.7) against iteration (2 to 20). Synchronous: (a) m = 0, 0.1, 0.2, −0.1; (b) δ = 0, 0.05, 0.1, 0.2; (c) (m, δ) = (0, 0), (0.1, 0.05), (0.05, 0.1), (0.1, 0.1). Asynchronous: (d), (e), (f) with the same respective parameter sets.]

B. Hierarchical & Overlapping Communities

Communities in certain networks are known to be hierarchical. For instance, students in the same classes often form strong local communities, while these communities, say of the same school, in turn form a larger but relatively weaker community. As discussed in Section II, most CNM-based algorithms are inherently hierarchical since communities are agglomerated by greedy local optimization of modularity gain.

We present two simple modifications to the original method to enable the detection of hierarchical communities. Firstly, let us consider the application of hop attenuation on label propagation. Suppose we impose a very high hop attenuation at the beginning; we expect communities of small diameter to form. If we then gradually relax the attenuation value, we should expect these small communities to merge into larger ones. In order to achieve this, we modify Eq. (3) as follows:

$$s'_i(L_i) = 1 - \delta\big(d_G(O(L_i), i)\big) \tag{4}$$

where

$$d_G(O(L_i), i) = 1 + \min_{i' \in N_i(L_i)} d_G(O(L_i), i') \tag{5}$$

Essentially, instead of receiving the current hop scores from the neighbourhood and carrying out a subtraction, the score is now determined by the actual geodesic distance ($d_G$) from the origin of label L, denoted by O(L), and the function δ. This gives greater flexibility to δ in terms of geodesic distances and can facilitate iteration-dependent hop attenuation, as required here, at a slight extra computational cost.
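The distance-based score of Eqs. (4) and (5) can be sketched as below. The example assumes that initial labels are the node identifiers themselves, so that the origin O(L) of a label L is simply node L, and that each node stores its current hop distance from the origin of its label; the function names and the linear attenuation function used in the usage lines are hypothetical choices for illustration.

```python
def hop_distance(i, adj, labels, dist):
    """Eq. (5): distance of node i from the origin of its current label.

    Assumes labels are initial node ids, so the origin of label L is node L,
    and that dist[v] holds the stored distance of every node v.
    """
    label = labels[i]
    if label == i:
        return 0                      # node i is the origin of its own label
    # A node only adopts a label held by at least one neighbour, so this
    # list is non-empty whenever the function is called after an adoption.
    holders = [dist[j] for j in adj[i] if labels[j] == label]
    return 1 + min(holders)


def distance_score(i, adj, labels, dist, delta_fn):
    """Eq. (4): score 1 - delta(d), where delta_fn maps a distance to an attenuation."""
    d = hop_distance(i, adj, labels, dist)
    return d, 1.0 - delta_fn(d)


# Path 0 - 1 - 2; label 0 spreads from node 0 to node 1 and then to node 2.
path = {0: [1], 1: [0, 2], 2: [1]}
labels = {0: 0, 1: 1, 2: 2}
dist = {0: 0, 1: 0, 2: 0}


def linear_delta(d):
    # Hypothetical attenuation function: 0.25 per hop.
    return 0.25 * d


labels[1] = 0
dist[1], score1 = distance_score(1, path, labels, dist, linear_delta)
labels[2] = 0
dist[2], score2 = distance_score(2, path, labels, dist, linear_delta)
print(dist, score1, score2)
```

Because the attenuation is now a function of distance, relaxing it between iterations only requires swapping `delta_fn`, without touching the stored distances.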

Our second proposal is inspired by [16], where we can similarly treat newly combined communities as a single node and use the number of inter-community edges as the weight of edges between these "freshly condensed" nodes. Instead of doing this every iteration, we can apply a certain amount of hop attenuation, or a hard limit in terms of the diameter of the community, and do this after an equilibrium is reached.
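The condensation step of this second proposal can be illustrated with a short sketch. It assumes an undirected edge list and a node-to-label dictionary obtained after an equilibrium is reached; the function name `condense` and the representation of the condensed graph (a dictionary keyed by ordered label pairs) are choices made for this example.

```python
from collections import defaultdict


def condense(edges, labels):
    """Collapse each community into a single node.

    edges  -- list of undirected (u, v) pairs
    labels -- dict mapping each node to its community label
    Returns a dict {(label_a, label_b): weight}, with label_a < label_b,
    where weight is the number of edges running between the two communities.
    """
    weights = defaultdict(int)
    for u, v in edges:
        cu, cv = labels[u], labels[v]
        if cu != cv:                          # internal edges are dropped
            key = (cu, cv) if cu < cv else (cv, cu)
            weights[key] += 1
    return dict(weights)


# Two triangles joined by two bridge edges.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3), (1, 4)]
labels = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
print(condense(edges, labels))
```

Label propagation can then be run again on the condensed graph, with the returned weights playing the role of the edge weights in the label update.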

Fig. 9 gives an illustration of the first modification applied on a subgraph of the OSN. Note that this modification depends very much on the initial labelling of nodes because it determines the initial centers of these small communities.

FIG. 9: (Color online) Community detection in the OSN (n = 3000) by gradually decreasing hop attenuation (δ = 0.5 at the top with Q = 0.64; δ = 0 at the bottom with Q = 0.78). Nodes with 3 or fewer neighbours are filtered to ease the visualisation.

Another important question, which was also briefly addressed in [4], is the problem of overlapping communities [20], i.e., nodes can often be considered members of several different communities. From previous sections we understood that the asynchronous version of the algorithm is capable of generating very different results in different runs. This is exactly what [4] suggested as a potential solution: to re-run the algorithm several times. In a parallel environment, however, the results tend to fluctuate much less. An initial attempt was to increase the number of labels passed between nodes each time to achieve a similar effect. Preliminary experiments indicate limited success, since this setting hampers the convergence process, possibly due to latent labels switching back and forth in the system. Another possibility is to exploit the fact that nodes on the border of their community have different proportions (purity) of neighbours from other communities. We can potentially use that as a measure of membership, but this may only be applicable to such boundary nodes.
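One possible reading of the purity-based membership idea is sketched below. The text only suggests it as a possibility, so the exact measure (the fraction of a node's neighbours carrying each label) and the function name `membership` are assumptions.

```python
from collections import Counter


def membership(node, neighbours, labels):
    """Fraction of the node's neighbours that carry each label."""
    counts = Counter(labels[v] for v in neighbours[node])
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Node 0 sits on a border: three neighbours in community "a", one in "b".
neighbours = {0: [1, 2, 3, 4]}
labels = {1: "a", 2: "a", 3: "a", 4: "b"}
shares = membership(0, neighbours, labels)
```

A node deep inside a community gets a single share of 1.0, which illustrates why such a measure is only informative for boundary nodes.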

C. Optimization

The individual inspection of every node, particularly those with many neighbours, is a crucial factor in determining the speed of the algorithm. Putting aside efficient data structures and prudent programming, an obvious optimization we can make without much compromise on performance is to selectively update high-degree nodes. The reader may have realised that after a certain number of iterations it would be pointless to update nodes that are well inside a cluster. These nodes are surrounded by nodes with the same label, which are unlikely to change for the same reason. We employ a simple purity measure of neighbours to selectively update nodes that are on the borders of their communities. In other words, we only update nodes for which the proportion of neighbours sharing the maximal label is below a certain percentage. Indeed, small-degree nodes are likely to be avoided in early iterations in this setting, but their contributions to the overall community structure and performance are almost insignificant. We run the modified algorithm with thresholds set at 100% (equivalent to the unmodified algorithm), 80%, 60% and 40% to examine the trade-off between accuracy and speed.

FIG. 10: The difference in modularity and speed of the optimized modifications with the original. [Plot: nodes avoided (time saved) and absolute difference in Q, for thresholds of 80%, 60% and 40%, over iterations 1 to 10; vertical axis from 0 to 100.]

Figure 10 reveals that after the first iteration the extra constraint increasingly avoids updating nodes. As more nodes settle into stable clusters, progressively less time is required per iteration. Interestingly, even with a threshold as low as 40%, the absolute difference in modularity compared to the original setting is reasonably small, and the overall running time can be significantly reduced.
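The purity-based selective update discussed in this subsection can be sketched as follows. The names `purity` and `nodes_to_update` are hypothetical, and the inclusive comparison is our own choice, made so that a threshold of 100% reproduces the unmodified algorithm.

```python
from collections import Counter


def purity(node, neighbours, labels):
    """Fraction of a node's neighbours that share the maximal label."""
    nbrs = neighbours[node]
    if not nbrs:
        return 1.0  # an isolated node has nothing to update from
    counts = Counter(labels[v] for v in nbrs)
    return max(counts.values()) / len(nbrs)


def nodes_to_update(neighbours, labels, threshold):
    """Nodes whose purity does not exceed the threshold (1.0 selects every node)."""
    return [v for v in neighbours if purity(v, neighbours, labels) <= threshold]


# Star graph: the centre sees labels a, a, b; each leaf sees only the centre.
neighbours = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
labels = {0: "a", 1: "a", 2: "a", 3: "b"}
border = nodes_to_update(neighbours, labels, 0.8)
everyone = nodes_to_update(neighbours, labels, 1.0)
```

In this toy graph only the centre is revisited at the 80% threshold, while the leaves, whose single neighbour gives a purity of 100%, are skipped.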

D. Parallel & Online Analysis

Clear advantages of label propagation include the ease with which it can be parallelized and its potential for online implementation in real-time networks. Since each node is required only to know information about its neighbours and updates itself according to common rules, parallelism can be easily achieved. This brings us to another technical point: when the algorithm is completely parallelized, even without explicit synchronization, it tends to behave like the synchronous version of the algorithm. This is the key reason why we have stressed throughout this paper that improving both the synchronous and asynchronous versions of the algorithm is equally important.

The running time in a parallel environment effectively reduces to k if there are Θ(n) machines. This can be achieved in real-world ubiquitous systems such as a mobile ad-hoc network (MANET), or potentially on OSNs themselves (if members are willing to contribute their computational power) in real time. For instance, social information such as the community structure is known to benefit routing in MANETs [21]. Moreover, in such scenarios the space requirement for storing link information would become decentralised and thus insignificant.

On the same note, we see great potential in adapting the algorithm for online community detection in real-time dynamic networks, where the presence of nodes and edges is constantly evolving. The microscopic movements and intermittent presence of nodes contribute to changes in the weights of the edges. These in turn result in five distinct macroscopic behaviours of communities, namely growth, shrinkage, union, division and death. The challenge is to detect local changes without the need for a global update, given limited computational resources or time constraints. We believe label propagation is particularly suited to this paradigm and thus propose this as future work.

V. COMPARISONS

We first look at two relatively large and previously studied networks for comparison. These networks are, respectively, the Amazon purchasing network analysed in [13] and the actor collaboration network [22]. As done in [13], we assume all edges to be undirected to ease the analysis. With the added heuristics, the algorithm is able to perform within 5% of CNM and 10% of the adaptation by Danon, Díaz-Guilera and Arenas (CNM-DDA) [14] in terms of modularity (cf. Table I). LPA, however, achieves the result in a matter of minutes, which is unparalleled by the above.

For a more standardized comparison, we turn to the recently proposed benchmark graphs by Lancichinetti et al. [19], an extension of the well-known GN benchmark [12] which incorporates more realistic scale-free degree and cluster-size distributions. We follow closely the implementation of the benchmark graphs as described in [19] and compare the original LPA with the improved version on graphs of size 1000 and 5000. To contrast label propagation with general fast modularity maximisation algorithms, we also run the benchmarks on the CNM algorithm.

As shown in Fig. 11, both implementations achieve superior accuracy over CNM in terms of normalised mutual information (NMI), even up to a mixing parameter of 0.6. Interestingly, the original method shows signs of failure at μ = 0.5 in the N = 1000, d = 50 benchmark graphs (cf. Fig. 11(b)). We believe this corresponds to the formation of monster communities discussed in Section IV A. The number of nodes and the average degree of the benchmark graphs in effect dictate the number and sizes of the original communities generated. The results hence suggest that denser and less modularized graphs are relatively prone to the formation of monster communities. However, the application of hop attenuation, as exemplified in Fig. 11(b), greatly improves the overall performance of LPA in such scenarios.

Importantly, as opposed to label propagation, we can see that the CNM algorithm's performance depends not merely on the mixing parameter but also on the average degree of the network. The resolution limit of modularity maximization is reflected by CNM's worse performance in graphs having a smaller average degree. Although in most configurations all algorithms expectedly manage to uncover a modularity value of a similar magnitude, the real accuracy in terms of NMI does not follow. This finding corresponds to the notion in [17] that modularity maximisation does not simply translate to actual communities.
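For readers who wish to reproduce the accuracy measure, the sketch below computes an NMI between a detected partition and the planted one. The text does not spell out the normalisation, so the commonly used form 2I(A;B)/(H(A)+H(B)) is assumed here; the function name `nmi` and the list-based input are likewise illustrative.

```python
import math
from collections import Counter


def nmi(found, truth):
    """Normalised mutual information of two partitions given as label lists."""
    n = len(found)
    joint = Counter(zip(found, truth))
    a = Counter(found)
    b = Counter(truth)
    mutual = sum(
        c / n * math.log(c * n / (a[x] * b[y])) for (x, y), c in joint.items()
    )
    h_a = -sum(c / n * math.log(c / n) for c in a.values())
    h_b = -sum(c / n * math.log(c / n) for c in b.values())
    if h_a + h_b == 0.0:
        return 1.0  # both partitions place every node in one community
    return 2.0 * mutual / (h_a + h_b)


same = nmi([0, 0, 1, 1], [5, 5, 7, 7])       # identical up to relabelling
unrelated = nmi([0, 0, 1, 1], [0, 1, 0, 1])  # statistically independent
```

The measure is 1 when the two partitions coincide up to a renaming of labels and 0 when they are independent, which is why it can expose failures that a similar modularity value would hide.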

VI. CONCLUSIONS

In this paper we have empirically analysed a scalable, efficient and accurate community detection algorithm. We discussed the behaviours, and emphasized the importance, of both the synchronous and asynchronous implementations of the algorithm. We suggested potential heuristics that can be applied to improve its average detection performance and adaptability. Most importantly, we contrasted the algorithm with modularity-gain-based methods in terms of community detection accuracy and observed how it can potentially be applied online and concurrently in large-scale and real-time dynamic networks.

Understanding the dynamics of this algorithm would be the major future work in this discipline before one devises further heuristics to improve the algorithm. We believe that each notion discussed in Section IV is worthy of further inspection. An equally important point is to analyse, mathematically or empirically, how best to adapt the algorithm to different types of networks by means of the added heuristics. How do different network topologies and models affect the algorithm's convergent behaviour? These are all valuable questions to be investigated in future work.

In summary, we show that label propagation, with the appropriate modifications, is a more reliable and efficient method for detecting communities in large-scale networks than popular existing methods. We trust that with further understanding and analysis, epidemic-based community detection would be of substantial value to the field.


Network                     Size    Directed Links  Q (Claimed)            Peak Q (Sync)  Peak Q (Async)
Amazon Purchase (Mar '03)   409687  4929260         0.745 [13]             0.724          0.727
Actor Collaboration         374511  30052912        0.528 [4], 0.719 [14]  0.642          0.660

TABLE I: The results correspond to the peak modularity achieved in 10 iterations or less, with f = Deg and m = 0.1, and a gradually decreasing δ as discussed in Section IV B.

FIG. 11: Average performance comparisons between the three algorithms (CNM, LPA, LPA-δ) in the benchmark graphs with size N and average degree d. Panels: (a) N = 1000, d = 15; (b) N = 1000, d = 50; (c) N = 5000, d = 15; (d) N = 5000, d = 50. Each panel plots modularity Q and NMI against the mixing parameter μ (0.1 to 0.6). Both versions of LPA here are asynchronous. LPA-δ implements a gradually decreasing δ as discussed in Section IV B. All benchmark graphs have power-law degree and cluster-size distributions with exponent 3 and 2. For N = 1000 the results are the average over 100 realisations; for N = 5000, over 10 realisations.

Acknowledgments

We thank Franco Bagnoli and Vito Latora for helpful comments. We are grateful to Eric Promislow for providing us with the Amazon network data. Network visualisations are carried out on Cytoscape [24]. This project is supported by EC IST SOCIALNETS - Grant agreement number 217141.

[1] J. Kleinberg, in STOC '00: Proceedings of the thirty-second annual ACM symposium on Theory of computing (ACM, New York, NY, USA, 2000), pp. 163–170, ISBN 1-58113-184-4.
[2] D. J. Watts and S. H. Strogatz, Nature (London) 393, 440 (1998).
[3] A. Mislove, M. Marcon, K. P. Gummadi, P. Druschel, and B. Bhattacharjee, in IMC '07: Proceedings of the 7th ACM SIGCOMM conference on Internet measurement (ACM, New York, NY, USA, 2007), pp. 29–42.
[4] U. N. Raghavan, R. Albert, and S. Kumara, Phys. Rev. E 76, 036106 (pages 11) (2007).
[5] D. Lusseau and M. E. J. Newman, Proc. R. Soc. London B 271, S477 (2004).
[6] G. W. Flake, S. Lawrence, et al., IEEE Computer 35, 66 (2002).
[7] E. Ravasz, A. L. Somera, D. A. Mongru, Z. N. Oltvai, and A. L. Barabasi, Science 297, 1551 (2002).
[8] M. E. J. Newman, Eur. Phys. J. B 38, 321 (2004).
[9] L. Danon, J. Duch, et al., J. Stat. Mech. p. P09008 (2005).
[10] S. Boccaletti, V. Latora, Y. Moreno, M. Chavez, and D.-U. Hwang, Phys. Rep. 424, 175 (2006).
[11] F. Radicchi, C. Castellano, F. Cecconi, V. Loreto, and D. Parisi, PNAS 101, 2658 (2004).
[12] M. E. Newman and M. Girvan, Phys. Rev. E 69, 026113 (2004).
[13] A. Clauset, M. E. J. Newman, and C. Moore, Phys. Rev. E 70, 066111 (2004).
[14] L. Danon, A. Díaz-Guilera, and A. Arenas, J. Stat. Mech. 2006, P11010 (2006).
[15] K. Wakita and T. Tsurumi, in WWW '07: Proceedings of the 16th international conference on World Wide Web (ACM, New York, NY, USA, 2007), pp. 1275–1276.
[16] V. D. Blondel, J.-L. Guillaume, R. Lambiotte, and E. Lefebvre, J. Stat. Mech. 10, P10008 (2008), 0803.0476.
[17] S. Fortunato and M. Barthelemy, PNAS 104, 36 (2007).
[18] R. Albert and A.-L. Barabasi, Rev. Mod. Phys. 74, 47 (2002).
[19] A. Lancichinetti, S. Fortunato, and F. Radicchi, Phys. Rev. E 78, 046110 (2008).
[20] G. Palla, I. Derenyi, et al., Nature 435, 814 (2005), ISSN 0028-0836.
[21] P. Hui, J. Crowcroft, and E. Yoneki, in MobiHoc '08: Proceedings of the 9th ACM international symposium on Mobile ad hoc networking and computing (ACM, New York, NY, USA, 2008), pp. 241–250, ISBN 978-1-60558-073-9.
[22] A.-L. Barabasi and R. Albert, Science 286, 509 (1999).
[23] See the Facebook.com Six Degrees Project.
[24] http://www.cytoscape.org


ferent types of networks (Section IV). Section V gives detailed comparisons between the label propagation algorithm (LPA) and fast modularity-optimization algorithms. We conclude the paper with future directions of research in Section VI.

II. RELATED WORK

Community detection in complex networks has attracted ample attention in recent years. Apart from OSNs, researchers have engaged in community analysis in various types of networks. In the case of the Internet, examples of communities are found in autonomous systems [5] and indeed in web pages of similar topic [6]. In biological networks, it is widely believed that modular structure plays a crucial role in biological functions [7]. Related literature such as [8, 9, 10] may serve as introductory reading, and also includes methodological overviews and comparative studies of different algorithms.

The detection of community structure in a network is generally intended as a procedure for mapping the network into a tree [11], known as a dendrogram. In this tree, the leaves are the nodes and the branches join them or (at a higher level) groups of them, thus identifying a hierarchy of communities. Nodes can either be agglomerated successively starting from single nodes (agglomerative), or the whole network can be recursively partitioned (divisive). Newman and Girvan introduced a seminal divisive algorithm in which the selection of the edge to be cut is based on the value of its edge betweenness [12], the number of shortest paths between all node pairs running through it. It is clear that when a graph is made of tightly bound clusters, each loosely interconnected, all shortest paths between nodes in different clusters have to go through the few inter-cluster connections, which therefore have a large betweenness value. Recursively removing these large-betweenness edges would partition the network into communities of different sizes.
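To make the notion of edge betweenness concrete, the sketch below counts shortest paths through each edge of a small unweighted, undirected graph. It uses a breadth-first accumulation in the style of Brandes' algorithm, which is our choice for the example and not necessarily the procedure used in [12]; the function name and adjacency-list format are assumptions.

```python
from collections import defaultdict, deque


def edge_betweenness(adj):
    """Shortest paths through each edge, counting every node pair once."""
    bet = defaultdict(float)
    for s in adj:
        dist = {s: 0}
        sigma = defaultdict(float)  # number of shortest paths from s
        sigma[s] = 1.0
        preds = defaultdict(list)
        order = []
        queue = deque([s])
        while queue:  # breadth-first search from s
            u = queue.popleft()
            order.append(u)
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
                if dist[v] == dist[u] + 1:
                    sigma[v] += sigma[u]
                    preds[v].append(u)
        dep = defaultdict(float)
        for v in reversed(order):  # accumulate from the farthest nodes back
            for u in preds[v]:
                share = sigma[u] / sigma[v] * (1.0 + dep[v])
                bet[tuple(sorted((u, v)))] += share
                dep[u] += share
    return {edge: value / 2.0 for edge, value in bet.items()}


# Two triangles joined by the single bridge (2, 3).
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
scores = edge_betweenness(adj)
bridge = max(scores, key=scores.get)
```

In this toy graph all nine shortest paths between the two triangles run through the bridge, so it is the first edge a divisive method of this kind would remove.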

Quantitatively, however, we need a metric to measure how well the community detection is progressing; otherwise most algorithms would either continue until every node is split into a single community or all join together into one. Newman and Girvan proposed in [12] a measure of the goodness of communities called modularity: for the set of uncovered communities C, the modularity is defined to be

$$Q = \sum_{c \in C} \left( \frac{I_c}{E} - \left( \frac{2 I_c + O_c}{2E} \right)^{2} \right), \qquad (1)$$

where $I_c$ indicates the total number of internal edges that have both ends in c, $O_c$ is the number of outgoing edges that have only one end in c, and E is the total number of edges. This measure essentially compares the number of links inside a given module with the expected value for a randomized graph of the same size and same degree sequence.
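Eq. (1) can be evaluated directly from an edge list, as in the sketch below. It assumes an undirected, unweighted graph in which each edge is listed once; the function name and data layout are illustrative.

```python
def modularity(edges, labels):
    """Eq. (1): sum over communities of I_c/E - ((2 I_c + O_c) / (2 E))^2."""
    total = len(edges)
    internal = {}  # I_c: edges with both ends in community c
    outgoing = {}  # O_c: edges with exactly one end in community c
    for u, v in edges:
        cu, cv = labels[u], labels[v]
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
        else:
            outgoing[cu] = outgoing.get(cu, 0) + 1
            outgoing[cv] = outgoing.get(cv, 0) + 1
    q = 0.0
    for c in set(labels.values()):
        i_c = internal.get(c, 0)
        o_c = outgoing.get(c, 0)
        q += i_c / total - ((2 * i_c + o_c) / (2 * total)) ** 2
    return q


# Two triangles joined by one bridge, split into their natural communities.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
split = modularity(edges, {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"})
merged = modularity(edges, {v: "a" for v in range(6)})
```

For the natural split, each community has $I_c = 3$ and $O_c = 1$ with E = 7, giving Q = 2(3/7 − 1/4) = 5/14, whereas placing every node in one community gives Q = 0.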

The concept of modularity has gained such popularity that it has not only been used as a measure of the community partitioning of a network but also as a key fitness indicator in various community detection algorithms. The algorithm proposed by Clauset, Newman and Moore (CNM) [13], which greedily combines nodes/communities to optimize modularity gain, is perhaps to date one of the most popular algorithms for detecting communities in relatively large-scale networks. At the time CNM was proposed, it was the only algorithm capable of community detection on networks of size 500000 in a matter of hours. Over the years, several variations of CNM have been proposed [14, 15, 16]. Most of them concentrate on more efficient data structures as well as modularity-gain heuristics to improve the overall performance. A recent adaptation [16], which treats newly combined communities as a single node after each iteration, is able to identify community structure on a network containing 1 billion edges in a matter of hours.

It is vital, however, to understand that modularity is not a scale-invariant measure, and hence, by blindly relying on its maximization, detection of communities smaller than a certain size is impossible. This is famously known as the resolution limit [17] of modularity-based algorithms. Since LPA does not involve modularity optimization, its community detection capability is scale-independent and therefore not affected by the resolution limit, as will be shown in Section V.

III. DISCUSSION

Here we give a brief discussion of the characteristics of the algorithm, as well as some preliminary results from applying the algorithm to the OSN described above.

A. A “near linear time” algorithm

One can consider the label spreading as a simplified but specific case of epidemic spreading, where all individuals are considered infectious with their own unique disease. Each person is infected by a disease that is prevalent in his or her neighbourhood. Fig. 2 depicts the labelling convergence seen in a 4-clique. The number of clusters monotonically decreases each iteration as certain labels become extinct due to domination by other labels. Apart from certain rare and exceptional cases, the labelling self-organises to an unsupervised equilibrium efficiently.

FIG. 2: Each node is looked at in a certain order and a new label is selected. The above shows how nodes in a 4-clique self-organise into one single community in one iteration.

As suggested in [4], certain properties may prevent the equilibrium from occurring. For instance, a network with a bipartite structure might cause the system to oscillate if the algorithm is run synchronously, i.e., all nodes are updated together only after they have selected their maximal labels. Running the algorithm asynchronously in a randomized order every iteration, as suggested in the paper, may give less definitive results but solves the problem. It was also suggested that a node that has two equally maximal labels to choose from may fail to converge, and that an extra stopping criterion to prevent the switching of labels would have to be in place. It is, however, noted in our implementation that including the concerned label itself in the maximal-label consideration effectively avoids all the above non-convergent behaviours and the requirement for an extra stopping criterion.
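The asynchronous update described above can be sketched as follows. This is a minimal illustration, not the Java implementation used in the experiments: counting the node's own label as one extra vote is our reading of “including the concerned label itself”, and the function name, the `max_iter` and `seed` parameters and the stopping rule (a full sweep without any change) are assumptions.

```python
import random
from collections import Counter


def lpa_async(neighbours, max_iter=20, seed=0):
    """Asynchronous label propagation; returns (labels, sweeps used)."""
    rng = random.Random(seed)
    labels = {v: v for v in neighbours}  # every node starts with a unique label
    order = list(neighbours)
    for sweep in range(1, max_iter + 1):
        rng.shuffle(order)  # a new random visiting order in every iteration
        changed = False
        for v in order:
            counts = Counter(labels[u] for u in neighbours[v])
            counts[labels[v]] += 1  # the node's own label also gets a vote
            top = max(counts.values())
            candidates = sorted(l for l, c in counts.items() if c == top)
            choice = rng.choice(candidates)  # random pick among maximal labels
            if choice != labels[v]:
                labels[v] = choice  # nodes visited later see this immediately
                changed = True
        if not changed:
            return labels, sweep
    return labels, max_iter


# Two disconnected triangles: labels can never cross between them.
neighbours = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}
labels, sweeps = lpa_async(neighbours)
```

A synchronous variant would compute every new label from a frozen copy of `labels` and apply all changes at the end of the sweep.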

In one iteration, each node's neighbours are examined and the maximal label is chosen. The running time of this algorithm is therefore O(knd), where k is the number of iterations, n the number of nodes and d the average degree of nodes. Note that nd can also be described by m, the number of edges. The number of iterations required, k, is dependent on the stopping criterion but is not very well understood. [4] suggested that the number of iterations required is independent of the number of nodes and that after 5 iterations, 95% of their nodes are already accurately clustered.

Since labels can hardly affect nodes outside their local densely connected substructures, the convergent behaviour should depend on these substructures rather than on the whole network. This is confirmed by preliminary testing and directs us to look at substructures which can ultimately become the community. Experiments show that the average numbers of iterations required for the labelling to converge (no change in labels) in an N-clique for the asynchronous and synchronous implementations are 2.1 and 3.6 respectively, highly independent of N. To further investigate the average convergent behaviour on a substructure, we look at Fig. 3, which summarises the relationship between the number of iterations required before convergence, k, and the pairwise connectivity p that controls the edge density in a random graph of size N (where p = 1 corresponds to the N-clique).

In both implementations, we see that k remains fairly constant over both N and p until p reaches a certain threshold, at which point we begin to see an inverse dependence between N and k. The overall averages of the asynchronous and synchronous implementations in this case are 2.8 and 5.2.

FIG. 3: The above plots show the number of iterations required before convergence for both the synchronous and asynchronous implementations on a random graph of size N with probability of pairwise connection p. All values here are averaged over 100 realisations. [Axes: N (×100) from 1 to 10, p from 0.1 to 1, k from 2 to 11.]

Let us, however, consider another simple but non-random topology. Suppose we start off with an N-clique; at each jth construction, the graph is grown by connecting the N − j most recently joined nodes to the new node (cf. Fig. 4).

FIG. 4: This substructure is constructed on an N-clique, N = 25, by attaching each new node labelled l, N < l < 2N, to existing nodes l − 1, ..., 2(l − N); it thus contains 49 (2N − 1) nodes and 600 (N(N − 1)) edges.
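The construction can be checked with a few lines of code. The sketch below follows the node numbering of the caption (clique nodes 1 to N, new nodes N + 1 to 2N − 1); the function name `grow_clique` is hypothetical.

```python
def grow_clique(n):
    """Build the grown N-clique and return its node and edge sets."""
    # Start from a clique on nodes 1..n.
    edges = {(u, v) for u in range(1, n + 1) for v in range(u + 1, n + 1)}
    # New node l is attached to the 2n - l most recently joined nodes,
    # i.e. to nodes 2(l - n), ..., l - 1.
    for l in range(n + 1, 2 * n):
        for u in range(2 * (l - n), l):
            edges.add((u, l))
    nodes = {x for edge in edges for x in edge}
    return nodes, edges


nodes, edges = grow_clique(25)
```

The clique contributes N(N − 1)/2 edges and the new nodes contribute (N − 1) + (N − 2) + ... + 1 = N(N − 1)/2 more, which gives the N(N − 1) total quoted in the caption.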

These structures, by construction, will converge into a single community by LPA. Without worrying about how abundant such patterns are in real-world communities, we look at the convergent behaviour shown in Fig. 5. The trend clearly reveals that k grows logarithmically with respect to N. We therefore suggest the possible worst case of k to be of the order of O(log N), where N is the size of the largest substructure with a topology similar to the above. Indeed, we anticipate real-world social networks to contain highly heterogeneous substructures, which may be intricately connected to affect each other's convergence. We thus consider the understanding of the convergent behaviour in large complex networks such as OSNs as a direction for further investigation.

FIG. 5: The relationship between the number of iterations required before convergence, k, of both implementations (Sync, Async) and the size N of the aforementioned structure. All values here are averaged over 100 realisations. [Axes: N from 1 to 1000 (logarithmic), k from 0 to 25.]

B. Community Detection in OSN

We carry out community detection on the aforementioned OSN using a desktop PC with 4GB RAM and a 2.4GHz quad-core processor running 32-bit Java VM 1.6. Due to limited memory, we restrict the number of nodes to the first million. Since the order of nodes in the original data corresponds to that of a breadth-first web crawl, this way of “cutting off” the data is equivalent to extracting a snowball sample. As discussed in [3], snowball methods are known to over-sample high-degree nodes, under-sample low-degree ones and overestimate the average node degree. This is seen in the higher average degree of the subgraph, 250, compared to 106 for the original graph. Nonetheless, since the purpose of this paper is to evaluate the algorithm on large-scale networks, the sampled network satisfies our requirements. The sampled subgraph contains 1000000 nodes and 58793458 directed links. Convergent behaviours of the two different implementations are shown in Fig. 6.

A crucial point is that in a complex network as large as this, the so-called “convergence” does not necessarily yield an optimal result in terms of modularity. For example, we see that the asynchronous implementation took on average merely 5 iterations to achieve a maximum modularity but has highly volatile results in different runs, as depicted by the shaded area in the figure. On the other hand, the synchronous implementation achieved maximum modularity much more slowly than the asynchronous version, but its performance on average is much more stable (its performance range is thus omitted). The performances of these two different implementations are equally important to understand and utilise. Further discussions on the implications of these implementations and their utilization are given in Section IV.

FIG. 6: Average performances of asynchronous and synchronous LPA. Values are averaged over 5 runs. Shaded area denotes the range of the performances of the asynchronous implementation. [Axes: iteration from 1 to 20; modularity Q from 0 to 1; number of communities from 0 to 50000.]

Each single-threaded iteration finishes in a matter of tens of seconds, and thus, depending on the stopping criterion, it can take as little as 8 to 10 minutes to reach peak performance. Extrapolating the time required with respect to the number of edges, the algorithm without any optimization should be able to detect communities on a graph with 1 billion edges in less than 180 minutes, a magnitude similar to that in [16].
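As a rough check of this extrapolation, assume that the running time scales linearly with the number of edges and take the upper figure of 10 minutes for the 58793458-link sample. Both the linear scaling and the choice of the upper figure are simplifying assumptions made for this back-of-the-envelope estimate:

```latex
t_{10^{9}} \approx 10\ \text{min} \times \frac{10^{9}}{58\,793\,458}
          \approx 10\ \text{min} \times 17.0
          \approx 170\ \text{min} < 180\ \text{min}.
```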

Fig. 7 shows the distribution of community/cluster size collected from a specific run of the asynchronous version of the algorithm when the modularity peaked at 0.638. The size distribution of communities within the OSN follows a 2-part power-law distribution in the complementary CDF with an estimated coefficient of 1.1. The interested reader is referred to [10, 18] for discussions on the characteristics of different networks.

IV. A MORE RELIABLE AND EFFICIENT ALGORITHM

In this section we discuss potential modifications to the algorithm to increase its reliability, functionality and computational efficiency.

A. Hop Attenuation & Node Preference

Due to the “epidemic” nature of the algorithm, a major limitation is noted where a certain “label epidemic” manages to “plague” a large number of nodes. To be exact, in some runs a certain community of size over 500000 (50% of the number of nodes) is formed, as opposed to the thousands of other communities whose sizes average in the order of hundreds, greatly contributing to the modularity drop after the peak. We conjecture that this is partially due to the asynchronous nature of the algorithm and the initial formation of communities, where certain communities do not form strong enough links to prevent a foreign “epidemic” from sweeping through. Further experiments confirm that the synchronous version of the algorithm slows down the formation of such “monster” communities but does not prevent them.

FIG. 7: The community-size distribution of communities uncovered by the algorithm, which follows a 2-part power law. [Axes: community size s from 1 to 10^6 and complementary CDF P(S > s) from 0.0001 to 1, both logarithmic.]

We propose an extension to this algorithm by adding a score associated with the label, which decreases as the label traverses from its origin. A node is initially given a score of 1.0 for its label. After a node i has collected from its neighbourhood $N_i$ all the respective labels and scores, the calculation of the new maximal label $L'_i$ can be generalised by

$$L'_i = \operatorname*{argmax}_{L} \sum_{i' \in N_i(L)} s_{i'}(L) \cdot f(i')^{m} \cdot w_{i'i}, \qquad (2)$$

where $L_i$ is the label of node i, $s_i(L)$ is the hop score of label L in i, $w_{i'i}$ is the weight of the edge between i' and i (we sum the weights in both directions if the graph is directed) and f(i) is any arbitrary comparable characteristic for any node i. For instance, if we define f(i) = Deg(i), then when m > 0 more preference is given to nodes with more neighbours, and when m < 0, less. The final step is to assign a new attenuated score s' to the new label L' of i by subtracting the hop attenuation δ, 0 < δ < 1:

$$s'_i(L'_i) = \Bigl( \max_{i' \in N_i(L'_i)} s_{i'}(L'_i) \Bigr) - \delta, \qquad (3)$$

where $N_i(L)$ is the set of neighbours of i that have label L. The value δ governs how far a particular label can spread as a function of the geodesic distance from its origin. This additional parameter adds extra uncertainty to the algorithm but may encourage stronger local communities to form before a large cluster starts to dominate. Ideally, the selection of δ can even be adaptive to the current number of iterations, the neighbourhood of the node concerned and perhaps some a priori network parameters. We investigate the use of varying δ in the next section and assume here a constant value for δ. Note that this setting may induce a negative feedback loop; we therefore let δ = 0 if the selected label is equal to the current label.
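A single node update under Eqs. (2) and (3) can be sketched as below. The function name `select_label`, the callable arguments `f` and `weight`, and the deterministic tie-breaking are assumptions made for the example; in particular, ties between maximal labels are resolved here by first occurrence rather than randomly.

```python
from collections import defaultdict


def select_label(i, neighbours, labels, scores, weight, f, m, delta):
    """Return the new label of node i and its attenuated hop score."""
    totals = defaultdict(float)
    for j in neighbours[i]:
        # Eq. (2): hop score times node preference times edge weight.
        totals[labels[j]] += scores[j] * (f(j) ** m) * weight(j, i)
    new_label = max(totals, key=totals.get)  # first maximal label wins ties
    best = max(scores[j] for j in neighbours[i] if labels[j] == new_label)
    # Eq. (3), with delta set to 0 when the label does not change.
    attenuation = 0.0 if new_label == labels[i] else delta
    return new_label, best - attenuation


neighbours = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
labels = {0: "x", 1: "a", 2: "a", 3: "b"}
scores = {0: 1.0, 1: 0.75, 2: 0.5, 3: 1.0}


def degree(j):
    return len(neighbours[j])


def unit_weight(j, i):
    return 1.0


adopted = select_label(0, neighbours, labels, scores, unit_weight, degree, 0.1, 0.25)
labels[0] = "a"
kept = select_label(0, neighbours, labels, scores, unit_weight, degree, 0.1, 0.25)
```

Label “a” wins with a total of 1.25 against 1.0 for “b”; the adopting node receives the best incoming score for “a”, 0.75, reduced by δ = 0.25, whereas a node that already carries “a” keeps the unattenuated 0.75.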

As discussed, modularity has been widely used in the literature as a metric to contrast the community detection capabilities of different algorithms on real-world networks. Whilst high modularity indicates a significantly modularised structure compared with a randomised graph of the network concerned, the correspondence between high modularity and accurately partitioned communities is not well understood, due to the resolution limit of modularity. Here we attempt to contrast the behaviours of the algorithms on the OSN based on modularity, but shall not draw strong conclusions on the accuracy of the community detection for the above reasons. In Section V, a novel benchmark proposed by Lancichinetti et al. [19], capable of revealing the resolution limit of modularity-based algorithms, is used for further comparisons.

Fig. 8 depicts the average performance curves over 5 runs for both versions of the algorithm, applying hop attenuation and preferential linkage. The results suggest that on both implementations a slight, but not too high, preference for high-degree nodes (m > 0) can speed up the process of achieving peak modularity on the OSN, but also gives rise to a steeper drop, as shown in Fig. 8(a). We believe, however, that different magnitudes of m simply restrict the choice of nodes to different subsets, some of which may contribute to a “global pandemic” and some of which may not. Simply using the degree of a node may not be a heuristic generic enough for different networks. Further study is required to understand, if at all possible, how to deduce a generic preference on neighbourhood labels every iteration without resorting to a global metric, which is costly. Nonetheless, we show that giving preference to certain nodes over others when deciding between labels to accept can be beneficial in terms of the number of iterations needed to achieve maximum modularity.

Looking at hop attenuation, we find that the application of δ indeed deters the occurrence of the “monster clusters” as expected, thereby preventing the modularity drop after certain iterations. But it was also obvious that high hop attenuation prevented the healthy growth of the communities and restricted the increase in modularity (cf. Fig. 8(b), 8(e)). Moreover, we conjecture that hop attenuation restrains the spread of the label from an arbitrary center and thereby favours the formation of circular clusters. This suppression of non-circular clusters may lead to the suboptimal performance in terms of modularity, as shown in the asynchronous case (Fig. 8(e)). Finally, from Fig. 8(c) and 8(f), we see that combining both parameters on average benefits both versions of the algorithm in achieving a community partitioning of high modularity more efficiently and consistently.

FIG. 8: Average performance comparisons of the synchronous and asynchronous implementations with varying δ and m over 5 runs. Panels (a)-(c) are synchronous and (d)-(f) asynchronous; each plots modularity Q (0 to 0.7) against iteration (2 to 20). Panels (a), (d): m = 0, 0.1, 0.2, −0.1. Panels (b), (e): δ = 0, 0.05, 0.1, 0.2. Panels (c), (f): (m, δ) = (0, 0), (0.1, 0.05), (0.05, 0.1), (0.1, 0.1).

B. Hierarchical & Overlapping Communities

Communities in certain networks are known to be hier-archical For instance students in the same classes oftenform some strong local communities while these commu-nities say of the same school in turn form a larger butrelatively weaker community As discussed in Section IImost CNM-based algorithms are inherently hierarchicalsince communities are agglomerated by greedy local op-timization of modularity gain

We present two simple modifications to the originalmethod to enable the detection of hierarchical commu-nities Firstly let us consider the application of hop at-tenuation on label propagation Suppose we impose avery high hop attenuation at the beginning we expectcommunities of small diameter to form If we then grad-ually relax the attenuation value we should expect thesesmall communities to merge into larger ones In order toachieve this we modify eq (3) as follows

sprimei(Li) = 1 minus δ(dG(O(Li) i)) (4)

where

dG(O(Li) i) = 1 + miniprimeisinNi(Li)

dG(O(Li) iprime) (5)

Essentially instead of receiving the current hop scoresfrom the neighbourhood and carry out a subtraction thescore is now determined by the actual geodesic distance(dG) from the label Lrsquos origin denoted by O(L) and thefunction δ This gives greater flexibility of δ in terms ofgeodesic distances and can facilitate iteration-dependenthop attenuation as required here with slight extra com-putation cost
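The distance-based score of Eqs. (4) and (5) can be sketched as follows. This is only an illustration under stated assumptions: the names `distance_based_score` and `linear_delta` are hypothetical, `dist[j]` is assumed to hold the hop distance already recorded at neighbour `j` for its current label, and a linear form δ(d) = rate × d is assumed purely for the example (the text leaves the function δ open).

```python
def distance_based_score(i, label, adj, labels, dist, delta_fn):
    """Return (distance, score) for node i adopting `label`, per Eqs. (4)-(5).

    adj:      dict node -> set of neighbours
    labels:   dict node -> current label
    dist:     dict node -> hop distance from the origin of that node's label
    delta_fn: attenuation as a function of distance (assumed form)
    """
    # Eq. (5): one hop further than the closest neighbour carrying the label.
    d = 1 + min(dist[j] for j in adj[i] if labels[j] == label)
    # Eq. (4): the score depends only on the distance and the function delta.
    return d, 1.0 - delta_fn(d)


def linear_delta(rate):
    """Hypothetical attenuation schedule: delta(d) = rate * d."""
    return lambda d: rate * d
```

Lowering `rate` from one iteration to the next is one way to realise the gradual relaxation described above, since the score is recomputed from the distance each time rather than accumulated.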

Our second proposal is inspired by [16]: we can similarly treat newly combined communities as single nodes and use the number of inter-community edges as the weight of the edges between these "fresh condensed" nodes. Instead of doing this every iteration, we can apply a certain amount of hop attenuation, or a hard limit on the diameter of the community, and condense only after an equilibrium is reached.
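A minimal sketch of the condensation step, assuming an undirected edge list and comparable community labels; the function name `condense` is hypothetical and intra-community edges are simply dropped here, which is a simplification rather than the procedure of [16].

```python
from collections import Counter


def condense(edges, labels):
    """Collapse each community into one node.

    edges:  iterable of undirected (u, v) pairs
    labels: dict node -> community label
    Returns a dict mapping (label_a, label_b) with label_a < label_b to the
    number of edges running between the two communities.
    """
    weights = Counter()
    for u, v in edges:
        a, b = labels[u], labels[v]
        if a != b:  # only inter-community edges contribute to the weight
            weights[tuple(sorted((a, b)))] += 1
    return dict(weights)
```

The resulting weighted graph can be fed back into label propagation, with the weights playing the role of the edge weights in the update rule.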

Fig. 9 gives an illustration of the first modification applied to a subgraph of the OSN. Note that this modification depends very much on the initial labelling of nodes, because it determines the initial centers of these small communities.

FIG. 9: (Color online) Community detection in the OSN (n = 3000) by gradually decreasing hop attenuation (δ = 0.5 at the top with Q = 0.64; δ = 0 at the bottom with Q = 0.78). Nodes with 3 or fewer neighbours are filtered to ease the visualisation.

Another important question, which was also briefly addressed in [4], is the problem of overlapping communities [20], i.e. nodes can often be considered members of several communities. From previous sections we understood that the asynchronous version of the algorithm is capable of generating very different results in different runs. This is exactly what [4] suggested as a potential solution: to re-run the algorithm several times. In a parallel environment, however, the results tend to fluctuate much less. An initial attempt was to increase the number of labels passed each time between nodes to achieve a similar effect. Preliminary experiments indicate limited success, since this setting hampers the convergence process, possibly due to latent labels switching back and forth in the system. Another possibility is to exploit the fact that nodes on the border of a community have different proportions (purity) of neighbours from other communities. We can potentially use that as a measure of membership, but this may only be applicable to such boundary nodes.

C. Optimization

The individual inspection of every node, particularly those with many neighbours, is a crucial factor in determining the speed of the algorithm. Putting aside efficient data structures and prudent programming, an obvious optimization we can make without much compromise on performance is to update high-degree nodes selectively. The reader may have realised that after a certain number of iterations it would be pointless to update nodes that are well inside a cluster. These nodes are surrounded by nodes with the same label, which are unlikely to change for the same reason. We employ a simple purity measure of neighbours to selectively update nodes that are on the borders of their communities. In other words, we only update nodes for which the proportion of neighbours sharing the maximal label is less than a certain percentage. Indeed, small-degree nodes are likely to be skipped in early iterations in this setting, but their contributions to the overall community structure and performance are almost insignificant. We run the modified algorithm with thresholds set at 100% (equivalent to the unmodified algorithm), 80%, 60% and 40% to examine the trade-off between accuracy and speed.

[FIG. 10 plot: horizontal axis Iteration (1 to 10); vertical axis 0 to 100; curves for "Nodes avoided (time saved)" and "Abs. diff. in Q" at thresholds 80, 60 and 40.]

FIG. 10: The difference in modularity and speed of the optimized modifications with the original.

Figure 10 reveals that after the 1st iteration the extra constraint increasingly avoids updating nodes. As more nodes settle into stable clusters, less and less time is required per iteration. Interestingly, even with a threshold as low as 40%, the absolute difference in modularity compared to the original setting is reasonably small, and the overall running time can be significantly reduced.
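The purity filter can be sketched as a predicate evaluated before each node update. The name `needs_update` is hypothetical, and treating a threshold of 1.0 as "always update" is an assumption made so that the 100% setting reproduces the unmodified algorithm, as stated in the text.

```python
from collections import Counter


def needs_update(i, adj, labels, threshold=0.8):
    """Decide whether node i should be re-examined in this iteration.

    adj:       dict node -> set of neighbours
    labels:    dict node -> current label
    threshold: purity (fraction in [0, 1]) below which a node is updated
    """
    if threshold >= 1.0:
        return True   # assumed: 100% means every node is updated
    neighbours = adj[i]
    if not neighbours:
        return False  # isolated node: nothing to learn from
    # Purity: fraction of neighbours sharing the most common neighbour label.
    top_count = Counter(labels[j] for j in neighbours).most_common(1)[0][1]
    return top_count / len(neighbours) < threshold
```

Nodes deep inside a cluster have purity close to 1 and are skipped, which is where the time saving in Fig. 10 would come from.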

D. Parallel & Online Analysis

Clear advantages of label propagation include the ease with which it can be parallelized and its potential for online implementation in real-time networks. Since each node is only required to know information about its neighbours and updates itself according to common rules, parallelism can be easily achieved. This brings us to another technical point: when the algorithm is completely parallelized, even without explicit synchronization, it would tend to behave like the synchronous version of the algorithm. This is the key reason why we have stressed in this paper that improving both the synchronous and asynchronous versions of the algorithm is equally important.

The running time in a parallel environment effectively reduces to k if there are Θ(n) machines. This can be achieved in a real-world ubiquitous system such as a mobile ad-hoc network (MANET), or potentially on OSNs themselves (if members are willing to contribute their computational power), in real time. For instance, social information such as the community structure is known to benefit routing in MANETs [21]. Moreover, in such scenarios the space requirement for storing link information would become decentralised and thus insignificant.

On the same note, we see great potential in adapting the algorithm for online community detection in real-time dynamic networks, where the presence of nodes and edges is constantly evolving. The microscopic movements and intermittent presence of nodes contribute to changes in the weights of the edges. These in turn result in five distinct macroscopic behaviours of communities, namely growth, shrinkage, union, division and death. The challenge is to detect local changes without the need for a global update, given limited computational resources or time constraints. We believe label propagation is particularly suited to this paradigm and thus propose this as future work.

V. COMPARISONS

We first look at two relatively large and previously studied networks for comparison. These are, respectively, the Amazon purchasing network analysed in [13] and the actor collaboration network [22]. As done in [13], we assume all edges to be undirected to ease the analysis. With the added heuristics, the algorithm is able to perform within 5% of CNM and 10% of the adaptation by Danon, Díaz-Guilera and Arenas (CNM-DDA) [14] in terms of modularity (cf. Table I). LPA, however, achieves the result in a matter of minutes, which is unparalleled by the above.

For a more standardized comparison, we turn to the recently proposed benchmark graphs by Lancichinetti et al. [19], an extension of the well-known GN benchmark [12] which incorporates more realistic scale-free degree and cluster-size distributions. We follow closely the implementation of the benchmark graphs as described in [19] and compare the original LPA with the improved version on graphs of size 1000 and 5000. To contrast label propagation with general fast modularity maximisation algorithms, we also run the benchmarks on the CNM algorithm.

As shown in Fig. 11, both implementations achieve superior accuracy over CNM in terms of normalised mutual information (NMI), even up to a mixing parameter of 0.6. Interestingly, the original method shows signs of failure at μ = 0.5 in the N = 1000, d = 50 benchmark graphs (cf. Fig. 11(b)). We believe this corresponds to the formation of monster communities discussed in Section IV A. The number of nodes and the average degree of the benchmark graphs in effect dictate the number and sizes of the original communities generated. The results hence indicate that denser and less modularized graphs are relatively prone to the formation of monster communities. However, the application of hop attenuation, as exemplified in Fig. 11(b), greatly improves the overall performance of LPA in such scenarios.

Importantly, as opposed to label propagation, we can see that the CNM algorithm's performance depends not merely on the mixing parameter but also on the average degree of the network. The resolution limit of modularity maximization is reflected by CNM's worse performance in graphs having a smaller average degree. Although in most configurations all algorithms expectedly manage to uncover a modularity value of similar magnitude, the real accuracy in terms of NMI does not follow. This finding corresponds to the notion in [17] that modularity maximisation does not simply translate to actual communities.
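For readers unfamiliar with the accuracy measure, a minimal sketch of NMI between a planted partition and a detected one is given below. The normalisation 2I(A;B)/(H(A)+H(B)) is assumed here because it is the form commonly used in community detection benchmarks; the text does not spell out which normalisation was used, and the function name `nmi` is hypothetical.

```python
import math
from collections import Counter


def nmi(part_a, part_b):
    """Normalised mutual information of two partitions of the same node set.

    part_a, part_b: dicts mapping node -> community label.
    Returns a value in [0, 1]; 1 means the partitions are identical up to
    a renaming of the labels.
    """
    n = len(part_a)
    size_a = Counter(part_a.values())
    size_b = Counter(part_b.values())
    joint = Counter((part_a[v], part_b[v]) for v in part_a)

    mutual = sum(
        (c / n) * math.log(c * n / (size_a[a] * size_b[b]))
        for (a, b), c in joint.items()
    )
    h_a = -sum((c / n) * math.log(c / n) for c in size_a.values())
    h_b = -sum((c / n) * math.log(c / n) for c in size_b.values())
    if h_a + h_b == 0:
        return 1.0  # both partitions place every node in a single community
    return 2 * mutual / (h_a + h_b)
```

Unlike modularity, this score compares the detected labels directly with the planted ones, which is why two partitions of similar Q can differ widely in NMI.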

VI. CONCLUSIONS

In this paper we have empirically analysed a scalable, efficient and accurate community detection algorithm. We discussed the behaviours, and emphasized the importance, of both the synchronous and asynchronous implementations of the algorithm. We suggested potential heuristics that can be applied to improve its average detection performance and adaptability. Most importantly, we contrasted the algorithm with modularity-gain based methods in terms of community detection accuracy and observed how it can potentially be applied online and concurrently in large-scale and real-time dynamic networks.

Understanding the dynamics of this algorithm would be the major future work before one devises further heuristics to improve the algorithm. We believe that each notion discussed in Section IV is worthy of further inspection. An equally important point is to analyse, mathematically or empirically, how best to adapt the algorithm to different types of networks through the added heuristics. How do different network topologies and models affect the algorithm's convergent behaviour? These are all valuable questions to be investigated in future work.

In summary, we show that label propagation, with the appropriate modifications, is a more reliable and efficient method for detecting communities in large-scale networks than popular existing methods. We trust that, with further understanding and analysis, epidemic-based community detection would be of substantial value to the field.

Network                     Size      Directed Links   Q (Claimed)             Peak Q (Sync)   Peak Q (Async)
Amazon Purchase (Mar'03)    409687    4929260          0.745 [13]              0.724           0.727
Actor Collaboration         374511    30052912         0.528 [4], 0.719 [14]   0.642           0.660

TABLE I: The results correspond to the peak modularity achieved in 10 iterations or less with f = Deg and m = 0.1 and a gradually decreasing δ, as discussed in Section IV B.

[FIG. 11 panels: (a) N = 1000, d = 15; (b) N = 1000, d = 50; (c) N = 5000, d = 15; (d) N = 5000, d = 50. Each panel plots NMI and modularity Q against the mixing parameter μ (0.1 to 0.6) for CNM, LPA and LPA-δ.]

FIG. 11: Average performance comparisons between the three algorithms in the benchmark graphs with size N and average degree d. Both versions of LPA here are asynchronous; LPA-δ implements a gradually decreasing δ as discussed in Section IV B. All benchmark graphs have power-law degree and cluster-size distributions with exponents 3 and 2. For N = 1000 the results are the average over 100 realisations; for N = 5000, over 10 realisations.

Acknowledgments

We thank Franco Bagnoli and Vito Latora for helpful comments. We are grateful to Eric Promislow for providing us with the Amazon network data. Network visualisations are carried out on Cytoscape [24]. This project is supported by EC IST SOCIALNETS, Grant agreement number 217141.

[1] J. Kleinberg, in STOC '00: Proceedings of the thirty-second annual ACM symposium on Theory of computing (ACM, New York, NY, USA, 2000), pp. 163-170, ISBN 1-58113-184-4.
[2] D. J. Watts and S. H. Strogatz, Nature (London) 393, 440 (1998).
[3] A. Mislove, M. Marcon, K. P. Gummadi, P. Druschel, and B. Bhattacharjee, in IMC '07: Proceedings of the 7th ACM SIGCOMM conference on Internet measurement (ACM, New York, NY, USA, 2007), pp. 29-42.
[4] U. N. Raghavan, R. Albert, and S. Kumara, Phys. Rev. E 76, 036106 (11 pages) (2007).
[5] D. Lusseau and M. E. J. Newman, Proc. R. Soc. London B 271, S477 (2004).
[6] G. W. Flake, S. Lawrence, et al., IEEE Computer 35, 66 (2002).
[7] E. Ravasz, A. L. Somera, D. A. Mongru, Z. N. Oltvai, and A. L. Barabási, Science 297, 1551 (2002).
[8] M. E. J. Newman, Eur. Phys. J. B 38, 321 (2004).
[9] L. Danon, J. Duch, et al., J. Stat. Mech., P09008 (2005).
[10] S. Boccaletti, V. Latora, Y. Moreno, M. Chavez, and D.-U. Hwang, Phys. Rep. 424, 175 (2006).
[11] F. Radicchi, C. Castellano, F. Cecconi, V. Loreto, and D. Parisi, PNAS 101, 2658 (2004).
[12] M. E. Newman and M. Girvan, Phys. Rev. E 69, 026113 (2004).
[13] A. Clauset, M. E. J. Newman, and C. Moore, Phys. Rev. E 70, 066111 (2004).
[14] L. Danon, A. Díaz-Guilera, and A. Arenas, J. Stat. Mech. 2006, P11010 (2006).
[15] K. Wakita and T. Tsurumi, in WWW '07: Proceedings of the 16th international conference on World Wide Web (ACM, New York, NY, USA, 2007), pp. 1275-1276.
[16] V. D. Blondel, J.-L. Guillaume, R. Lambiotte, and E. Lefebvre, J. Stat. Mech. 10, P10008 (2008), 0803.0476.
[17] S. Fortunato and M. Barthélemy, PNAS 104, 36 (2007).
[18] R. Albert and A.-L. Barabási, Rev. Mod. Phys. 74, 47 (2002).
[19] A. Lancichinetti, S. Fortunato, and F. Radicchi, Phys. Rev. E 78, 046110 (2008).
[20] G. Palla, I. Derényi, et al., Nature 435, 814 (2005), ISSN 0028-0836.
[21] P. Hui, J. Crowcroft, and E. Yoneki, in MobiHoc '08: Proceedings of the 9th ACM international symposium on Mobile ad hoc networking and computing (ACM, New York, NY, USA, 2008), pp. 241-250, ISBN 978-1-60558-073-9.
[22] A.-L. Barabási and R. Albert, Science 286, 509 (1999).
[23] See the Facebook.com Six Degrees Project.
[24] http://www.cytoscape.org

FIG. 2: Each node is looked at in a certain order and a new label is selected. The above shows how nodes in a 4-clique self-organise into one single community in one iteration.

the paper may result in less definitive results but solves the problem. It was also suggested that a node that has two equally maximal labels to choose from may fail to converge, and that an extra stopping criterion to prevent the switching of labels would have to be in place. We note, however, that in our implementation, including the label of the node concerned in the maximal label consideration effectively avoids all the above non-convergent behaviours and removes the requirement for an extra stopping criterion.

In one iteration, each node's neighbours are examined and the maximal label is chosen. The running time of this algorithm is therefore O(knd), where k is the number of iterations, n the number of nodes and d the average degree of nodes. Note that nd can also be described by m, the number of edges. The number of iterations required, k, is dependent on the stopping criterion but is not very well understood. [4] suggested that the number of iterations required is independent of the number of nodes and that after 5 iterations, 95% of their nodes are already accurately clustered.
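The basic procedure can be sketched as below, as an illustration rather than the authors' Java implementation. The function name `label_propagation`, the adjacency-dict representation, the random visiting order and the "no label changed in a full pass" stopping rule are assumptions of this sketch; counting the node's own label follows the convergence remark above.

```python
import random
from collections import Counter


def label_propagation(adj, max_iter=20, seed=0):
    """Asynchronous label propagation on an undirected graph.

    adj: dict mapping each node to the set of its neighbours.
    Returns (labels, iterations_used).
    """
    rng = random.Random(seed)
    labels = {v: v for v in adj}      # every node starts with a unique label
    nodes = list(adj)
    for iteration in range(1, max_iter + 1):
        rng.shuffle(nodes)            # visit nodes in a random order
        changed = False
        for v in nodes:
            # Count the labels of the neighbours; the node's own label is
            # counted as well, as described in the text.
            counts = Counter(labels[u] for u in adj[v])
            counts[labels[v]] += 1
            best = max(counts.values())
            candidates = [lab for lab, c in counts.items() if c == best]
            new_label = rng.choice(candidates)   # ties are broken at random
            if new_label != labels[v]:
                labels[v] = new_label  # asynchronous: visible immediately
                changed = True
        if not changed:               # assumed stopping rule: a stable pass
            return labels, iteration
    return labels, max_iter
```

Each pass touches every edge a constant number of times, which is where the O(m) cost per iteration comes from. A synchronous variant would compute all new labels from a frozen copy of `labels` and swap them in at the end of the pass.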

Since labels can hardly affect nodes outside their local densely connected substructures, the convergent behaviour should depend on these substructures rather than on the whole network. This is confirmed by preliminary testing and directs us to look at substructures which can ultimately become communities. Experiments show that the average numbers of iterations required for the labelling to converge (no change in labels) in an N-clique for the asynchronous and synchronous implementations are 2.1 and 3.6 respectively, largely independent of N. To further investigate the average convergent behaviour on a substructure, we look at Fig. 3, which summarises the relationship between the number of iterations required before convergence, k, and the pairwise connectivity p that controls the edge density in a random graph of size N (where p = 1 corresponds to the N-clique).

In both implementations, we see that k remains fairly constant over both N and p until p reaches a certain threshold, beyond which we begin to see an inverse dependence between N and k. The overall averages of the asynchronous and synchronous implementations in this case are 2.8 and 5.2.

[FIG. 3 plots: k (2 to 11) as a function of N (×100, 1 to 10) and p (0.1 to 1), for the synchronous and asynchronous implementations.]

FIG. 3: The above plots show the number of iterations required before convergence for both the synchronous and asynchronous implementations on a random graph of size N with probability of pairwise connection p. All values here are averaged over 100 realisations.

Let us, however, consider another simple but non-random topology. Suppose we start off with an N-clique; at each jth construction the graph is grown by connecting the N − j most recently joined nodes to the new node (cf. Fig. 4).

FIG. 4: This substructure is constructed on an N-clique, N = 25, by attaching each new node labelled l, N < l < 2N, to existing nodes l − 1, ..., 2(l − N); it thus contains 49 (2N − 1) nodes and 600 (N(N − 1)) edges.

These structures will, by construction, converge into a single community under LPA. Without worrying about how abundant such patterns are in real-world communities, we look at the convergent behaviour shown in Fig. 5. The trend clearly reveals that k grows logarithmically with respect to N. We therefore suggest a possible worst case for k of the order of O(log N), where N is the size of the largest substructure with a topology similar to the above. Indeed, we anticipate real-world social networks to contain highly heterogeneous substructures, which may be intricately connected and affect each other's convergence. We thus consider the understanding of the convergent behaviour in large complex networks such as OSNs a direction for further investigation.

[FIG. 5 plot: k (0 to 25) against N (1 to 1000, logarithmic axis) for the synchronous and asynchronous implementations.]

FIG. 5: The relationship between the number of iterations required before convergence, k, of both implementations and the size N of the aforementioned structure. All values here are averaged over 100 realisations.

B. Community Detection in OSN

We carry out community detection on the aforementioned OSN using a desktop PC with 4 GB RAM and a 2.4 GHz quad-core processor running a 32-bit Java VM 1.6. Due to limited memory, we restrict the number of nodes to the first million. Since the order of nodes in the original data corresponds to that of a breadth-first web crawl, this way of "cutting off" the data is equivalent to extracting a snowball sample. As discussed in [3], snowball methods are known to over-sample high-degree nodes, under-sample low-degree ones and overestimate the average node degree. This is seen in the higher average degree of the subgraph, 250 compared to 106 for the original graph. Nonetheless, since the purpose of this paper is to evaluate the algorithm on large-scale networks, the sampled network satisfies our requirements. The sampled subgraph contains 1,000,000 nodes and 58,793,458 directed links. Convergent behaviours of the two different implementations are shown in Fig. 6.

A crucial point is that in a complex network as large as this, the so-called "convergence" does not necessarily yield an optimal result in terms of modularity. For example, the asynchronous implementation took on average merely 5 iterations to achieve a maximum modularity but has highly volatile results across runs, as depicted by the shaded area in the figure. On the other hand, the synchronous implementation achieved maximum modularity much more slowly than the asynchronous version, but its performance on average is much more stable (its performance range is thus omitted). The performances of these two implementations are equally important to understand and utilise. Further discussions on the implications of these implementations and their utilization are given in Section IV.

[FIG. 6 plot: modularity Q (left axis, 0 to 1) and number of communities (right axis, 0 to 50000) against iteration (1 to 20); curves: Range (Async), Avg Q (Async), Avg Q (Sync), Avg No. of Comm. (Async), Avg No. of Comm. (Sync).]

FIG. 6: Average performances of asynchronous and synchronous LPA. Values are averaged over 5 runs. The shaded area denotes the range of the performances of the asynchronous implementation.

Each single-threaded iteration finishes in a matter of tens of seconds and thus, depending on the stopping criterion, it can take as little as 8 to 10 minutes to reach peak performance. Extrapolating the time required with respect to the number of edges, the algorithm without any optimization should be able to detect communities on a graph with 1 billion edges in less than 180 minutes, a magnitude similar to that in [16].
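One way to read this extrapolation, assuming the cost per iteration scales linearly with the number of edges, the number of iterations to peak stays the same, and taking the upper figure of 10 minutes for the 58,793,458-link sample:

$$ \frac{10^{9}}{58\,793\,458} \approx 17.0, \qquad 17.0 \times 10\ \text{min} \approx 170\ \text{min} < 180\ \text{min}. $$

The linear scaling is the O(m) per-iteration cost stated earlier; memory effects on a graph of that size are not accounted for in this estimate.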

Fig. 7 shows the distribution of community (cluster) sizes collected from a specific run of the asynchronous version of the algorithm when the modularity peaked at 0.638. The size distribution of communities within the OSN follows a 2-part power-law distribution in the complementary CDF, with an estimated coefficient of 11. The interested reader is referred to [10, 18] for discussions on the characteristics of different networks.

IV. A MORE RELIABLE AND EFFICIENT ALGORITHM

In this section we discuss potential modifications to the algorithm to increase its reliability, functionality and computational efficiency.

A. Hop Attenuation & Node Preference

Due to the "epidemic" nature of the algorithm, a major limitation is that a certain "label epidemic" may manage to "plague" a large number of nodes. To be exact, in some runs a certain community of size over 500,000 (50% of the number of nodes) is formed, as opposed to the thousands of other counterparts whose sizes average in the 100s, contributing greatly to the modularity drop after the peak. We conjecture that this is partially due to the asynchronous nature of the algorithm and to the initial formation of communities, where certain communities do not form strong enough links to prevent a foreign "epidemic" from sweeping through. Further experiments confirm that the synchronous version of the algorithm slows down the formation of such "monster" communities but does not prevent them.

[FIG. 7 plot: complementary CDF P(S > s) (0.0001 to 1) against community size s (1 to 10^6), both axes logarithmic.]

FIG. 7: The community-size distribution of communities uncovered by the algorithm, which follows a 2-part power law.

We propose an extension to this algorithm by adding a score associated with each label, which decreases as the label traverses from its origin. A node is initially given a score of 1.0 for its label. After a node $i$ has collected from its neighbourhood $N_i$ all the respective labels and scores, the calculation of the new maximal label $L'_i$ can be generalised by

$$ L'_i = \operatorname*{argmax}_{L} \sum_{i' \in N_i} s_{i'}(L_{i'}) \cdot f(i')^{m} \cdot w_{i'i}, \tag{2} $$

where $L_i$ is the label of node $i$, $s_i(L)$ is the hop score of label $L$ in $i$, $w_{i'i}$ is the weight of the edge between $i'$ and $i$ (we sum the weights in both directions if the graph is directed) and $f(i)$ is any arbitrary comparable characteristic of node $i$. For instance, if we define $f(i) = \mathrm{Deg}(i)$, then when m > 0 more preference is given to nodes with more neighbours, and when m < 0, less. The final step is to assign a new attenuated score $s'$ to the new label $L'$ of $i$ by subtracting the hop attenuation δ, 0 < δ < 1:

$$ s'_i(L'_i) = \Big( \max_{i' \in N_i(L'_i)} s_{i'}(L'_i) \Big) - \delta, \tag{3} $$

where $N_i(L)$ is the set of neighbours of $i$ that have label $L$. The value δ governs how far a particular label can spread as a function of the geodesic distance from its origin. This additional parameter adds extra uncertainty to the algorithm but may encourage a stronger local community to form before a large cluster starts to dominate. Ideally, the selection of δ can even be adaptive to the current iteration number, the neighbourhood of the node concerned and perhaps some a priori network parameters. We investigate the use of a varying δ in the next section and assume here a constant value for δ. Note that this setting may induce a negative feedback loop; we therefore let δ = 0 if the selected label is equal to the current label.
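A single-node update following Eqs. (2) and (3) might look like the sketch below. Several choices here are assumptions rather than the authors' implementation: the sum in Eq. (2) is read as running, for each candidate label L, over the neighbours currently carrying L; f is taken to be the node degree; ties are broken deterministically by taking the smallest label; and "δ = 0 if the selected label is equal to the current label" is applied literally inside Eq. (3). The function name `update_label` is hypothetical.

```python
from collections import defaultdict


def update_label(i, adj, labels, scores, delta=0.1, m=0.1, weight=None):
    """Return (new_label, new_score) for node i.

    adj:    dict node -> set of neighbours
    labels: dict node -> current label
    scores: dict node -> hop score of that node's current label
    weight: optional function (u, v) -> edge weight; defaults to 1.0
    """
    if weight is None:
        weight = lambda u, v: 1.0

    # Eq. (2): accumulate score * degree^m * edge weight per neighbour label.
    totals = defaultdict(float)
    for j in adj[i]:
        totals[labels[j]] += scores[j] * (len(adj[j]) ** m) * weight(j, i)
    if not totals:
        return labels[i], scores[i]   # isolated node keeps its state

    best = max(totals.values())
    # Assumed tie-break: smallest label among the maximal ones.
    new_label = min(lab for lab, total in totals.items() if total == best)

    # Eq. (3): best score among neighbours carrying the chosen label, minus
    # the hop attenuation (set to zero when the label does not change).
    attenuation = 0.0 if new_label == labels[i] else delta
    new_score = max(scores[j] for j in adj[i] if labels[j] == new_label)
    return new_label, new_score - attenuation
```

Because a label's score shrinks by δ with every hop, its contribution to the sum in Eq. (2) weakens with distance from the origin, which is the mechanism intended to hold back a single dominant label.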

As discussed, modularity has been widely used in the literature as a metric to contrast the community detection capabilities of different algorithms on real-world networks. Whilst high modularity indicates a significantly modularised structure compared with a randomised graph of the network concerned, the correspondence between high modularity and accurately partitioned communities is not well understood, due to the resolution limit of modularity. Here we attempt to contrast the behaviours of the algorithms on the OSN based on modularity, but shall not draw strong conclusions on the accuracy of the community detection for the above reasons. In Section V, a novel benchmark proposed by Lancichinetti et al. [19], capable of revealing the resolution limit of modularity-based algorithms, is used for further comparisons.

Fig. 8 depicts the average performance curves over 5 runs for both versions of the algorithm, applying hop attenuation and preferential linkage. The results suggest that, on both implementations, a slight (but not too high) preference for high-degree nodes (m > 0) can speed up the process of achieving peak modularity on the OSN, but also gives rise to a steeper drop, as shown in Fig. 8(a). We believe, however, that different magnitudes of m simply restrict the choice of nodes to different subsets, some of which may contribute to a "global pandemic" and some of which may not. Simply using the degree of a node may not be a heuristic generic enough for different networks. Further study is required to understand, if at all possible, how to deduce a generic preference on neighbourhood labels every iteration without resorting to a global metric, which is costly. Nonetheless, we show that giving preference to certain nodes over others when deciding between labels to accept can be beneficial in terms of the number of iterations needed to achieve maximum modularity.

Looking at hop attenuation we find that the applica-tion of δ indeed deters the occurrence of the ldquomonsterclustersrdquo as expected and thereby preventing the modu-larity drop after certain iterations But it was also ob-vious that high hop attenuation prevented the healthygrowing of the communities and restricted the increasein modularity (cf Fig 8(b)8(e)) Moreover we con-jecture that hop attenuation restrains the spread of thelabel from an arbitrary center and thereby the formationof circular clusters This suppression in forming non-circular clusters may lead to the suboptimal performancein terms of modularity as shown in the asynchronous

6

SynchronousM

odul

arity

Q

Avg Q(m = 0)m = 01m = 02

m = -01

0

01

02

03

04

05

06

07

2 4 6 8 10 12 14 16 18 20

(a)

Avg Q(δ = 0)δ = 005

δ = 01δ = 02

0

01

02

03

04

05

06

07

2 4 6 8 10 12 14 16 18 20

(b)

Avg Q(m = 0 δ = 0)m = 01 δ = 005m = 005 δ = 01m = 01 δ = 01

0

01

02

03

04

05

06

07

2 4 6 8 10 12 14 16 18 20

(c)

Asynchronous

Modula

rity

Q

Iteration

Avg Q(m = 0)m = 01m = 02

m = -01

0

01

02

03

04

05

06

07

2 4 6 8 10 12 14 16 18 20

(d)

Iteration

Avg Q(δ = 0)δ = 005

δ = 01δ = 02

0

01

02

03

04

05

06

07

2 4 6 8 10 12 14 16 18 20

(e)

Iteration

Avg Q(m = 0 δ = 0)m = 01 δ = 005m = 005 δ = 01m = 01 δ = 01

0

01

02

03

04

05

06

07

2 4 6 8 10 12 14 16 18 20

(f)

FIG 8 Average performance comparisons of the synchronous and asynchronous implementations with varying δ and m over5 Runs

case (Fig 8(e)) Finally from Fig 8(c) and 8(f) wesee that combining both parameters on average bene-fits both versions of the algorithm in achieving a commu-nity partitioning of high modularity more efficiently andconsistently

B Hierarchical amp Overlapping Communities

Communities in certain networks are known to be hier-archical For instance students in the same classes oftenform some strong local communities while these commu-nities say of the same school in turn form a larger butrelatively weaker community As discussed in Section IImost CNM-based algorithms are inherently hierarchicalsince communities are agglomerated by greedy local op-timization of modularity gain

We present two simple modifications to the originalmethod to enable the detection of hierarchical commu-nities Firstly let us consider the application of hop at-tenuation on label propagation Suppose we impose avery high hop attenuation at the beginning we expectcommunities of small diameter to form If we then grad-ually relax the attenuation value we should expect thesesmall communities to merge into larger ones In order toachieve this we modify eq (3) as follows

sprimei(Li) = 1 minus δ(dG(O(Li) i)) (4)

where

dG(O(Li) i) = 1 + miniprimeisinNi(Li)

dG(O(Li) iprime) (5)

Essentially instead of receiving the current hop scoresfrom the neighbourhood and carry out a subtraction thescore is now determined by the actual geodesic distance(dG) from the label Lrsquos origin denoted by O(L) and thefunction δ This gives greater flexibility of δ in terms ofgeodesic distances and can facilitate iteration-dependenthop attenuation as required here with slight extra com-putation cost

Our second proposal is inspired from [16] where wecan similarly treat newly combined communities as a sin-gle node and use the number of inter-community edgesas the weight of edges between these ldquofresh condensedrdquonodes Instead of doing this every iteration we can applycertain amount of hop attenuation or hard limit in termsof the diameter of the community and do this after anequilibrium is reached

Fig 9 gives an illustration of the first modificationapplied on a subgraph on the OSN Note that this mod-ification depends very much on the initial labelling of

7

nodes because it determines the initial centers of thesesmall communities

FIG 9 (Color online) Community detection in the OSN(n=3000) by gradually decreasing hop attenuation (δ = 05at the top with Q = 064 δ = 0 at the bottom with Q = 078)Nodes with 3 or less neighbours are filtered to ease the visu-alisation

Another important question which was also briefly ad-dressed in [4] is the problem of overlapping communi-ties [20] ie nodes can often be considered a memberof different communities From previous sections we un-derstood that different asynchronous version of the al-gorithm is capable of generating very different results indifferent runs This is exactly how [4] suggested as a po-tential solution - to re-run the algorithm several timesIn a parallel environment however the results tend to bemuch less fluctuating An initial attempt was to increasethe number of labels passed each time between nodes toachieve a similar effect Preliminary experiments indi-cate limited success since this setting hampers the con-vergence process possibly due to the potential of latentlabels switching back and fro in the system Another pos-sibility is the exploit the fact that nodes on the borderof its community have different proportions (purity) ofneighbours from other communities We can potentiallyuse that as a measure of membership but this indeed mayonly be applicable to such boundary nodes

C Optimization

The individual inspection of every node, particularly those with many neighbours, is a crucial factor in determining the speed of the algorithm. Putting aside efficient data structures and prudent programming, an obvious optimization we can make without much compromise on performance is to selectively update high-degree nodes. The reader may have realised that after a certain number of iterations it would be pointless to update nodes that are well inside a cluster. These nodes are surrounded by nodes with the same label, which are unlikely to change for the same reason. We employ a simple purity measure of neighbours to selectively update nodes that are on the borders of their communities. In other words, we only update nodes for which the proportion of neighbours sharing the maximal label is less than a certain percentage. Indeed, small-degree nodes are likely to be avoided in early iterations in this setting, but their contributions to the overall community structure and performance are almost insignificant. We carry out the modified algorithm with thresholds set at 100% (equivalent to the unmodified algorithm), 80%, 60% and 40% to examine the trade-off between accuracy and speed.

FIG. 10: The difference in modularity and speed of the optimized modifications with the original. [Plot: nodes avoided (time saved) and absolute difference in Q, on a scale of 0 to 100, against iterations 1 to 10, for thresholds of 80, 60 and 40.]

Figure 10 reveals that after the 1st iteration the extra constraint increasingly avoids updating nodes. As more nodes settle in stable clusters, less and less time is required per iteration. Interestingly, even with a threshold as low as 40%, the absolute difference in modularity compared to the original setting is reasonably small, and the overall running time can be significantly reduced.
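The purity test described in this subsection can be sketched in a few lines of Python. This is only an illustration: the name `needs_update`, the adjacency-dict representation and the strict `<` comparison at the boundary are assumptions, not details given in the text.

```python
from collections import Counter


def needs_update(i, neighbours, label, threshold=0.8):
    """Return True if node i looks like a border node (hypothetical helper).

    A node is updated only when the fraction of its neighbours that share
    the maximal (most frequent) label is below the threshold.
    """
    nbrs = neighbours[i]
    if not nbrs:
        return False  # isolated node: nothing to propagate
    counts = Counter(label[j] for j in nbrs)
    purity = max(counts.values()) / len(nbrs)
    return purity < threshold  # strict '<' follows the wording of the text


# Node 0 has five neighbours: three carry label "a", two carry label "b".
neighbours = {0: [1, 2, 3, 4, 5]}
label = {0: "a", 1: "a", 2: "a", 3: "a", 4: "b", 5: "b"}
update_at_80 = needs_update(0, neighbours, label, threshold=0.8)
update_at_40 = needs_update(0, neighbours, label, threshold=0.4)
```

With a purity of 3/5, the node is still updated at the 80% threshold but skipped at 40%, which is the trade-off between accuracy and speed examined in Fig. 10.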

D. Parallel & Online Analysis

Clear advantages of label propagation include the ease with which it can be parallelized and its potential online implementation in real-time networks. Since each node is required only to know information about its neighbours and updates itself according to the common rules, parallelism can be easily achieved. This brings us to another technical point: when the algorithm is completely parallelized, even without explicit synchronization, it tends to behave like the synchronous version of the algorithm. This is the key reason why we have stressed in this paper that improving both the synchronous and asynchronous versions of the algorithm is equally important.

The running time in a parallel environment effectively reduces to k, the number of iterations, if there are Θ(n) machines. This can be achieved in real-world ubiquitous systems such as a mobile ad-hoc network (MANET), or potentially on OSNs themselves (if members are willing to contribute their computational power), in real time. For instance, social information such as the community structure is known to benefit routing in MANETs [21]. Moreover, in such scenarios the space requirement for storing link information would become decentralised and thus insignificant.

On the same note, we see great potential in adapting the algorithm for online community detection in real-time dynamic networks, where the presence of nodes and edges is constantly evolving. The microscopic movements and intermittent presence of nodes contribute to changes in the weights of the edges. These in turn result in five distinct macroscopic behaviours of communities, namely growth, shrinkage, union, division and death. The challenge is to detect local changes without the need for a global update, given limited computational resources or time constraints. We believe label propagation is particularly suited to this paradigm and thus propose this as future work.

V. COMPARISONS

We first look at two relatively large and previously studied networks for comparison. These networks are, respectively, the Amazon purchasing network analysed in [13] and the actor collaboration network [22]. As done in [13], we assume all edges to be undirected to ease the analysis. With the added heuristics, the algorithm is able to perform within 5% of CNM and 10% of the adaptation by Danon, Díaz-Guilera and Arenas (CNM-DDA) [14] in terms of modularity (cf. Table I). LPA, however, achieves the result in a matter of minutes, which is unparalleled by the above.

For a more standardized comparison, we turn to the recently proposed benchmark graphs by Lancichinetti et al. [19], an extension of the well-known GN benchmark [12] which incorporates more realistic scale-free degree and cluster-size distributions. We follow closely the implementation of the benchmark graphs as described in [19] and compare the original LPA with the improved version on graphs of size 1000 and 5000. To contrast label propagation with general fast modularity maximisation algorithms, we also run the benchmarks on the CNM algorithm.

As shown in Fig. 11, both implementations achieve superior accuracy over CNM in terms of normalised mutual information (NMI), even up to a mixing parameter of 0.6. Interestingly, the original method shows signs of failure at μ = 0.5 in the N = 1000, d = 50 benchmark graphs (cf. Fig. 11(b)). We believe this corresponds to the formation of monster communities discussed in Section IV A. The number of nodes and the average degree of the benchmark graphs in effect dictate the number and sizes of the original communities generated. The results hence indicate that denser and less modularized graphs are relatively prone to the formation of monster communities. However, the application of hop attenuation, as exemplified in Fig. 11(b), greatly improves the overall performance of LPA in such scenarios.

Importantly, as opposed to label propagation, the CNM algorithm's performance depends not merely on the mixing parameter but also on the average degree of the network. The resolution limit of modularity maximization is reflected in CNM's worse performance on graphs with a smaller average degree. Although in most configurations all algorithms, as expected, manage to uncover a modularity value of similar magnitude, the real accuracy in terms of NMI does not follow. This finding corresponds to the notion in [17] that modularity maximisation does not simply translate to actual communities.
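For readers who wish to reproduce the accuracy measure, the following Python sketch computes a normalised mutual information between a planted partition and a detected one. It uses the normalisation 2I(A;B)/(H(A)+H(B)), which is common in the community-detection literature; this section does not state which variant was used in the experiments, so that choice, like the function name `nmi`, is an assumption.

```python
from collections import Counter
from math import log


def nmi(a, b):
    """NMI of two partitions given as equal-length label sequences."""
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    joint = Counter(zip(a, b))
    # Shannon entropies of the two partitions
    h_a = -sum(c / n * log(c / n) for c in count_a.values())
    h_b = -sum(c / n * log(c / n) for c in count_b.values())
    # Mutual information from the joint label counts
    mutual = sum(
        c / n * log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in joint.items()
    )
    if h_a + h_b == 0:
        return 1.0  # both partitions put every node in a single community
    return 2 * mutual / (h_a + h_b)


planted = [0, 0, 0, 1, 1, 1]
relabelled = [7, 7, 7, 3, 3, 3]  # same split, different label names
unrelated = [0, 1, 2, 0, 1, 2]   # carries no information about `planted`
```

Because NMI compares the detected partition with the planted one directly, it can expose cases where two algorithms reach a similar modularity but recover the communities with very different accuracy.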

VI. CONCLUSIONS

In this paper we have empirically analysed a scalable, efficient and accurate community detection algorithm. We discussed the behaviours, and emphasized the importance, of both the synchronous and asynchronous implementations of the algorithm. We suggested potential heuristics that can be applied to improve its average detection performance and adaptability. Most importantly, we contrasted the algorithm with modularity-gain based methods in terms of community detection accuracy, and observed how it can potentially be applied online and concurrently in large-scale, real-time dynamic networks.

Understanding the dynamics of this algorithm would be the major future work in this area, before one devises further heuristics to improve it. We believe that each notion discussed in Section IV is worthy of further inspection. An equally important point is to analyse, mathematically or empirically, how best to adapt the algorithm to different types of networks through the added heuristics. How do different network topologies and models affect the algorithm's convergent behaviour? These are all valuable questions to be investigated in future work.

In summary, we show that label propagation with the appropriate modifications is a more reliable and efficient method for detecting communities in large-scale networks than popular existing methods. We trust that, with further understanding and analysis, epidemic-based community detection will be of substantial value to the field.

| Network | Size | Directed Links | Q (Claimed) | Peak Q (Sync) | Peak Q (Async) |
|---|---|---|---|---|---|
| Amazon Purchase (Mar '03) | 409687 | 4929260 | 0.745 [13] | 0.724 | 0.727 |
| Actor Collaboration | 374511 | 30052912 | 0.528 [4], 0.719 [14] | 0.642 | 0.660 |

TABLE I: The results correspond to the peak modularity achieved in 10 iterations or less with f = Deg and m = 0.1 and a gradually decreasing δ, as discussed in Section IV B.

FIG. 11: Average performance comparisons between the three algorithms (CNM, LPA and LPA-δ) in the benchmark graphs with size N and average degree d. Panels: (a) N = 1000, d = 15; (b) N = 1000, d = 50; (c) N = 5000, d = 15; (d) N = 5000, d = 50; each panel plots modularity Q and NMI against the mixing parameter μ. Both versions of LPA here are asynchronous. LPA-δ implements a gradually decreasing δ as discussed in Section IV B. All benchmark graphs have power-law degree and cluster-size distributions with exponents 3 and 2. For N = 1000 the results are the average over 100 realisations; for N = 5000, over 10 realisations.

Acknowledgments

We thank Franco Bagnoli and Vito Latora for helpful comments. We are grateful to Eric Promislow for providing us with the Amazon network data. Network visualisations were carried out with Cytoscape [24]. This project is supported by EC IST SOCIALNETS, grant agreement number 217141.

[1] J. Kleinberg, in STOC '00: Proceedings of the thirty-second annual ACM symposium on Theory of computing (ACM, New York, NY, USA, 2000), pp. 163–170, ISBN 1-58113-184-4.
[2] D. J. Watts and S. H. Strogatz, Nature (London) 393, 440 (1998).
[3] A. Mislove, M. Marcon, K. P. Gummadi, P. Druschel, and B. Bhattacharjee, in IMC '07: Proceedings of the 7th ACM SIGCOMM conference on Internet measurement (ACM, New York, NY, USA, 2007), pp. 29–42.
[4] U. N. Raghavan, R. Albert, and S. Kumara, Phys. Rev. E 76, 036106 (pages 11) (2007).
[5] D. Lusseau and M. E. J. Newman, Proc. R. Soc. London B 271, S477 (2004).
[6] G. W. Flake, S. Lawrence, et al., IEEE Computer 35, 66 (2002).
[7] E. Ravasz, A. L. Somera, D. A. Mongru, Z. N. Oltvai, and A. L. Barabasi, Science 297, 1551 (2002).
[8] M. E. J. Newman, Eur. Phys. J. B 38, 321 (2004).
[9] L. Danon, J. Duch, et al., J. Stat. Mech. p. P09008 (2005).
[10] S. Boccaletti, V. Latora, Y. Moreno, M. Chavez, and D.-U. Hwang, Phys. Rep. 424, 175 (2006).
[11] F. Radicchi, C. Castellano, F. Cecconi, V. Loreto, and D. Parisi, PNAS 101, 2658 (2004).
[12] M. E. Newman and M. Girvan, Phys. Rev. E 69, 026113 (2004).
[13] A. Clauset, M. E. J. Newman, and C. Moore, Phys. Rev. E 70, 066111 (2004).
[14] L. Danon, A. Díaz-Guilera, and A. Arenas, J. Stat. Mech. 2006, P11010 (2006).
[15] K. Wakita and T. Tsurumi, in WWW '07: Proceedings of the 16th international conference on World Wide Web (ACM, New York, NY, USA, 2007), pp. 1275–1276.
[16] V. D. Blondel, J.-L. Guillaume, R. Lambiotte, and E. Lefebvre, J. Stat. Mech. 10, P10008 (2008), 0803.0476.
[17] S. Fortunato and M. Barthelemy, PNAS 104, 36 (2007).
[18] R. Albert and A.-L. Barabasi, Rev. Mod. Phys. 74, 47 (2002).
[19] A. Lancichinetti, S. Fortunato, and F. Radicchi, Phys. Rev. E 78, 046110 (2008).
[20] G. Palla, I. Derenyi, et al., Nature 435, 814 (2005), ISSN 0028-0836.
[21] P. Hui, J. Crowcroft, and E. Yoneki, in MobiHoc '08: Proceedings of the 9th ACM international symposium on Mobile ad hoc networking and computing (ACM, New York, NY, USA, 2008), pp. 241–250, ISBN 978-1-60558-073-9.
[22] A.-L. Barabasi and R. Albert, Science 286, 509 (1999).
[23] See the Facebook.com Six Degrees Project.
[24] http://www.cytoscape.org

FIG. 5: The relationship between the number of iterations required before convergence, k, of both implementations (Sync and Async) and the size N of the aforementioned structure. All values here are averaged over 100 realisations.

convergence. We thus consider understanding the convergent behaviour in large complex networks such as OSNs to be a direction for further investigation.

B. Community Detection in OSN

We carry out community detection on the aforementioned OSN using a desktop PC with 4 GB of RAM and a 2.4 GHz quad-core processor, running a 32-bit Java VM 1.6. Due to limited memory, we restrict the number of nodes to the first million. Since the order of nodes in the original data corresponds to that of a breadth-first web crawl, this way of “cutting off” the data is equivalent to extracting a snowball sample. As discussed in [3], snowball methods are known to over-sample high-degree nodes, under-sample low-degree ones and overestimate the average node degree. This is seen in the higher average degree of the subgraph, 250, compared to 106 for the original graph. Nonetheless, since the purpose of this paper is to evaluate the algorithm on large-scale networks, the sampled network satisfies our requirements. The sampled subgraph contains 1,000,000 nodes and 58,793,458 directed links. Convergent behaviours of the two different implementations are shown in Fig. 6.

A crucial point is that in a complex network as large as this, the so-called “convergence” does not necessarily yield an optimal result in terms of modularity. For example, we see that the asynchronous implementation took on average merely 5 iterations to achieve maximum modularity, but has highly volatile results across runs, as depicted by the shaded area in the figure. On the other hand, the synchronous implementation reached maximum modularity much more slowly than the asynchronous version, but its performance is on average much more stable (its performance range is thus omitted). The performances of these two implementations are equally important to understand and utilise. Further discussion of the implications of these implementations and their use is given in Section IV.

FIG. 6: Average performances of asynchronous and synchronous LPA: modularity Q and number of communities against iteration (1 to 20). Values are averaged over 5 runs. The shaded area denotes the range of the performances of the asynchronous implementation.
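To make the distinction between the two implementations concrete, the Python sketch below performs one iteration of plain label propagation in either mode. It is an illustration only: the function name `lpa_step`, the adjacency-dict input and the random visiting order are assumptions for this example, and none of the heuristics of Section IV are included. In synchronous mode every node reads the labels from the previous iteration; in asynchronous mode a node sees labels already updated in the current iteration.

```python
import random
from collections import Counter


def lpa_step(neighbours, label, synchronous, rng):
    """One iteration of plain label propagation; returns the new labels."""
    previous = dict(label)  # snapshot read by the synchronous variant
    current = dict(label)   # updated in place, read by the asynchronous one
    order = list(neighbours)
    rng.shuffle(order)
    for i in order:
        view = previous if synchronous else current
        counts = Counter(view[j] for j in neighbours[i])
        if not counts:
            continue  # isolated node keeps its label
        top = max(counts.values())
        maximal = sorted(lab for lab, c in counts.items() if c == top)
        current[i] = rng.choice(maximal)  # ties are broken at random
    return current


# Two nodes joined by a single edge, each starting with its own label.
pair = {0: [1], 1: [0]}
rng = random.Random(0)
sync_result = lpa_step(pair, {0: 0, 1: 1}, True, rng)
async_result = lpa_step(pair, {0: 0, 1: 1}, False, rng)
```

On this two-node example the synchronous step merely swaps the two labels, whereas the asynchronous step leaves both nodes with the same label, whichever node happens to be visited first.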

Each single-threaded iteration finishes in a matter of tens of seconds, and thus, depending on the stopping criterion, it can take as little as 8 to 10 minutes to reach peak performance. Extrapolating the time required with respect to the number of edges, the algorithm without any optimization should be able to detect communities on a graph with 1 billion edges in less than 180 minutes, of a magnitude similar to that in [16].
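The extrapolation above can be checked with a short calculation. The sketch assumes that the running time scales linearly with the number of edges, in line with the O(m) cost per iteration, and that the number of iterations needed stays the same on the larger graph; both are assumptions of this example rather than measured results.

```python
sampled_edges = 58_793_458     # directed links in the sampled OSN
target_edges = 1_000_000_000   # hypothetical graph with 1 billion edges

# Linear scaling factor between the two graphs (about 17)
scale = target_edges / sampled_edges

# 8 to 10 minutes on the sample, scaled to the larger graph
minutes_low = 8 * scale
minutes_high = 10 * scale
print(f"scale = {scale:.1f}, estimate = {minutes_low:.0f} to {minutes_high:.0f} minutes")
```

Under these assumptions the estimate falls between roughly 136 and 170 minutes, consistent with the bound of 180 minutes quoted in the text.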

Fig. 7 shows the distribution of community/cluster sizes collected from a specific run of the asynchronous version of the algorithm when the modularity peaked at 0.638. The size distribution of communities within the OSN follows a 2-part power-law distribution in the complementary CDF, with an estimated coefficient of 1.1. The interested reader is referred to [10, 18] for discussions on the characteristics of different networks.

IV. A MORE RELIABLE AND EFFICIENT ALGORITHM

In this section we discuss potential modifications to the algorithm to increase its reliability, functionality and computational efficiency.

A. Hop Attenuation & Node Preference

Due to the “epidemic” nature of the algorithm, a major limitation is that a certain “label epidemic” can manage to “plague” a large number of nodes. To be exact, in some runs a single community of size over 500,000 (50% of the number of nodes) is formed, as opposed to the thousands of other counterparts whose sizes are on average in the order of 100s, greatly contributing to the modularity drop after the peak. We conjecture that this is partially due to the asynchronous nature of the algorithm and the initial formation of communities, where certain communities do not form strong enough links to prevent a foreign “epidemic” from sweeping through. Further experiments confirm that the synchronous version of the algorithm slows down the formation of such “monster” communities but does not prevent them.

FIG. 7: The community-size distribution of communities uncovered by the algorithm, which follows a 2-part power law. [Log-log plot of P(S > s) against community size s.]

We propose an extension to this algorithm by adding a score associated with the label, which decreases as the label traverses from its origin. A node is initially given a score of 1.0 for its label. After a node i has collected from its neighbourhood $N_i$ all the respective labels and scores, the calculation of the new maximal label $L'_i$ can be generalised by

$$L'_i = \operatorname*{argmax}_{L} \sum_{i' \in N_i} s_{i'}(L_{i'}) \cdot f(i')^{m} \cdot w_{i'i} \qquad (2)$$

where $L_i$ is the label of node i, $s_i(L)$ is the hop score of label L in i, $w_{i'i}$ is the weight of the edge between i' and i (we sum the weights in both directions if the graph is directed), and f(i) is any arbitrary comparable characteristic of node i. For each candidate label L, only the neighbours i' with $L_{i'} = L$ contribute to the sum. For instance, if we define f(i) = Deg(i), then when m > 0 more preference is given to nodes with more neighbours, and when m < 0 less. The final step is to assign a new attenuated score s' to the new label L' of i by subtracting the hop attenuation δ, 0 < δ < 1:

$$s'_i(L'_i) = \Big( \max_{i' \in N_i(L'_i)} s_{i'}(L'_i) \Big) - \delta \qquad (3)$$

where $N_i(L)$ is the set of neighbours of i that have label L. The value δ governs how far a particular label can spread as a function of the geodesic distance from its origin. This additional parameter adds extra uncertainty to the algorithm but may encourage stronger local communities to form before a large cluster starts to dominate. Ideally, the selection of δ can even be adaptive to the current iteration number, the neighbourhood of the node concerned and perhaps some a priori network parameters. We investigate the use of a varying δ in the next section and assume here a constant value for δ. Note that this setting may induce a negative feedback loop; we therefore let δ = 0 if the selected label is equal to the current label.
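As an illustration, the update rule of eqs. (2) and (3) with f(i) = Deg(i) can be sketched in Python as follows. This is one literal reading of the equations, not the authors' implementation: the function name `update_node`, the dictionary-based inputs, the default edge weight of 1, the deterministic tie-breaking and the decision to ignore neighbours whose score has dropped to zero or below are all assumptions made for this example.

```python
from collections import defaultdict


def update_node(i, neighbours, label, score, degree, weight, m=0.1, delta=0.05):
    """Return (new_label, new_score) for node i, following eqs. (2)-(3)."""
    votes = defaultdict(float)  # eq. (2): weighted vote per candidate label
    top_score = {}              # eq. (3): best neighbour score per label
    for j in neighbours[i]:
        if score[j] <= 0:
            continue  # assumption: an exhausted label no longer spreads
        lab = label[j]
        votes[lab] += score[j] * degree[j] ** m * weight.get((j, i), 1.0)
        top_score[lab] = max(top_score.get(lab, 0.0), score[j])
    if not votes:
        return label[i], score[i]
    # Ties resolve to the first label seen here; the paper picks randomly.
    new_label = max(votes, key=votes.get)
    # No attenuation when the node keeps its current label.
    attenuation = 0.0 if new_label == label[i] else delta
    return new_label, top_score[new_label] - attenuation


neighbours = {0: [1, 2, 3]}
label = {0: "c", 1: "a", 2: "a", 3: "b"}
score = {0: 1.0, 1: 1.0, 2: 0.9, 3: 1.0}
degree = {0: 3, 1: 1, 2: 1, 3: 5}
weight = {}  # unweighted graph: every edge falls back to weight 1

plain = update_node(0, neighbours, label, score, degree, weight, m=0.0)
preferential = update_node(0, neighbours, label, score, degree, weight, m=1.0)
```

With m = 0 the two neighbours carrying label "a" win the vote, while with m = 1 the single high-degree neighbour carrying "b" prevails, which shows how the exponent m shifts preference towards well-connected nodes.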

As discussed, modularity has been widely used in the literature as a metric to contrast the community detection capabilities of different algorithms on real-world networks. Whilst high modularity indicates a significantly modularised structure relative to a randomised graph of the network concerned, the correspondence between high modularity and accurately partitioned communities is not well understood, due to the resolution limit of modularity. Here we attempt to contrast the behaviours of the algorithms on the OSN based on modularity, but shall not draw strong conclusions on the accuracy of the community detection for the above reasons. In Section V, a novel benchmark proposed by Lancichinetti et al. [19], capable of revealing the resolution limit of modularity-based algorithms, is used for further comparisons.

Fig. 8 depicts the average performance curves over 5 runs for both versions of the algorithm, applying hop attenuation and preferential linkage. The results suggest that, in both implementations, a slight but not too high preference for high-degree nodes (m > 0) can speed up the process of achieving peak modularity on the OSN, but also gives rise to a steeper drop, as shown in Fig. 8(a). We believe, however, that different magnitudes of m simply restrict the choice of nodes to different subsets, some of which may contribute to a “global pandemic” and some of which may not. Simply using the degree of a node may not be a heuristic generic enough for different networks. Further study is required to understand, if at all possible, how to deduce a generic preference on neighbourhood labels every iteration without resorting to a global metric, which is costly. Nonetheless, we show that giving preference to certain nodes over others when deciding between labels to accept can be beneficial in terms of the number of iterations needed to achieve maximum modularity.

Looking at hop attenuation, we find that the application of δ indeed deters the occurrence of the “monster clusters”, as expected, thereby preventing the modularity drop after a certain number of iterations. But it was also obvious that high hop attenuation prevented the healthy growth of the communities and restricted the increase in modularity (cf. Fig. 8(b), 8(e)). Moreover, we conjecture that hop attenuation restrains the spread of a label from an arbitrary center, thereby favouring the formation of circular clusters. This suppression of non-circular clusters may lead to the suboptimal performance in terms of modularity shown in the asynchronous case (Fig. 8(e)). Finally, from Fig. 8(c) and 8(f), we see that combining both parameters on average benefits both versions of the algorithm in achieving a community partitioning of high modularity more efficiently and consistently.

FIG. 8: Average performance comparisons of the synchronous and asynchronous implementations with varying δ and m over 5 runs. [Panels (a) to (c), synchronous, and (d) to (f), asynchronous, plot modularity Q against iteration: (a), (d) m = 0, 0.1, 0.2 and -0.1; (b), (e) δ = 0, 0.05, 0.1 and 0.2; (c), (f) the combinations (m = 0, δ = 0), (m = 0.1, δ = 0.05), (m = 0.05, δ = 0.1) and (m = 0.1, δ = 0.1).]

B. Hierarchical & Overlapping Communities

Communities in certain networks are known to be hierarchical. For instance, students in the same classes often form strong local communities, while these communities, say of the same school, in turn form a larger but relatively weaker community. As discussed in Section II, most CNM-based algorithms are inherently hierarchical, since communities are agglomerated by greedy local optimization of modularity gain.

We present two simple modifications to the original method to enable the detection of hierarchical communities. Firstly, let us consider the application of hop attenuation to label propagation. Suppose we impose a very high hop attenuation at the beginning; we then expect communities of small diameter to form. If we then gradually relax the attenuation value, we should expect these small communities to merge into larger ones. In order to achieve this, we modify eq. (3) as follows:

$$s'_i(L_i) = 1 - \delta\big(d_G(O(L_i), i)\big) \qquad (4)$$

where

$$d_G(O(L_i), i) = 1 + \min_{i' \in N_i(L_i)} d_G(O(L_i), i') \qquad (5)$$

Essentially, instead of receiving the current hop scores from the neighbourhood and carrying out a subtraction, the score is now determined by the actual geodesic distance ($d_G$) from the origin of label L, denoted by O(L), and the function δ. This gives greater flexibility to δ in terms of geodesic distances and can facilitate iteration-dependent hop attenuation, as required here, at slight extra computational cost.
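The distance-based score of eqs. (4) and (5) can be sketched in Python as follows. The function name `distance_score`, the `dist` dictionary holding each neighbour's current distance to the origin of its label, and the linear attenuation function used in the example are hypothetical choices made for illustration; the text leaves the form of δ open.

```python
def distance_score(i, new_label, neighbours, label, dist, delta):
    """Return (distance, score) of node i for new_label, per eqs. (4)-(5).

    dist[j] is the geodesic distance from the origin of label[j] to j
    (0 at the origin itself); delta maps a distance to an attenuation.
    """
    # new_label is chosen among the neighbours' labels, so hops is non-empty.
    hops = [dist[j] for j in neighbours[i] if label[j] == new_label]
    d = 1 + min(hops)             # eq. (5)
    return d, 1.0 - delta(d)      # eq. (4)


neighbours = {0: [1, 2, 3]}
label = {1: "a", 2: "a", 3: "b"}
dist = {1: 2, 2: 1, 3: 0}


def strong(d):
    return 0.25 * d  # hypothetical high attenuation for early iterations


def relaxed(d):
    return 0.0       # attenuation fully relaxed in later iterations


early = distance_score(0, "a", neighbours, label, dist, strong)
late = distance_score(0, "a", neighbours, label, dist, relaxed)
```

Swapping the attenuation function between iterations, as done here with `strong` and `relaxed`, is one way to realise the gradual relaxation described above.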

Our second proposal is inspired from [16] where wecan similarly treat newly combined communities as a sin-gle node and use the number of inter-community edgesas the weight of edges between these ldquofresh condensedrdquonodes Instead of doing this every iteration we can applycertain amount of hop attenuation or hard limit in termsof the diameter of the community and do this after anequilibrium is reached

Fig 9 gives an illustration of the first modificationapplied on a subgraph on the OSN Note that this mod-ification depends very much on the initial labelling of

7

nodes because it determines the initial centers of thesesmall communities

FIG 9 (Color online) Community detection in the OSN(n=3000) by gradually decreasing hop attenuation (δ = 05at the top with Q = 064 δ = 0 at the bottom with Q = 078)Nodes with 3 or less neighbours are filtered to ease the visu-alisation

Another important question which was also briefly ad-dressed in [4] is the problem of overlapping communi-ties [20] ie nodes can often be considered a memberof different communities From previous sections we un-derstood that different asynchronous version of the al-gorithm is capable of generating very different results indifferent runs This is exactly how [4] suggested as a po-tential solution - to re-run the algorithm several timesIn a parallel environment however the results tend to bemuch less fluctuating An initial attempt was to increasethe number of labels passed each time between nodes toachieve a similar effect Preliminary experiments indi-cate limited success since this setting hampers the con-vergence process possibly due to the potential of latentlabels switching back and fro in the system Another pos-sibility is the exploit the fact that nodes on the borderof its community have different proportions (purity) ofneighbours from other communities We can potentiallyuse that as a measure of membership but this indeed mayonly be applicable to such boundary nodes

C Optimization

The individual inspection of every node particularlythose with many neighbours is a crucial factor in deter-mining the speed of the algorithm Putting aside efficientdata structures and prudent programming an obvious

Iteration

Nodes avoided (time saved) 806040

Abs Diff in Q 806040

0

10

20

30

40

50

60

70

80

90

100

1 2 3 4 5 6 7 8 9 10

FIG 10 The difference in modularity and speed of theoptimized modifications with the original

optimization we can do without much compromise on theperformance is to selectively update high degree nodesThe reader may have realised that after certain itera-tions it would be pointless to update certain nodes thatare well inside a cluster These nodes are surrounded bynodes with the same label which are unlikely to changefor the same reason We employ a simple purity mea-sure of neighbours to selectively update nodes that areon the borders of their communities In other words weonly update nodes whose number of neighbours sharingthe maximal label is less than a certain percentage In-deed small degree nodes are likely to be avoided in earlyiterations in this setting but their contributions to theoverall community structure and performance are almostinsignificant We carry out the modified algorithm withthresholds set at 100 (equivalent to the unmodified al-gorithm) 80 60 and 40 to examine the trade offbetween accuracy and speed

Figure 10 reveals that after the 1st iteration the ex-tra constraint will increasingly avoid updating nodes Asmore nodes settle in a more stable cluster increasinglyless amount of time will be required in an iteration In-terestingly even with a threshold as low as 40 the ab-solute difference in modularity compared to the originalsetting is reasonably small and we can see the overallrunning time can be significantly reduced

D Parallel amp Online Analysis

Clear advantages of label propagation include its easeto be parallelized and its potential online implementationin real time networks Since each node is required onlyto know information about its neighbours and updatesitself according to the common rules parallelism can beeasily achieved This brings us to another technical pointthat when the algorithm is completely parallelized evenwithout explicit synchronization it would tend to behave

8

like the synchronous version of the algorithm And this isthe key reason why we have stressed in the literature thatimproving both synchronous and asynchronous versionsof the algorithm are equally important

The running time in a parallel environment effectivelyreduces to k if there are Θ(n) machines This can beachieved in real world ubiquitous system such as a mo-bile ad-hoc network (MANET) or potentially on OSNsthemselves (if members are willing to contribute theircomputational power) in real time For instance socialinformation such as the community structure is known tobenefit routing in MANET [21] Moreover in such sce-narios the space requirement for storing link informationwould become decentralised and thus insignificant

On the same note we see great potential in adaptingthe algorithm for online community detection in real-timedynamic networks where the presence of nodes and edgesare constantly evolving The microscopic movements andintermittent presence of nodes contribute to changes interms of weights of the edges These in turn result in fivedistinct macroscopic behaviours of communities namelygrowth shrinkage union division and death of commu-nities The challenge indeed is to detect local changeswithout the need for a global update given limited com-putational resource or time constraint We believe labelpropagation is particularly suited in this paradigm andthus propose this as future work

V COMPARISONS

We first look at two relatively large and previouslystudied networks for comparisons These networks arerespectively the Amazon Purchasing Network analysedin [13] and the actor collaboration network [22] As donein [13] we assume all edges to be undirected to ease theanalysis With the added heuristics the algorithm is ableto perform within 5 of CNM and 10 of the adapta-tion by Danon Dıaz-Guilera and Arenas (CNM-DDA)[14] in terms of modularity (cf Table I) LPA how-ever achieves the result in a matter of minutes which isunparalleled by the above

For a more standardized comparison we turn to the re-cently proposed benchmark graphs by Lancichinetti etal [19] an extension to the well known GN benchmark[12] which incorporates more realistic scale-free degreeand cluster-size distributions We follow closely the im-plementation of the benchmark graphs as described in[19] and compare the original LPA with the improvedversion on the graphs of size 1000 and 5000 To contrastlabel propagation with general fast modularity maximi-sation algorithms we also run the benchmarks on theCNM algorithm

As shown in Fig 11 both implementations achieve su-perior accuracy over CNM in terms of normalised mutualinformation (NMI) even up to a mixing parameter of 06Interestingly the original method shows signs of failureat micro = 05 in the N = 1000 d = 50 benchmark graphs

(cf Fig 11(b)) We believe this corresponds to theformation of monster communities discussed in SectionIVA The number of nodes and the average degree ofthe benchmark graphs in effect dictate the number andsizes of the original communities generated The resultshence point out that denser and less modularized graphsare relatively prone to the formation of monster commu-nities However the application of hop attenuation asexemplified in Fig 11(b) greatly improves the overallperformance of LPA in such scenarios

Importantly as opposed to label propagation we cansee that CNM algorithmrsquos performance does not merelydepend on the mixing parameter but also the averagedegree of the network Resolution limit of modularitymaximization is reflected by CNMrsquos worse performancein graphs having a smaller average degree Although inmost configurations all algorithms expectedly manage touncover a modularity value of a similar magnitude thereal accuracy in terms of NMI does not follow Thisfinding corresponds to the notion in [17] that modularitymaximisation does not simply translate to actual com-munities

VI. CONCLUSIONS

In this paper, we have empirically analysed a scalable, efficient and accurate community detection algorithm. We discussed the behaviours, and emphasized the importance, of both the synchronous and asynchronous implementations of the algorithm. We suggested potential heuristics that can be applied to improve its average detection performance and adaptability. Most importantly, we contrasted the algorithm with modularity-gain based methods in terms of community detection accuracy and observed how it can potentially be applied online and concurrently in large-scale and real-time dynamic networks.

Understanding the dynamics of this algorithm would be the major future work in this area before one devises further heuristics to improve the algorithm. We believe that each notion discussed in Section IV is worthy of further inspection. An equally important point is to analyse, mathematically or empirically, how best to adapt the algorithm to different types of networks using the added heuristics. How do different network topologies and models affect the algorithm's convergence behaviour? These are all valuable questions to be investigated in future work.

In summary, we show that label propagation with the appropriate modifications is a more reliable and efficient method of detecting communities in large-scale networks than popular existing methods. We trust that, with further understanding and analysis, epidemic-based community detection will be of substantial value to the field.


Network                     Size      Directed Links   Q (Claimed)             Peak Q (Sync)   Peak Q (Async)
Amazon Purchase (Mar '03)   409,687   4,929,260        0.745 [13]              0.724           0.727
Actor Collaboration         374,511   30,052,912       0.528 [4], 0.719 [14]   0.642           0.660

TABLE I: The results correspond to the peak modularity achieved in 10 iterations or less, with f = Deg and m = 0.1, and a gradually decreasing δ as discussed in Section IV B.

[Figure 11: four panels, (a) N = 1000, d = 15; (b) N = 1000, d = 50; (c) N = 5000, d = 15; (d) N = 5000, d = 50. Each panel plots modularity Q and NMI against the mixing parameter µ (0.1 to 0.6) for CNM, LPA and LPA-δ.]

FIG. 11: Average performance comparisons between the three algorithms in the benchmark graphs with size N and average degree d. Both versions of LPA here are asynchronous. LPA-δ implements a gradually decreasing δ as discussed in Section IV B. All benchmark graphs have power-law degree and cluster-size distributions with exponents 3 and 2 respectively. For N = 1000, the results are the average over 100 realisations; for N = 5000, over 10 realisations.

Acknowledgments

We thank Franco Bagnoli and Vito Latora for helpful comments. We are grateful to Eric Promislow for providing us with the Amazon network data. Network visualisations are carried out on Cytoscape [24]. This project is supported by EC IST SOCIALNETS - Grant agreement number 217141.

[1] J. Kleinberg, in STOC '00: Proceedings of the thirty-second annual ACM symposium on Theory of computing (ACM, New York, NY, USA, 2000), pp. 163–170, ISBN 1-58113-184-4.
[2] D. J. Watts and S. H. Strogatz, Nature (London) 393, 440 (1998).
[3] A. Mislove, M. Marcon, K. P. Gummadi, P. Druschel, and B. Bhattacharjee, in IMC '07: Proceedings of the 7th ACM SIGCOMM conference on Internet measurement (ACM, New York, NY, USA, 2007), pp. 29–42.
[4] U. N. Raghavan, R. Albert, and S. Kumara, Phys. Rev. E 76, 036106 (2007).
[5] D. Lusseau and M. E. J. Newman, Proc. R. Soc. London B 271, S477 (2004).
[6] G. W. Flake, S. Lawrence, et al., IEEE Computer 35, 66 (2002).
[7] E. Ravasz, A. L. Somera, D. A. Mongru, Z. N. Oltvai, and A. L. Barabási, Science 297, 1551 (2002).
[8] M. E. J. Newman, Eur. Phys. J. B 38, 321 (2004).
[9] L. Danon, J. Duch, et al., J. Stat. Mech. p. P09008 (2005).
[10] S. Boccaletti, V. Latora, Y. Moreno, M. Chavez, and D.-U. Hwang, Phys. Rep. 424, 175 (2006).
[11] F. Radicchi, C. Castellano, F. Cecconi, V. Loreto, and D. Parisi, PNAS 101, 2658 (2004).
[12] M. E. Newman and M. Girvan, Phys. Rev. E 69, 026113 (2004).
[13] A. Clauset, M. E. J. Newman, and C. Moore, Phys. Rev. E 70, 066111 (2004).
[14] L. Danon, A. Díaz-Guilera, and A. Arenas, J. Stat. Mech. 2006, P11010 (2006).
[15] K. Wakita and T. Tsurumi, in WWW '07: Proceedings of the 16th international conference on World Wide Web (ACM, New York, NY, USA, 2007), pp. 1275–1276.
[16] V. D. Blondel, J.-L. Guillaume, R. Lambiotte, and E. Lefebvre, J. Stat. Mech. 10, P10008 (2008), 0803.0476.
[17] S. Fortunato and M. Barthélemy, PNAS 104, 36 (2007).
[18] R. Albert and A.-L. Barabási, Rev. Mod. Phys. 74, 47 (2002).
[19] A. Lancichinetti, S. Fortunato, and F. Radicchi, Phys. Rev. E 78, 046110 (2008).
[20] G. Palla, I. Derényi, et al., Nature 435, 814 (2005), ISSN 0028-0836.
[21] P. Hui, J. Crowcroft, and E. Yoneki, in MobiHoc '08: Proceedings of the 9th ACM international symposium on Mobile ad hoc networking and computing (ACM, New York, NY, USA, 2008), pp. 241–250, ISBN 978-1-60558-073-9.
[22] A.-L. Barabási and R. Albert, Science 286, 509 (1999).
[23] See the Facebook.com Six Degrees Project.
[24] http://www.cytoscape.org

[Figure 7: log-log plot of P(S > s) against community size s.]

FIG. 7: The community-size distribution of communities uncovered by the algorithm, which follows a 2-part power law.

over 500,000 (50% of the number of nodes) is formed, as opposed to the thousand other counterparts whose sizes are typically in the hundreds, greatly contributing to the modularity drop after the peak. We conjecture that this is partially due to the asynchronous nature of the algorithm and the initial formation of communities, where certain communities do not form strong enough links to prevent a foreign “epidemic” from sweeping through. Further experiments confirm that the synchronous version of the algorithm slows down the formation of such “monster” communities but does not prevent them.

We propose an extension to this algorithm by adding a score associated with the label, which decreases as the label travels away from its origin. A node is initially given a score of 1.0 for its label. After a node $i$ has collected from its neighbourhood $N_i$ all the respective labels and scores, the calculation of the new maximal label $L'$ can be generalised by

$$ L'_i = \operatorname*{argmax}_{L} \sum_{i' \in N_i} s_{i'}(L_{i'}) \cdot f(i')^{m} \cdot w_{i'i} \qquad (2) $$

where, for each candidate label $L$, the sum counts only those neighbours $i'$ whose label $L_{i'}$ equals $L$; $L_i$ is the label of node $i$, $s_i(L)$ is the hop score of label $L$ in $i$, $w_{i'i}$ is the weight of the edge between $i'$ and $i$ (we sum the weights in both directions if the graph is directed) and $f(i)$ is any arbitrary comparable characteristic of node $i$. For instance, if we define $f(i) = \mathrm{Deg}(i)$, then $m > 0$ gives more preference to nodes with more neighbours, and $m < 0$ gives less. The final step is to assign a new attenuated score $s'$ to the new label $L'$ of $i$ by subtracting the hop attenuation $\delta$, $0 < \delta < 1$:

$$ s'_i(L'_i) = \left( \max_{i' \in N_i(L'_i)} s_{i'}(L_{i'}) \right) - \delta \qquad (3) $$

where $N_i(L)$ is the set of neighbours of $i$ that have label $L$. The value δ governs how far a particular label can spread as a function of the geodesic distance from its origin. This additional parameter adds extra uncertainty to the algorithm but may encourage a stronger local community to form before a large cluster starts to dominate. Ideally, the selection of δ can even be adaptive to the current iteration number, the neighbourhood of the node concerned and perhaps some a priori network parameters. We investigate the use of a varying δ in the next section and assume here a constant value for δ. Note that this setting may induce a negative feedback loop; we therefore let δ = 0 if the selected label is equal to the current label.
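The update rule of Eqs. (2) and (3) can be sketched in a few lines of Python. This is only an illustrative reading of the text, not the authors' implementation: the function name, the dict-of-dicts adjacency format, the stopping rule and the decision to ignore neighbours whose score has dropped to zero or below are all assumptions made for the example. The rule "δ = 0 if the selected label is equal to the current label" is applied literally.

```python
import random
from collections import defaultdict


def lpa_hop_attenuation(adj, delta=0.1, m=0.0, max_iter=20, seed=0):
    """Asynchronous label propagation with hop attenuation (sketch).

    adj maps each node to a dict {neighbour: edge weight} and is assumed
    to be symmetric (undirected). Returns the (label, score) dicts.
    """
    rng = random.Random(seed)
    label = {v: v for v in adj}    # every node starts with a unique label
    score = {v: 1.0 for v in adj}  # ... and a hop score of 1.0
    order = list(adj)
    for _ in range(max_iter):
        rng.shuffle(order)         # asynchronous: random order, in-place updates
        changed = False
        for i in order:
            totals = defaultdict(float)  # Eq. (2): one running sum per label
            best = {}                    # highest neighbour score per label
            for j, w in adj[i].items():
                if score[j] <= 0.0:
                    continue             # assumption: exhausted labels stop spreading
                lab = label[j]
                # f(j) = Deg(j), raised to the preference exponent m
                totals[lab] += score[j] * (len(adj[j]) ** m) * w
                best[lab] = max(best.get(lab, 0.0), score[j])
            if not totals:
                continue
            top = max(totals.values())
            # ties between maximal labels are broken at random
            new_label = rng.choice([l for l, t in totals.items() if t == top])
            # Eq. (3), with delta set to 0 when the label does not change
            attenuation = 0.0 if new_label == label[i] else delta
            changed = changed or new_label != label[i]
            label[i] = new_label
            score[i] = best[new_label] - attenuation
        if not changed:            # assumed stopping rule: a sweep with no change
            break
    return label, score
```

With m = 0 and delta = 0 the sketch reduces to plain label propagation with unit scores; raising delta limits how many hops a label can travel before its score is used up.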

As discussed, modularity has been widely used in the literature as a metric to contrast the community detection capabilities of different algorithms on real-world networks. Whilst high modularity indicates a significantly modularised structure compared with a randomised graph of the network concerned, the correspondence between high modularity and accurately partitioned communities is not well understood, due to the resolution limit of modularity. Here, we attempt to contrast the behaviours of the algorithms on the OSN based on modularity, but shall not draw strong conclusions on the accuracy of the community detection for the above reasons. In Section V, a novel benchmark proposed by Lancichinetti et al. [19], capable of revealing the resolution limit of modularity-based algorithms, is used for further comparisons.

Fig. 8 depicts the average performance curves over 5 runs for both versions of the algorithm, applying hop attenuation and preferential linkage. The results suggest that, in both implementations, a slight (but not too strong) preference for high-degree nodes (m > 0) can speed up the process of achieving peak modularity on the OSN, but also gives rise to a steeper drop, as shown in Fig. 8(a). We believe, however, that different magnitudes of m simply restrict the choice of nodes to different subsets, some of which may contribute to a “global pandemic” and some of which may not. Simply using the degree of a node may not be a heuristic generic enough for different networks. Further study is required to understand, if at all possible, how to deduce a generic preference on neighbourhood labels every iteration without resorting to a costly global metric. Nonetheless, we show that giving preference to certain nodes over others when deciding between labels to accept can be beneficial in terms of the number of iterations needed to achieve maximum modularity.

Looking at hop attenuation, we find that the application of δ indeed deters the occurrence of the “monster clusters” as expected, thereby preventing the modularity drop after certain iterations. But it was also obvious that high hop attenuation prevented the healthy growth of the communities and restricted the increase in modularity (cf. Fig. 8(b), 8(e)). Moreover, we conjecture that hop attenuation restrains the spread of a label to a limited distance from an arbitrary center, thereby favouring the formation of circular clusters. This suppression of non-circular clusters may lead to the suboptimal performance in terms of modularity, as shown in the asynchronous case (Fig. 8(e)). Finally, from Fig. 8(c) and 8(f), we see that combining both parameters on average benefits both versions of the algorithm in achieving a community partitioning of high modularity more efficiently and consistently.

[Figure 8: six panels of average modularity Q (0 to 0.7) against iteration (2 to 20). Synchronous: (a) m = 0, 0.1, 0.2, -0.1; (b) δ = 0, 0.05, 0.1, 0.2; (c) (m, δ) = (0, 0), (0.1, 0.05), (0.05, 0.1), (0.1, 0.1). Asynchronous: (d), (e), (f) with the same parameter sets as (a), (b), (c).]

FIG. 8: Average performance comparisons of the synchronous and asynchronous implementations with varying δ and m over 5 runs.

B. Hierarchical & Overlapping Communities

Communities in certain networks are known to be hierarchical. For instance, students in the same classes often form strong local communities, while these communities, say of the same school, in turn form a larger but relatively weaker community. As discussed in Section II, most CNM-based algorithms are inherently hierarchical, since communities are agglomerated by greedy local optimization of modularity gain.

We present two simple modifications to the original method to enable the detection of hierarchical communities. Firstly, let us consider the application of hop attenuation to label propagation. If we impose a very high hop attenuation at the beginning, we expect communities of small diameter to form. If we then gradually relax the attenuation value, we should expect these small communities to merge into larger ones. In order to achieve this, we modify Eq. (3) as follows:

$$ s'_i(L_i) = 1 - \delta\big(d_G(O(L_i), i)\big) \qquad (4) $$

where

$$ d_G(O(L_i), i) = 1 + \min_{i' \in N_i(L_i)} d_G(O(L_i), i') \qquad (5) $$

Essentially, instead of receiving the current hop scores from the neighbourhood and carrying out a subtraction, the score is now determined by the actual geodesic distance ($d_G$) from the origin of label $L$, denoted by $O(L)$, and the function δ. This gives greater flexibility to δ in terms of geodesic distances and can facilitate iteration-dependent hop attenuation, as required here, at a slight extra computational cost.
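As a small illustration of Eqs. (4) and (5), the following Python sketch computes the distance and score of a node that adopts a label. The helper names, the plain dicts used for labels and distances, and the linear attenuation function in the usage lines are hypothetical choices made for the example; the text leaves the form of δ open.

```python
def updated_distance(i, lab, adj, label, dist):
    """Eq. (5): one hop more than the closest neighbour already carrying lab.

    dist[j] is the geodesic distance of node j from the origin of its
    current label (0 for the origin itself).
    """
    return 1 + min(dist[j] for j in adj[i] if label[j] == lab)


def hop_score(d, delta_fn):
    """Eq. (4): the score depends only on the distance d and the function delta."""
    return 1.0 - delta_fn(d)


# Path 0 - 1 - 2, where label 0 originated at node 0 and has reached node 1.
adj = {0: [1], 1: [0, 2], 2: [1]}
label = {0: 0, 1: 0, 2: 2}
dist = {0: 0, 1: 1, 2: 0}

d = updated_distance(2, 0, adj, label, dist)   # node 2 adopts label 0
s = hop_score(d, lambda hops: 0.25 * hops)     # assumed linear attenuation
```

Because `delta_fn` is passed in as a function, an iteration-dependent schedule can be obtained by supplying a different function at each iteration, for instance one whose slope shrinks over time.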

Our second proposal is inspired by [16], where we can similarly treat newly combined communities as a single node and use the number of inter-community edges as the weight of the edges between these “freshly condensed” nodes. Instead of doing this every iteration, we can apply a certain amount of hop attenuation, or a hard limit in terms of the diameter of the community, and do this after an equilibrium is reached.
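The condensation step described above can be sketched as follows. This is a minimal illustration under assumed data structures (a symmetric dict-of-dicts adjacency and a node-to-community dict); the function name `condense` is hypothetical and the sketch does not reproduce the procedure of [16].

```python
from collections import defaultdict


def condense(adj, label):
    """Collapse each community into one node of a weighted graph.

    adj maps node -> {neighbour: weight} (symmetric); label maps node ->
    community. The weight between two condensed nodes is the summed weight
    of the edges running between the two communities, which is the number
    of inter-community edges when every weight is 1.
    """
    super_adj = {c: defaultdict(float) for c in set(label.values())}
    for u, nbrs in adj.items():
        for v, w in nbrs.items():
            cu, cv = label[u], label[v]
            if cu != cv:                 # intra-community edges are dropped
                super_adj[cu][cv] += w
    return {c: dict(nbrs) for c, nbrs in super_adj.items()}
```

Label propagation can then be run again on the condensed graph, so that each pass yields one level of the hierarchy.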

Fig. 9 gives an illustration of the first modification applied to a subgraph of the OSN. Note that this modification depends very much on the initial labelling of nodes, because it determines the initial centers of these small communities.

FIG. 9: (Color online) Community detection in the OSN (n = 3000) by gradually decreasing hop attenuation (δ = 0.5 at the top with Q = 0.64; δ = 0 at the bottom with Q = 0.78). Nodes with 3 or fewer neighbours are filtered to ease the visualisation.

Another important question, which was also briefly addressed in [4], is the problem of overlapping communities [20], i.e. nodes can often be considered members of different communities. From previous sections, we understand that the asynchronous version of the algorithm is capable of generating very different results in different runs. This is exactly what [4] suggested as a potential solution: to re-run the algorithm several times. In a parallel environment, however, the results tend to fluctuate much less. An initial attempt was to increase the number of labels passed each time between nodes to achieve a similar effect. Preliminary experiments indicate limited success, since this setting hampers the convergence process, possibly due to latent labels switching back and forth in the system. Another possibility is to exploit the fact that nodes on the border of a community have different proportions (purity) of neighbours from other communities. We can potentially use that as a measure of membership, but this may only be applicable to such boundary nodes.

C. Optimization

The individual inspection of every node, particularly those with many neighbours, is a crucial factor in determining the speed of the algorithm. Putting aside efficient data structures and prudent programming, an obvious optimization we can make without much compromise on performance is to selectively update high-degree nodes. The reader may have realised that, after certain iterations, it would be pointless to update certain nodes that are well inside a cluster. These nodes are surrounded by nodes with the same label, which are unlikely to change for the same reason. We employ a simple purity measure of neighbours to selectively update nodes that are on the borders of their communities. In other words, we only update nodes for which the proportion of neighbours sharing the maximal label is less than a certain percentage. Indeed, small-degree nodes are likely to be avoided in early iterations in this setting, but their contributions to the overall community structure and performance are almost insignificant. We carry out the modified algorithm with thresholds set at 100% (equivalent to the unmodified algorithm), 80%, 60% and 40% to examine the trade-off between accuracy and speed.

[Figure 10: nodes avoided (time saved) and absolute difference in Q, on a vertical axis from 0 to 100, against iteration (1 to 10), for thresholds of 80%, 60% and 40%.]

FIG. 10: The difference in modularity and speed of the optimized modifications compared with the original.

Figure 10 reveals that, after the 1st iteration, the extra constraint increasingly avoids updating nodes. As more nodes settle into stable clusters, less and less time is required per iteration. Interestingly, even with a threshold as low as 40%, the absolute difference in modularity compared to the original setting is reasonably small, and the overall running time can be significantly reduced.
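The purity test used to skip interior nodes can be sketched as below. The function names, the unweighted adjacency lists and the strict "less than" comparison are assumptions made for the example; the text specifies only that a node is updated when the share of neighbours holding the maximal label falls below the threshold.

```python
from collections import Counter


def purity(i, adj, label):
    """Fraction of i's neighbours that carry the most common neighbour label."""
    nbrs = adj[i]
    counts = Counter(label[j] for j in nbrs)
    return max(counts.values()) / len(nbrs)


def needs_update(i, adj, label, threshold=0.6):
    """True if node i looks like a border node and should be re-evaluated."""
    if not adj[i]:
        return False                     # isolated nodes never change label
    return purity(i, adj, label) < threshold


# Node 0 has three neighbours labelled "a" and one labelled "b": purity 0.75.
adj = {0: [1, 2, 3, 4]}
label = {0: "x", 1: "a", 2: "a", 3: "a", 4: "b"}
```

In each sweep the propagation loop would call `needs_update` first and leave the node untouched when it returns False, which is where the time saving reported in Fig. 10 would come from.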

D. Parallel & Online Analysis

Clear advantages of label propagation include the ease with which it can be parallelized and its potential for online implementation in real-time networks. Since each node is only required to know information about its neighbours and updates itself according to the common rules, parallelism can be easily achieved. This brings us to another technical point: when the algorithm is completely parallelized, even without explicit synchronization, it would tend to behave like the synchronous version of the algorithm. This is the key reason why we have stressed in this paper that improving both the synchronous and asynchronous versions of the algorithm is equally important.

The running time in a parallel environment effectively reduces to k if there are Θ(n) machines. This can be achieved in real-world ubiquitous systems such as a mobile ad-hoc network (MANET), or potentially on OSNs themselves (if members are willing to contribute their computational power), in real time. For instance, social information such as the community structure is known to benefit routing in MANETs [21]. Moreover, in such scenarios, the space requirement for storing link information would become decentralised and thus insignificant.

On the same note, we see great potential in adapting the algorithm for online community detection in real-time dynamic networks, where the presence of nodes and edges is constantly evolving. The microscopic movements and intermittent presence of nodes contribute to changes in the weights of the edges. These in turn result in five distinct macroscopic behaviours of communities, namely growth, shrinkage, union, division and death of communities. The challenge, indeed, is to detect local changes without the need for a global update, given limited computational resources or time constraints. We believe label propagation is particularly suited to this paradigm and thus propose this as future work.

V. COMPARISONS

We first look at two relatively large and previously studied networks for comparison. These networks are, respectively, the Amazon purchasing network analysed in [13] and the actor collaboration network [22]. As done in [13], we assume all edges to be undirected to ease the analysis. With the added heuristics, the algorithm is able to perform within 5% of CNM and 10% of the adaptation by Danon, Díaz-Guilera and Arenas (CNM-DDA) [14] in terms of modularity (cf. Table I). LPA, however, achieves the result in a matter of minutes, which is unparalleled by the above.

For a more standardized comparison, we turn to the recently proposed benchmark graphs by Lancichinetti et al. [19], an extension of the well-known GN benchmark [12] which incorporates more realistic scale-free degree and cluster-size distributions. We follow closely the implementation of the benchmark graphs as described in [19] and compare the original LPA with the improved version on graphs of size 1000 and 5000. To contrast label propagation with general fast modularity maximisation algorithms, we also run the benchmarks on the CNM algorithm.

As shown in Fig. 11, both implementations achieve superior accuracy over CNM in terms of normalised mutual information (NMI), even up to a mixing parameter of 0.6. Interestingly, the original method shows signs of failure at µ = 0.5 in the N = 1000, d = 50 benchmark graphs (cf. Fig. 11(b)). We believe this corresponds to the formation of monster communities discussed in Section IV A. The number of nodes and the average degree of the benchmark graphs in effect dictate the number and sizes of the original communities generated. The results hence indicate that denser and less modularized graphs are relatively prone to the formation of monster communities. However, the application of hop attenuation, as exemplified in Fig. 11(b), greatly improves the overall performance of LPA in such scenarios.

Importantly, as opposed to label propagation, we can see that the CNM algorithm's performance depends not merely on the mixing parameter but also on the average degree of the network. The resolution limit of modularity maximization is reflected by CNM's worse performance in graphs having a smaller average degree. Although in most configurations all algorithms expectedly manage to uncover a modularity value of a similar magnitude, the real accuracy in terms of NMI does not follow. This finding corresponds to the notion in [17] that modularity maximisation does not simply translate to actual communities.
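For readers who wish to reproduce an accuracy score of this kind, the sketch below computes NMI between a planted partition and a detected one. It assumes the commonly used normalisation 2 I(A; B) / (H(A) + H(B)); the text does not spell out the formula, so this choice, the function name and the dict-based partition format are assumptions, as is the convention of returning 1.0 when both partitions consist of a single community.

```python
import math
from collections import Counter


def nmi(part_a, part_b):
    """Normalised mutual information between two partitions of the same nodes.

    part_a and part_b map node -> community label.
    """
    n = len(part_a)
    size_a = Counter(part_a.values())
    size_b = Counter(part_b.values())
    joint = Counter((part_a[v], part_b[v]) for v in part_a)
    # mutual information I(A; B) from the confusion counts
    mutual = sum(
        (n_ab / n) * math.log(n_ab * n / (size_a[a] * size_b[b]))
        for (a, b), n_ab in joint.items()
    )
    # entropies H(A) and H(B) of the two partitions
    h_a = -sum((c / n) * math.log(c / n) for c in size_a.values())
    h_b = -sum((c / n) * math.log(c / n) for c in size_b.values())
    if h_a + h_b == 0.0:
        return 1.0                       # assumed convention for trivial partitions
    return 2.0 * mutual / (h_a + h_b)
```

Two partitions that differ only in the names of their communities score 1, whereas two statistically independent partitions score 0, which is why NMI can disagree with modularity as a measure of accuracy.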


6

SynchronousM

odul

arity

Q

Avg Q(m = 0)m = 01m = 02

m = -01

0

01

02

03

04

05

06

07

2 4 6 8 10 12 14 16 18 20

(a)

Avg Q(δ = 0)δ = 005

δ = 01δ = 02

0

01

02

03

04

05

06

07

2 4 6 8 10 12 14 16 18 20

(b)

Avg Q(m = 0 δ = 0)m = 01 δ = 005m = 005 δ = 01m = 01 δ = 01

0

01

02

03

04

05

06

07

2 4 6 8 10 12 14 16 18 20

(c)

Asynchronous

Modula

rity

Q

Iteration

Avg Q(m = 0)m = 01m = 02

m = -01

0

01

02

03

04

05

06

07

2 4 6 8 10 12 14 16 18 20

(d)

Iteration

Avg Q(δ = 0)δ = 005

δ = 01δ = 02

0

01

02

03

04

05

06

07

2 4 6 8 10 12 14 16 18 20

(e)

Iteration

Avg Q(m = 0 δ = 0)m = 01 δ = 005m = 005 δ = 01m = 01 δ = 01

0

01

02

03

04

05

06

07

2 4 6 8 10 12 14 16 18 20

(f)

FIG 8 Average performance comparisons of the synchronous and asynchronous implementations with varying δ and m over5 Runs

case (Fig 8(e)) Finally from Fig 8(c) and 8(f) wesee that combining both parameters on average bene-fits both versions of the algorithm in achieving a commu-nity partitioning of high modularity more efficiently andconsistently

B Hierarchical amp Overlapping Communities

Communities in certain networks are known to be hier-archical For instance students in the same classes oftenform some strong local communities while these commu-nities say of the same school in turn form a larger butrelatively weaker community As discussed in Section IImost CNM-based algorithms are inherently hierarchicalsince communities are agglomerated by greedy local op-timization of modularity gain

We present two simple modifications to the originalmethod to enable the detection of hierarchical commu-nities Firstly let us consider the application of hop at-tenuation on label propagation Suppose we impose avery high hop attenuation at the beginning we expectcommunities of small diameter to form If we then grad-ually relax the attenuation value we should expect thesesmall communities to merge into larger ones In order toachieve this we modify eq (3) as follows

sprimei(Li) = 1 minus δ(dG(O(Li) i)) (4)

where

dG(O(Li) i) = 1 + miniprimeisinNi(Li)

dG(O(Li) iprime) (5)

Essentially instead of receiving the current hop scoresfrom the neighbourhood and carry out a subtraction thescore is now determined by the actual geodesic distance(dG) from the label Lrsquos origin denoted by O(L) and thefunction δ This gives greater flexibility of δ in terms ofgeodesic distances and can facilitate iteration-dependenthop attenuation as required here with slight extra com-putation cost

Our second proposal is inspired from [16] where wecan similarly treat newly combined communities as a sin-gle node and use the number of inter-community edgesas the weight of edges between these ldquofresh condensedrdquonodes Instead of doing this every iteration we can applycertain amount of hop attenuation or hard limit in termsof the diameter of the community and do this after anequilibrium is reached

Fig 9 gives an illustration of the first modificationapplied on a subgraph on the OSN Note that this mod-ification depends very much on the initial labelling of

7

nodes because it determines the initial centers of thesesmall communities

FIG 9 (Color online) Community detection in the OSN(n=3000) by gradually decreasing hop attenuation (δ = 05at the top with Q = 064 δ = 0 at the bottom with Q = 078)Nodes with 3 or less neighbours are filtered to ease the visu-alisation

Another important question which was also briefly ad-dressed in [4] is the problem of overlapping communi-ties [20] ie nodes can often be considered a memberof different communities From previous sections we un-derstood that different asynchronous version of the al-gorithm is capable of generating very different results indifferent runs This is exactly how [4] suggested as a po-tential solution - to re-run the algorithm several timesIn a parallel environment however the results tend to bemuch less fluctuating An initial attempt was to increasethe number of labels passed each time between nodes toachieve a similar effect Preliminary experiments indi-cate limited success since this setting hampers the con-vergence process possibly due to the potential of latentlabels switching back and fro in the system Another pos-sibility is the exploit the fact that nodes on the borderof its community have different proportions (purity) ofneighbours from other communities We can potentiallyuse that as a measure of membership but this indeed mayonly be applicable to such boundary nodes

C Optimization

The individual inspection of every node particularlythose with many neighbours is a crucial factor in deter-mining the speed of the algorithm Putting aside efficientdata structures and prudent programming an obvious

Iteration

Nodes avoided (time saved) 806040

Abs Diff in Q 806040

0

10

20

30

40

50

60

70

80

90

100

1 2 3 4 5 6 7 8 9 10

FIG 10 The difference in modularity and speed of theoptimized modifications with the original

optimization we can do without much compromise on theperformance is to selectively update high degree nodesThe reader may have realised that after certain itera-tions it would be pointless to update certain nodes thatare well inside a cluster These nodes are surrounded bynodes with the same label which are unlikely to changefor the same reason We employ a simple purity mea-sure of neighbours to selectively update nodes that areon the borders of their communities In other words weonly update nodes whose number of neighbours sharingthe maximal label is less than a certain percentage In-deed small degree nodes are likely to be avoided in earlyiterations in this setting but their contributions to theoverall community structure and performance are almostinsignificant We carry out the modified algorithm withthresholds set at 100 (equivalent to the unmodified al-gorithm) 80 60 and 40 to examine the trade offbetween accuracy and speed

Figure 10 reveals that after the 1st iteration the ex-tra constraint will increasingly avoid updating nodes Asmore nodes settle in a more stable cluster increasinglyless amount of time will be required in an iteration In-terestingly even with a threshold as low as 40 the ab-solute difference in modularity compared to the originalsetting is reasonably small and we can see the overallrunning time can be significantly reduced

D Parallel amp Online Analysis

Clear advantages of label propagation include its easeto be parallelized and its potential online implementationin real time networks Since each node is required onlyto know information about its neighbours and updatesitself according to the common rules parallelism can beeasily achieved This brings us to another technical pointthat when the algorithm is completely parallelized evenwithout explicit synchronization it would tend to behave

8

like the synchronous version of the algorithm And this isthe key reason why we have stressed in the literature thatimproving both synchronous and asynchronous versionsof the algorithm are equally important

The running time in a parallel environment effectively reduces to k if there are Θ(n) machines. This can be achieved in real-world ubiquitous systems such as a mobile ad-hoc network (MANET) or, potentially, on OSNs themselves (if members are willing to contribute their computational power) in real time. For instance, social information such as the community structure is known to benefit routing in MANETs [21]. Moreover, in such scenarios, the space requirement for storing link information would become decentralised and thus insignificant.

On the same note, we see great potential in adapting the algorithm for online community detection in real-time dynamic networks, where the presence of nodes and edges is constantly evolving. The microscopic movements and intermittent presence of nodes contribute to changes in the weights of the edges. These in turn result in five distinct macroscopic behaviours of communities, namely growth, shrinkage, union, division and death. The challenge, indeed, is to detect local changes without the need for a global update, given limited computational resources or time constraints. We believe label propagation is particularly suited to this paradigm and thus propose this as future work.

V. COMPARISONS

We first look at two relatively large and previously studied networks for comparison. These networks are, respectively, the Amazon purchasing network analysed in [13] and the actor collaboration network [22]. As done in [13], we assume all edges to be undirected to ease the analysis. With the added heuristics, the algorithm is able to perform within 5% of CNM and 10% of the adaptation by Danon, Díaz-Guilera and Arenas (CNM-DDA) [14] in terms of modularity (cf. Table I). LPA, however, achieves the result in a matter of minutes, which is unparalleled by the above.

For a more standardized comparison, we turn to the recently proposed benchmark graphs by Lancichinetti et al. [19], an extension to the well-known GN benchmark [12] which incorporates more realistic scale-free degree and cluster-size distributions. We follow closely the implementation of the benchmark graphs as described in [19] and compare the original LPA with the improved version on graphs of size 1000 and 5000. To contrast label propagation with general fast modularity maximisation algorithms, we also run the benchmarks on the CNM algorithm.

As shown in Fig. 11, both implementations achieve superior accuracy over CNM in terms of normalised mutual information (NMI), even up to a mixing parameter of 0.6. Interestingly, the original method shows signs of failure at µ = 0.5 in the N = 1000, d = 50 benchmark graphs (cf. Fig. 11(b)). We believe this corresponds to the formation of monster communities discussed in Section IV A. The number of nodes and the average degree of the benchmark graphs in effect dictate the number and sizes of the original communities generated. The results hence point out that denser and less modularized graphs are relatively prone to the formation of monster communities. However, the application of hop attenuation, as exemplified in Fig. 11(b), greatly improves the overall performance of LPA in such scenarios.

Importantly, as opposed to label propagation, we can see that the CNM algorithm's performance does not merely depend on the mixing parameter but also on the average degree of the network. The resolution limit of modularity maximization is reflected by CNM's worse performance in graphs having a smaller average degree. Although in most configurations all algorithms expectedly manage to uncover a modularity value of a similar magnitude, the real accuracy in terms of NMI does not follow. This finding corresponds to the notion in [17] that modularity maximisation does not simply translate to actual communities.
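For readers who wish to reproduce the accuracy measure, a minimal Python sketch of NMI between a planted partition and a detected one is given below. It assumes the common normalisation 2I(A;B) / (H(A) + H(B)); the function name nmi and the node-to-label dictionaries are hypothetical choices for illustration, and the exact variant used to produce Fig. 11 is not restated here.

```python
from collections import Counter
from math import log


def nmi(part_a, part_b):
    """Normalised mutual information between two partitions of the same nodes.

    part_a, part_b -- dicts mapping every node to a community label
    Assumes the normalisation 2 * I(A;B) / (H(A) + H(B)).
    """
    n = len(part_a)
    size_a = Counter(part_a.values())
    size_b = Counter(part_b.values())
    # Joint counts: how many nodes fall in community a of A and b of B.
    joint = Counter((part_a[v], part_b[v]) for v in part_a)

    mutual = sum(
        (c / n) * log(c * n / (size_a[a] * size_b[b]))
        for (a, b), c in joint.items()
    )
    h_a = -sum((c / n) * log(c / n) for c in size_a.values())
    h_b = -sum((c / n) * log(c / n) for c in size_b.values())
    if h_a + h_b == 0:
        return 1.0  # both partitions place every node in a single community
    return 2 * mutual / (h_a + h_b)
```

Under this definition, a detected partition that matches the planted one up to relabelling scores 1, while a single community that swallows the whole graph scores 0 against any non-trivial planted partition, regardless of the modularity value it happens to attain.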

VI. CONCLUSIONS

In this paper, we have empirically analysed a scalable, efficient and accurate community detection algorithm. We discussed the behaviours and emphasized the importance of both the synchronous and asynchronous implementations of the algorithm. We suggested potential heuristics that can be applied to improve its average detection performance and adaptability. Most importantly, we contrasted the algorithm with modularity-gain based methods in terms of community detection accuracy and observed how it can potentially be applied online and concurrently in large-scale and real-time dynamic networks.

Understanding the dynamics of this algorithm would be the major future work in this discipline before one devises further heuristics to improve it. We believe that each notion discussed in Section IV is worthy of further inspection. An equally important point is to analyse, mathematically or empirically, how best to adapt the algorithm to different types of networks using the added heuristics. How do different network topologies and models affect the algorithm's convergent behaviour? These are all valuable questions to be investigated in future work.

In summary, we show that label propagation, with the appropriate modifications, is a more reliable and efficient method for detecting communities in large-scale networks than popular existing methods. We trust that, with further understanding and analysis, epidemic-based community detection would be of substantial value to the field.


Network                    Size     Directed Links  Q (Claimed)            Peak Q (Sync.)  Peak Q (Async.)
Amazon Purchase (Mar'03)   409687   4929260         0.745 [13]             0.724           0.727
Actor Collaboration        374511   30052912        0.528 [4], 0.719 [14]  0.642           0.660

TABLE I: The results correspond to the peak modularity achieved in 10 iterations or less with f = Deg and m = 0.1 and a gradually decreasing δ, as discussed in Section IV B.
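As a quick check of the "within 5%" and "within 10%" statements, the relative gaps can be recomputed from the Table I entries. The short Python sketch below assumes that the Amazon comparison is against the CNM value 0.745 [13] and the actor comparison against the CNM-DDA value 0.719 [14], in both cases using the asynchronous peak; the helper name relative_gap is hypothetical.

```python
def relative_gap(reference_q, achieved_q):
    """Relative shortfall of the achieved modularity against a reference value."""
    return (reference_q - achieved_q) / reference_q


# Values read from Table I (asynchronous peaks); the pairing is an assumption.
amazon_gap = relative_gap(0.745, 0.727)  # CNM [13] vs. LPA, Amazon network
actor_gap = relative_gap(0.719, 0.660)   # CNM-DDA [14] vs. LPA, actor network

print(f"Amazon: {amazon_gap:.1%}, Actor: {actor_gap:.1%}")
# prints "Amazon: 2.4%, Actor: 8.2%"
```

Both gaps fall inside the stated bounds under this reading of the table.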

[Figure 11, four panels: (a) N = 1000, d = 15; (b) N = 1000, d = 50; (c) N = 5000, d = 15; (d) N = 5000, d = 50. Each panel plots modularity Q and NMI against the mixing parameter µ for CNM, LPA and LPA-δ.]

FIG. 11: Average performance comparisons between the three algorithms in the benchmark graphs with size N and average degree d. Both versions of LPA here are asynchronous. LPA-δ implements a gradually decreasing δ as discussed in Section IV B. All benchmark graphs have power-law degree and cluster-size distributions with exponents 3 and 2, respectively. For N = 1000, the results are the average over 100 realisations; for N = 5000, over 10 realisations.

Acknowledgments

We thank Franco Bagnoli and Vito Latora for helpful comments. We are grateful to Eric Promislow for providing us with the Amazon network data. Network visualisations are carried out on Cytoscape [24]. This project is supported by EC IST SOCIALNETS - Grant agreement number 217141.

[1] J. Kleinberg, in STOC '00: Proceedings of the thirty-second annual ACM symposium on Theory of computing (ACM, New York, NY, USA, 2000), pp. 163–170, ISBN 1-58113-184-4.
[2] D. J. Watts and S. H. Strogatz, Nature (London) 393, 440 (1998).
[3] A. Mislove, M. Marcon, K. P. Gummadi, P. Druschel, and B. Bhattacharjee, in IMC '07: Proceedings of the 7th ACM SIGCOMM conference on Internet measurement (ACM, New York, NY, USA, 2007), pp. 29–42.
[4] U. N. Raghavan, R. Albert, and S. Kumara, Phys. Rev. E 76, 036106 (pages 11) (2007).
[5] D. Lusseau and M. E. J. Newman, Proc. R. Soc. London B 271, S477 (2004).
[6] G. W. Flake, S. Lawrence, et al., IEEE Computer 35, 66 (2002).
[7] E. Ravasz, A. L. Somera, D. A. Mongru, Z. N. Oltvai, and A. L. Barabási, Science 297, 1551 (2002).
[8] M. E. J. Newman, Eur. Phys. J. B 38, 321 (2004).
[9] L. Danon, J. Duch, et al., J. Stat. Mech. p. P09008 (2005).
[10] S. Boccaletti, V. Latora, Y. Moreno, M. Chavez, and D.-U. Hwang, Phys. Rep. 424, 175 (2006).
[11] F. Radicchi, C. Castellano, F. Cecconi, V. Loreto, and D. Parisi, PNAS 101, 2658 (2004).
[12] M. E. Newman and M. Girvan, Phys. Rev. E 69, 026113 (2004).
[13] A. Clauset, M. E. J. Newman, and C. Moore, Phys. Rev. E 70, 066111 (2004).
[14] L. Danon, A. Díaz-Guilera, and A. Arenas, J. Stat. Mech. 2006, P11010 (2006).
[15] K. Wakita and T. Tsurumi, in WWW '07: Proceedings of the 16th international conference on World Wide Web (ACM, New York, NY, USA, 2007), pp. 1275–1276.
[16] V. D. Blondel, J.-L. Guillaume, R. Lambiotte, and E. Lefebvre, J. Stat. Mech. 10, P10008 (2008), 0803.0476.
[17] S. Fortunato and M. Barthélemy, PNAS 104, 36 (2007).
[18] R. Albert and A.-L. Barabási, Rev. Mod. Phys. 74, 47 (2002).
[19] A. Lancichinetti, S. Fortunato, and F. Radicchi, Phys. Rev. E 78, 046110 (2008).
[20] G. Palla, I. Derényi, et al., Nature 435, 814 (2005), ISSN 0028-0836.
[21] P. Hui, J. Crowcroft, and E. Yoneki, in MobiHoc '08: Proceedings of the 9th ACM international symposium on Mobile ad hoc networking and computing (ACM, New York, NY, USA, 2008), pp. 241–250, ISBN 978-1-60558-073-9.
[22] A.-L. Barabási and R. Albert, Science 286, 509 (1999).
[23] See the Facebook.com Six Degrees Project.
[24] http://www.cytoscape.org

optimization we can do without much compromise on theperformance is to selectively update high degree nodesThe reader may have realised that after certain itera-tions it would be pointless to update certain nodes thatare well inside a cluster These nodes are surrounded bynodes with the same label which are unlikely to changefor the same reason We employ a simple purity mea-sure of neighbours to selectively update nodes that areon the borders of their communities In other words weonly update nodes whose number of neighbours sharingthe maximal label is less than a certain percentage In-deed small degree nodes are likely to be avoided in earlyiterations in this setting but their contributions to theoverall community structure and performance are almostinsignificant We carry out the modified algorithm withthresholds set at 100 (equivalent to the unmodified al-gorithm) 80 60 and 40 to examine the trade offbetween accuracy and speed

Figure 10 reveals that after the 1st iteration the ex-tra constraint will increasingly avoid updating nodes Asmore nodes settle in a more stable cluster increasinglyless amount of time will be required in an iteration In-terestingly even with a threshold as low as 40 the ab-solute difference in modularity compared to the originalsetting is reasonably small and we can see the overallrunning time can be significantly reduced

D Parallel amp Online Analysis

Clear advantages of label propagation include its easeto be parallelized and its potential online implementationin real time networks Since each node is required onlyto know information about its neighbours and updatesitself according to the common rules parallelism can beeasily achieved This brings us to another technical pointthat when the algorithm is completely parallelized evenwithout explicit synchronization it would tend to behave

8

like the synchronous version of the algorithm And this isthe key reason why we have stressed in the literature thatimproving both synchronous and asynchronous versionsof the algorithm are equally important

The running time in a parallel environment effectivelyreduces to k if there are Θ(n) machines This can beachieved in real world ubiquitous system such as a mo-bile ad-hoc network (MANET) or potentially on OSNsthemselves (if members are willing to contribute theircomputational power) in real time For instance socialinformation such as the community structure is known tobenefit routing in MANET [21] Moreover in such sce-narios the space requirement for storing link informationwould become decentralised and thus insignificant

On the same note we see great potential in adaptingthe algorithm for online community detection in real-timedynamic networks where the presence of nodes and edgesare constantly evolving The microscopic movements andintermittent presence of nodes contribute to changes interms of weights of the edges These in turn result in fivedistinct macroscopic behaviours of communities namelygrowth shrinkage union division and death of commu-nities The challenge indeed is to detect local changeswithout the need for a global update given limited com-putational resource or time constraint We believe labelpropagation is particularly suited in this paradigm andthus propose this as future work

V COMPARISONS

We first look at two relatively large and previouslystudied networks for comparisons These networks arerespectively the Amazon Purchasing Network analysedin [13] and the actor collaboration network [22] As donein [13] we assume all edges to be undirected to ease theanalysis With the added heuristics the algorithm is ableto perform within 5 of CNM and 10 of the adapta-tion by Danon Dıaz-Guilera and Arenas (CNM-DDA)[14] in terms of modularity (cf Table I) LPA how-ever achieves the result in a matter of minutes which isunparalleled by the above

For a more standardized comparison we turn to the re-cently proposed benchmark graphs by Lancichinetti etal [19] an extension to the well known GN benchmark[12] which incorporates more realistic scale-free degreeand cluster-size distributions We follow closely the im-plementation of the benchmark graphs as described in[19] and compare the original LPA with the improvedversion on the graphs of size 1000 and 5000 To contrastlabel propagation with general fast modularity maximi-sation algorithms we also run the benchmarks on theCNM algorithm

As shown in Fig 11 both implementations achieve su-perior accuracy over CNM in terms of normalised mutualinformation (NMI) even up to a mixing parameter of 06Interestingly the original method shows signs of failureat micro = 05 in the N = 1000 d = 50 benchmark graphs

(cf Fig 11(b)) We believe this corresponds to theformation of monster communities discussed in SectionIVA The number of nodes and the average degree ofthe benchmark graphs in effect dictate the number andsizes of the original communities generated The resultshence point out that denser and less modularized graphsare relatively prone to the formation of monster commu-nities However the application of hop attenuation asexemplified in Fig 11(b) greatly improves the overallperformance of LPA in such scenarios

Importantly as opposed to label propagation we cansee that CNM algorithmrsquos performance does not merelydepend on the mixing parameter but also the averagedegree of the network Resolution limit of modularitymaximization is reflected by CNMrsquos worse performancein graphs having a smaller average degree Although inmost configurations all algorithms expectedly manage touncover a modularity value of a similar magnitude thereal accuracy in terms of NMI does not follow Thisfinding corresponds to the notion in [17] that modularitymaximisation does not simply translate to actual com-munities

VI CONCLUSIONS

In this literature we have empirically analysed a scal-able efficient and accurate community detection algo-rithm We discussed the behaviours and emphasized theimportance of both the synchronous and asynchronousimplementations of the algorithm We suggested poten-tial heuristics that can be applied to improve its aver-age detection performance and adaptability Most im-portantly we contrasted the algorithm with modularity-gain based methods in terms of community detection ac-curacy and observed how it can be potentially appliedonline and concurrently in large-scale and real-time dy-namic networks

Understanding the dynamics of this algorithm wouldbe the major future work of this discipline before onedevises further heuristics to improve the algorithm Webelieve that each notion discussed in Section IV is wor-thy of further inspection An equally important point isto analyse mathematically or empirically on how to bestadapt the algorithm to different types of networks by theadded heuristics How do different network topologiesand models affect the algorithmrsquos convergent behaviourThese are all valuable questions to be investigated in fu-ture work

In summary we show that label propagation with theappropriate modifications is a more reliable and efficientmethod in detecting communities in large-scale networksthan popular existing methods We trust that with fur-ther understanding and analysis epidemic-based commu-nity detection would be of substantial value to the field

9

Network Size Directed Links Q(Claimed) Peak Q(Sync) Peak Q(Async)

Amazon Purchase(Marrsquo03) 409687 4929260 0745 [13] 0724 0727Actor Collaboration 374511 30052912 0528 [4] 0719 [14] 0642 0660

TABLE I The results correspond to the peak modularity achieved in 10 iterations or less with f = Deg and m = 01 and agradually decreasing δ as discussed in Section IV B

CNM LPA LPA-δ

02

03

04

05

06

07

08

09

1

01 06

Mod

ular

ity Q

03

04

05

06

07

08

09

1

01 02 03 04 05 06

NM

I

Mixing parameter micro

(a)N = 1000 d = 15

02

03

04

05

06

07

08

09

1

01 06

Mod

ular

ity Q

03

04

05

06

07

08

09

1

01 02 03 04 05 06N

MI

Mixing parameter micro

(b)N = 1000 d = 50

02

03

04

05

06

07

08

09

1

01 06

Mod

ular

ity Q

03

04

05

06

07

08

09

1

01 02 03 04 05 06

NM

I

Mixing parameter micro

(c)N = 5000 d = 15

02

03

04

05

06

07

08

09

1

01 06

Mod

ular

ity Q

03

04

05

06

07

08

09

1

01 02 03 04 05 06

NM

I

Mixing parameter micro

(d)N = 5000 d = 50

FIG 11 Average performance comparisons between the three algorithms in the benchmark graphs with size N and averagedegree d Both versions of LPA here are asynchronous LPA-δ implements a gradually decreasing δ as discussed in SectionIVB All benchmark graphs have power-law degree and cluster-size distributions with exponent 3 and 2 For N = 1000 theresults are the average over 100 realisations for N = 5000 over 10 realisations

Acknowledgments

We thank Franco Bagnoli and Vito Latora for help-ful comments We are grateful to Eric Promislow for

providing us with the Amazon network data Networkvisualisations are carried out on Cytoscape [24] Thisproject is supported by EC IST SOCIALNETS - Grantagreement number 217141

[1] J Kleinberg in STOC rsquo00 Proceedings of the thirty-

second annual ACM symposium on Theory of computing

(ACM New York NY USA 2000) pp 163ndash170 ISBN

10

1-58113-184-4[2] D J Watts and S H Strogatz Nature (London) 393

440 (1998)[3] A Mislove M Marcon K P Gummadi P Druschel

and B Bhattacharjee in IMC rsquo07 Proceedings of the 7th

ACM SIGCOMM conference on Internet measurement

(ACM New York NY USA 2007) pp 29ndash42[4] U N Raghavan R Albert and S Kumara Phys Rev

E 76 036106 (pages 11) (2007)[5] D Lusseau and M E J Newman Proc R Soc London

B 271 S477 (2004)[6] G W Flake S Lawrence et al IEEE Computer 35 66

(2002)[7] E Ravasz A L Somera D A Mongru Z N Oltvai

and A L Barabasi Science 297 1551 (2002)[8] M E J Newman Eur Phys J B 38 321 (2004)[9] L Danon J Duch et al J Stat Mech p P09008

(2005)[10] S Boccaletti V Latora Y Moreno M Chavez and D-

U Hwang Phys Rep 424 175 (2006)[11] F Radicchi C Castellano F Cecconi V Loreto and

D Parisi PNAS 101 2658 (2004)[12] M E Newman and M Girvan Phys Rev E 69 026113

(2004)[13] A Clauset M E J Newman and C Moore Phys Rev

E 70 066111 (2004)[14] L Danon A Dıaz-Guilera and A Arenas J Stat Mech

2006 P11010 (2006)[15] K Wakita and T Tsurumi in WWW rsquo07 Proceedings

of the 16th international conference on World Wide Web

(ACM New York NY USA 2007) pp 1275ndash1276[16] V D Blondel J-L Guillaume R Lambiotte and

E Lefebvre J Stat Mech 10 P10008 (2008) 08030476[17] S Fortunato and M Barthelemy PNAS 104 36 (2007)[18] R Albert and A-L Barabasi Rev Mod Phys 74 47

(2002)[19] A Lancichinetti S Fortunato and F Radicchi Phys

Rev E 78 046110 (2008)[20] G Palla I Derenyi et al Nature 435 814 (2005) ISSN

0028-0836[21] P Hui J Crowcroft and E Yoneki in MobiHoc rsquo08

Proceedings of the 9th ACM international symposium on

Mobile ad hoc networking and computing (ACM NewYork NY USA 2008) pp 241ndash250 ISBN 978-1-60558-073-9

[22] A-L Barabasi and R Albert Science 286 509 (1999)[23] See the Facebookcom Six Degrees Project[24] httpwwwcytoscapeorg

8

like the synchronous version of the algorithm And this isthe key reason why we have stressed in the literature thatimproving both synchronous and asynchronous versionsof the algorithm are equally important

The running time in a parallel environment effectivelyreduces to k if there are Θ(n) machines This can beachieved in real world ubiquitous system such as a mo-bile ad-hoc network (MANET) or potentially on OSNsthemselves (if members are willing to contribute theircomputational power) in real time For instance socialinformation such as the community structure is known tobenefit routing in MANET [21] Moreover in such sce-narios the space requirement for storing link informationwould become decentralised and thus insignificant

On the same note we see great potential in adaptingthe algorithm for online community detection in real-timedynamic networks where the presence of nodes and edgesare constantly evolving The microscopic movements andintermittent presence of nodes contribute to changes interms of weights of the edges These in turn result in fivedistinct macroscopic behaviours of communities namelygrowth shrinkage union division and death of commu-nities The challenge indeed is to detect local changeswithout the need for a global update given limited com-putational resource or time constraint We believe labelpropagation is particularly suited in this paradigm andthus propose this as future work

V COMPARISONS

We first look at two relatively large and previouslystudied networks for comparisons These networks arerespectively the Amazon Purchasing Network analysedin [13] and the actor collaboration network [22] As donein [13] we assume all edges to be undirected to ease theanalysis With the added heuristics the algorithm is ableto perform within 5 of CNM and 10 of the adapta-tion by Danon Dıaz-Guilera and Arenas (CNM-DDA)[14] in terms of modularity (cf Table I) LPA how-ever achieves the result in a matter of minutes which isunparalleled by the above

For a more standardized comparison we turn to the re-cently proposed benchmark graphs by Lancichinetti etal [19] an extension to the well known GN benchmark[12] which incorporates more realistic scale-free degreeand cluster-size distributions We follow closely the im-plementation of the benchmark graphs as described in[19] and compare the original LPA with the improvedversion on the graphs of size 1000 and 5000 To contrastlabel propagation with general fast modularity maximi-sation algorithms we also run the benchmarks on theCNM algorithm

As shown in Fig 11 both implementations achieve su-perior accuracy over CNM in terms of normalised mutualinformation (NMI) even up to a mixing parameter of 06Interestingly the original method shows signs of failureat micro = 05 in the N = 1000 d = 50 benchmark graphs

(cf Fig 11(b)) We believe this corresponds to theformation of monster communities discussed in SectionIVA The number of nodes and the average degree ofthe benchmark graphs in effect dictate the number andsizes of the original communities generated The resultshence point out that denser and less modularized graphsare relatively prone to the formation of monster commu-nities However the application of hop attenuation asexemplified in Fig 11(b) greatly improves the overallperformance of LPA in such scenarios

Importantly as opposed to label propagation we cansee that CNM algorithmrsquos performance does not merelydepend on the mixing parameter but also the averagedegree of the network Resolution limit of modularitymaximization is reflected by CNMrsquos worse performancein graphs having a smaller average degree Although inmost configurations all algorithms expectedly manage touncover a modularity value of a similar magnitude thereal accuracy in terms of NMI does not follow Thisfinding corresponds to the notion in [17] that modularitymaximisation does not simply translate to actual com-munities

VI CONCLUSIONS

In this literature we have empirically analysed a scal-able efficient and accurate community detection algo-rithm We discussed the behaviours and emphasized theimportance of both the synchronous and asynchronousimplementations of the algorithm We suggested poten-tial heuristics that can be applied to improve its aver-age detection performance and adaptability Most im-portantly we contrasted the algorithm with modularity-gain based methods in terms of community detection ac-curacy and observed how it can be potentially appliedonline and concurrently in large-scale and real-time dy-namic networks

Understanding the dynamics of this algorithm wouldbe the major future work of this discipline before onedevises further heuristics to improve the algorithm Webelieve that each notion discussed in Section IV is wor-thy of further inspection An equally important point isto analyse mathematically or empirically on how to bestadapt the algorithm to different types of networks by theadded heuristics How do different network topologiesand models affect the algorithmrsquos convergent behaviourThese are all valuable questions to be investigated in fu-ture work

In summary we show that label propagation with theappropriate modifications is a more reliable and efficientmethod in detecting communities in large-scale networksthan popular existing methods We trust that with fur-ther understanding and analysis epidemic-based commu-nity detection would be of substantial value to the field

9

Network Size Directed Links Q(Claimed) Peak Q(Sync) Peak Q(Async)

Amazon Purchase(Marrsquo03) 409687 4929260 0745 [13] 0724 0727Actor Collaboration 374511 30052912 0528 [4] 0719 [14] 0642 0660

TABLE I The results correspond to the peak modularity achieved in 10 iterations or less with f = Deg and m = 01 and agradually decreasing δ as discussed in Section IV B

CNM LPA LPA-δ

02

03

04

05

06

07

08

09

1

01 06

Mod

ular

ity Q

03

04

05

06

07

08

09

1

01 02 03 04 05 06

NM

I

Mixing parameter micro

(a)N = 1000 d = 15

02

03

04

05

06

07

08

09

1

01 06

Mod

ular

ity Q

03

04

05

06

07

08

09

1

01 02 03 04 05 06N

MI

Mixing parameter micro

(b)N = 1000 d = 50

02

03

04

05

06

07

08

09

1

01 06

Mod

ular

ity Q

03

04

05

06

07

08

09

1

01 02 03 04 05 06

NM

I

Mixing parameter micro

(c)N = 5000 d = 15

02

03

04

05

06

07

08

09

1

01 06

Mod

ular

ity Q

03

04

05

06

07

08

09

1

01 02 03 04 05 06

NM

I

Mixing parameter micro

(d)N = 5000 d = 50

FIG 11 Average performance comparisons between the three algorithms in the benchmark graphs with size N and averagedegree d Both versions of LPA here are asynchronous LPA-δ implements a gradually decreasing δ as discussed in SectionIVB All benchmark graphs have power-law degree and cluster-size distributions with exponent 3 and 2 For N = 1000 theresults are the average over 100 realisations for N = 5000 over 10 realisations

Acknowledgments

We thank Franco Bagnoli and Vito Latora for help-ful comments We are grateful to Eric Promislow for

providing us with the Amazon network data Networkvisualisations are carried out on Cytoscape [24] Thisproject is supported by EC IST SOCIALNETS - Grantagreement number 217141

[1] J Kleinberg in STOC rsquo00 Proceedings of the thirty-

second annual ACM symposium on Theory of computing

(ACM New York NY USA 2000) pp 163ndash170 ISBN

10

1-58113-184-4[2] D J Watts and S H Strogatz Nature (London) 393

440 (1998)[3] A Mislove M Marcon K P Gummadi P Druschel

and B Bhattacharjee in IMC rsquo07 Proceedings of the 7th

ACM SIGCOMM conference on Internet measurement

(ACM New York NY USA 2007) pp 29ndash42[4] U N Raghavan R Albert and S Kumara Phys Rev

E 76 036106 (pages 11) (2007)[5] D Lusseau and M E J Newman Proc R Soc London

B 271 S477 (2004)[6] G W Flake S Lawrence et al IEEE Computer 35 66

(2002)[7] E Ravasz A L Somera D A Mongru Z N Oltvai

and A L Barabasi Science 297 1551 (2002)[8] M E J Newman Eur Phys J B 38 321 (2004)[9] L Danon J Duch et al J Stat Mech p P09008

(2005)[10] S Boccaletti V Latora Y Moreno M Chavez and D-

U Hwang Phys Rep 424 175 (2006)[11] F Radicchi C Castellano F Cecconi V Loreto and

D Parisi PNAS 101 2658 (2004)[12] M E Newman and M Girvan Phys Rev E 69 026113

(2004)[13] A Clauset M E J Newman and C Moore Phys Rev

E 70 066111 (2004)[14] L Danon A Dıaz-Guilera and A Arenas J Stat Mech

2006 P11010 (2006)[15] K Wakita and T Tsurumi in WWW rsquo07 Proceedings

of the 16th international conference on World Wide Web

(ACM New York NY USA 2007) pp 1275ndash1276[16] V D Blondel J-L Guillaume R Lambiotte and

E Lefebvre J Stat Mech 10 P10008 (2008) 08030476[17] S Fortunato and M Barthelemy PNAS 104 36 (2007)[18] R Albert and A-L Barabasi Rev Mod Phys 74 47

(2002)[19] A Lancichinetti S Fortunato and F Radicchi Phys

Rev E 78 046110 (2008)[20] G Palla I Derenyi et al Nature 435 814 (2005) ISSN

0028-0836[21] P Hui J Crowcroft and E Yoneki in MobiHoc rsquo08

Proceedings of the 9th ACM international symposium on

Mobile ad hoc networking and computing (ACM NewYork NY USA 2008) pp 241ndash250 ISBN 978-1-60558-073-9

[22] A-L Barabasi and R Albert Science 286 509 (1999)[23] See the Facebookcom Six Degrees Project[24] httpwwwcytoscapeorg

9

Network Size Directed Links Q(Claimed) Peak Q(Sync) Peak Q(Async)

Amazon Purchase(Marrsquo03) 409687 4929260 0745 [13] 0724 0727Actor Collaboration 374511 30052912 0528 [4] 0719 [14] 0642 0660

TABLE I The results correspond to the peak modularity achieved in 10 iterations or less with f = Deg and m = 01 and agradually decreasing δ as discussed in Section IV B

[Figure 11, four panels: (a) N = 1000, d = 15; (b) N = 1000, d = 50; (c) N = 5000, d = 15; (d) N = 5000, d = 50. Each panel plots modularity Q and NMI against the mixing parameter µ for CNM, LPA and LPA-δ.]

FIG. 11: Average performance comparisons between the three algorithms in the benchmark graphs with size N and average degree d. Both versions of LPA here are asynchronous. LPA-δ implements a gradually decreasing δ, as discussed in Section IV B. All benchmark graphs have power-law degree and cluster-size distributions with exponents 3 and 2, respectively. For N = 1000, the results are the average over 100 realisations; for N = 5000, over 10 realisations.
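The NMI axis in Fig. 11 measures how closely a detected partition matches the planted one. The caption does not restate the formula, so the sketch below assumes the normalisation of Danon et al. [9], NMI = 2 I(A;B) / (H(A) + H(B)). The function name `nmi` and the label-sequence input format are illustrative choices, not the authors' code.

```python
import math
from collections import Counter


def nmi(found, truth):
    """Normalised mutual information between two partitions.

    found, truth : equal-length sequences; position i holds the community
                   label of node i in the detected and planted partitions.
    """
    n = len(found)
    if n == 0 or n != len(truth):
        raise ValueError("partitions must be non-empty and of equal length")

    count_a = Counter(found)
    count_b = Counter(truth)
    joint = Counter(zip(found, truth))  # co-occurrence counts of label pairs

    # Mutual information I(A;B) from the joint and marginal counts.
    mutual = sum(
        (n_ab / n) * math.log(n_ab * n / (count_a[a] * count_b[b]))
        for (a, b), n_ab in joint.items()
    )
    # Shannon entropies of the two partitions.
    h_a = -sum((c / n) * math.log(c / n) for c in count_a.values())
    h_b = -sum((c / n) * math.log(c / n) for c in count_b.values())

    if h_a + h_b == 0:
        return 1.0  # both partitions consist of a single community
    return 2 * mutual / (h_a + h_b)
```

Label names do not matter: `nmi([0, 0, 1, 1], ["x", "x", "y", "y"])` evaluates to 1 up to rounding, while two statistically independent partitions such as `[0, 0, 1, 1]` and `[0, 1, 0, 1]` give 0.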

Acknowledgments

We thank Franco Bagnoli and Vito Latora for helpful comments. We are grateful to Eric Promislow for providing us with the Amazon network data. Network visualisations are carried out on Cytoscape [24]. This project is supported by EC IST SOCIALNETS - Grant agreement number 217141.

[1] J. Kleinberg, in STOC ’00: Proceedings of the thirty-second annual ACM symposium on Theory of computing (ACM, New York, NY, USA, 2000), pp. 163–170, ISBN 1-58113-184-4.
[2] D. J. Watts and S. H. Strogatz, Nature (London) 393, 440 (1998).
[3] A. Mislove, M. Marcon, K. P. Gummadi, P. Druschel, and B. Bhattacharjee, in IMC ’07: Proceedings of the 7th ACM SIGCOMM conference on Internet measurement (ACM, New York, NY, USA, 2007), pp. 29–42.
[4] U. N. Raghavan, R. Albert, and S. Kumara, Phys. Rev. E 76, 036106 (pages 11) (2007).
[5] D. Lusseau and M. E. J. Newman, Proc. R. Soc. London B 271, S477 (2004).
[6] G. W. Flake, S. Lawrence, et al., IEEE Computer 35, 66 (2002).
[7] E. Ravasz, A. L. Somera, D. A. Mongru, Z. N. Oltvai, and A. L. Barabasi, Science 297, 1551 (2002).
[8] M. E. J. Newman, Eur. Phys. J. B 38, 321 (2004).
[9] L. Danon, J. Duch, et al., J. Stat. Mech. p. P09008 (2005).
[10] S. Boccaletti, V. Latora, Y. Moreno, M. Chavez, and D.-U. Hwang, Phys. Rep. 424, 175 (2006).
[11] F. Radicchi, C. Castellano, F. Cecconi, V. Loreto, and D. Parisi, PNAS 101, 2658 (2004).
[12] M. E. Newman and M. Girvan, Phys. Rev. E 69, 026113 (2004).
[13] A. Clauset, M. E. J. Newman, and C. Moore, Phys. Rev. E 70, 066111 (2004).
[14] L. Danon, A. Díaz-Guilera, and A. Arenas, J. Stat. Mech. 2006, P11010 (2006).
[15] K. Wakita and T. Tsurumi, in WWW ’07: Proceedings of the 16th international conference on World Wide Web (ACM, New York, NY, USA, 2007), pp. 1275–1276.
[16] V. D. Blondel, J.-L. Guillaume, R. Lambiotte, and E. Lefebvre, J. Stat. Mech. 10, P10008 (2008), 0803.0476.
[17] S. Fortunato and M. Barthelemy, PNAS 104, 36 (2007).
[18] R. Albert and A.-L. Barabasi, Rev. Mod. Phys. 74, 47 (2002).
[19] A. Lancichinetti, S. Fortunato, and F. Radicchi, Phys. Rev. E 78, 046110 (2008).
[20] G. Palla, I. Derenyi, et al., Nature 435, 814 (2005), ISSN 0028-0836.
[21] P. Hui, J. Crowcroft, and E. Yoneki, in MobiHoc ’08: Proceedings of the 9th ACM international symposium on Mobile ad hoc networking and computing (ACM, New York, NY, USA, 2008), pp. 241–250, ISBN 978-1-60558-073-9.
[22] A.-L. Barabasi and R. Albert, Science 286, 509 (1999).
[23] See the Facebook.com Six Degrees Project.
[24] http://www.cytoscape.org
