The Noise Cluster Model: a Greedy Solution to the Network Communities Extraction Problem
Etienne Côme, [email protected] & Eustache Diemert, [email protected] (INRETS, BestOfMedia)
6 October 2010

Marami 2010



Page 2: Marami 2010

Outline

1 Introduction

2 Existing solutions for the community extraction problem

3 Background on Erdos-Renyi mixture

4 The noise cluster model

5 Community extraction using the noise cluster model

6 Preliminary experiments: blog community extraction

7 Conclusion & future work


Introduction

Motivations
• Extract one community using seed nodes taken from that community
• On-line algorithm (the full graph is never stored)

Solution: community extraction
• extract one community
• semi-supervised method: some community members are known

Solution: noise cluster model
• simple generative model
• one community surrounded by noise


Introduction, (toy example)


Introduction, (graph clustering)


Introduction, (seeds)


Introduction, (community extraction)


Useless

Useful


Advantages

• seeds give a focus for processing the graph
• better complexity (it depends on the community size rather than on the graph size)
• exploring the full graph can be avoided
• no problem of balance between community sizes


Existing solutions for the community extraction problem

Bagrow et al. [BB05]
• grow a breadth-first tree outward from one seed node;
• stop when the rate of expansion falls below an arbitrary threshold (the rate of expansion being the proportion of edges found at the current level that lead to nodes not yet known).

Problems
• can only deal with one seed
• all nodes of a level are included


Clauset [Cla05]
• greedy optimization of a quantity called local modularity $L_{mod}$;
• boundary $B$: the subset of known nodes that have at least one neighbour in the set of yet unknown nodes;
• local modularity: number of edges between the boundary $B$ and the set of known nodes $C$, over the total number of edges with one extremity in $B$.

$$L_{mod} = \frac{\sum_{i \in C, j \in B} B_{ij} + \sum_{i \in B, j \in C} B_{ij}}{\sum_{i,j} B_{ij}}, \tag{1}$$

with $B_{ij} = 1$ if $i$ and $j$ are connected and either vertex is in $B$.

Problems
• can only deal with one seed
• the stopping criterion needs tuning


Other solutions
• [AL06] random walks and conductance
• [SG10] combinatorial algorithms

Problems
• complexity scales with the graph size


Background on Erdos-Renyi mixture

Graph clustering

Generative setting (Erdos-Renyi mixture, block model)

Variable definitions:

• $X_{ij}$ are binary variables encoding the presence or absence of a link from node $i$ to node $j$:

$$x_{ij} = \begin{cases} 1, & \text{if there is a link from } i \text{ to } j \\ 0, & \text{otherwise.} \end{cases} \tag{2}$$

• $Z_{jk}$ are dummy variables encoding cluster membership; they take values $z_{jk}$:

$$z_{jk} = \begin{cases} 1, & \text{if } j \text{ belongs to cluster } k \\ 0, & \text{otherwise.} \end{cases} \tag{3}$$


Erdos-Renyi mixture

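Model definition [DPS08]

$$Z_i = (Z_{ik})_k \overset{\text{i.i.d.}}{\sim} \mathcal{M}(1, \gamma), \quad \forall i \in \{1, \dots, N\} \tag{4}$$

$$X_{ij} \mid Z_{ik} \times Z_{jl} = 1 \overset{\text{i.i.d.}}{\sim} \mathcal{B}(\pi_{kl}), \quad \forall i, j \in \{1, \dots, N\} \tag{5}$$

where $\mathcal{M}(1, \gamma)$ is the multinomial distribution with a single draw and cluster proportions $\gamma$, and $\mathcal{B}(\pi_{kl})$ is the Bernoulli distribution whose parameter $\pi_{kl}$ is the probability of a link from a node of cluster $k$ to a node of cluster $l$.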


Erdos-Renyi mixture

Figure: Adjacency matrix simulated using an Erdos-Renyi mixture


The noise cluster model

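Model definition

$$Z_i \overset{\text{i.i.d.}}{\sim} \mathcal{B}(\gamma), \quad \forall i \in \{1, \dots, N\}, \tag{6}$$

$$X_{ij} \mid Z_i \times Z_j = 1 \overset{\text{i.i.d.}}{\sim} \mathcal{B}(\alpha), \quad \forall i, j \in \{1, \dots, N\}, \tag{7}$$

$$X_{ij} \mid Z_i \times Z_j = 0 \overset{\text{i.i.d.}}{\sim} \mathcal{B}(\beta), \quad \forall i, j \in \{1, \dots, N\}, \tag{8}$$

with $z_i = 1$ if $i$ belongs to the community and 0 otherwise. Here $\gamma$ is the prior probability that a node belongs to the community, $\alpha$ the probability of a link between two community members, and $\beta$ the probability of any other link (at least one endpoint in the noise).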
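As an illustration of definitions (6) to (8), a minimal Python sketch that draws a partition and a directed adjacency matrix from the noise cluster model could look as follows. The function name `sample_noise_cluster`, its arguments and the exclusion of self-links are assumptions made for this example, not part of the slides.

```python
import random


def sample_noise_cluster(n, alpha, beta, gamma, seed=0):
    """Draw (z, x) from the noise cluster model; hypothetical helper."""
    rng = random.Random(seed)
    # Eq. (6): each node joins the community with probability gamma.
    z = [1 if rng.random() < gamma else 0 for _ in range(n)]
    x = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # assumption: no self-links
            # Eq. (7) inside the community, Eq. (8) for every other pair.
            p = alpha if z[i] * z[j] == 1 else beta
            if rng.random() < p:
                x[i][j] = 1
    return z, x


z, x = sample_noise_cluster(n=200, alpha=0.1, beta=0.001, gamma=0.05)
print(sum(z), sum(map(sum, x)))  # community size and number of links
```

With a small β and a larger α, the community members form a dense block in the adjacency matrix while the rest stays almost empty.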


Figure: Adjacency matrix simulated using the noise cluster model.


Basic quantities

• Community size:

$$N_c = \sum_i z_i$$

• Node degrees:

$$d_j^{in} = \sum_{i : z_i = 1} x_{ij}, \quad d_j^{out} = \sum_{i : z_i = 1} x_{ji}, \quad d_j = \sum_{i : z_i = 1} (x_{ij} + x_{ji})$$

• Posterior probabilities:

$$p_j^{in} = P(Z_j = 1 \mid X_{ij} = x_{ij}, Z_i = z_i, \forall i \in \{1, \dots, N\}),$$

$$p_j^{out} = P(Z_j = 1 \mid X_{ji} = x_{ji}, Z_i = z_i, \forall i \in \{1, \dots, N\}),$$

$$p_j^{in,out} = P(Z_j = 1 \mid X_{ij} = x_{ij}, X_{ji} = x_{ji}, Z_i = z_i, \forall i \in \{1, \dots, N\}).$$


Simplifications

Community membership posterior probabilities are the quantities of interest for deciding whether a node must be added to the community. They depend only on:

• the parameters $(\alpha, \beta, \gamma)$;
• the links with community members ($d_j^{in}$, $d_j^{out}$, $d_j^{in,out}$ respectively);
• the community size ($N_c$).

Example for $p_j^{in}$:

$$p_j^{in} = \frac{\alpha^{d_j^{in}} \times (1-\alpha)^{(N_c - d_j^{in})} \times \gamma}{\alpha^{d_j^{in}} \times (1-\alpha)^{(N_c - d_j^{in})} \times \gamma + \beta^{d_j^{in}} \times (1-\beta)^{(N_c - d_j^{in})} \times (1-\gamma)}$$
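The posterior above can be evaluated as sketched below. The helper names `community_degrees` and `posterior_in` are hypothetical, and the sketch assumes that all three parameters lie strictly between 0 and 1. It works on the log-odds rather than on the raw products, which is algebraically the same formula but avoids underflow when $N_c$ is large.

```python
import math


def community_degrees(x, z, j):
    """Return (d_in, d_out) of node j, counting only links with community members."""
    members = [i for i, zi in enumerate(z) if zi == 1]
    d_in = sum(x[i][j] for i in members)   # links from members to j
    d_out = sum(x[j][i] for i in members)  # links from j to members
    return d_in, d_out


def posterior_in(d_in, n_c, alpha, beta, gamma):
    """p_j^in computed from d_j^in, N_c and (alpha, beta, gamma)."""
    # Log of (numerator term) / (noise term) in the posterior formula.
    log_odds = (d_in * math.log(alpha / beta)
                + (n_c - d_in) * math.log((1 - alpha) / (1 - beta))
                + math.log(gamma / (1 - gamma)))
    # Numerically stable logistic function.
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    e = math.exp(log_odds)
    return e / (1.0 + e)


for d in (4, 5, 6, 7):
    print(d, round(posterior_in(d, 200, 0.1, 0.001, 0.05), 3))
```

The posterior rises sharply with the number of links received from the community, which is what makes a simple degree threshold sufficient.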


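Community membership test

The community membership test is equivalent to thresholding the number of links shared with community members:

$$\{p_j^{in} > s\} \Leftrightarrow \{d_j^{in} > d_{min}\}, \tag{9}$$

with

$$d_{min} = \left\lfloor \frac{\log\left(s \times (1-\beta)^{N_c} \times (1-\gamma)\right) - \log\left((1-s) \times (1-\alpha)^{N_c} \times \gamma\right)}{\log\left(\alpha \times (1-\beta)\right) - \log\left((1-\alpha) \times \beta\right)} \right\rfloor$$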
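The threshold of test (9) can be evaluated numerically. The following is a minimal sketch, not the authors' code: the function name `d_min` and its argument names are hypothetical, and it assumes 0 < β < α < 1 and 0 < γ, s < 1, so that every logarithm is defined and the denominator is positive.

```python
import math


def d_min(alpha, beta, gamma, n_c, s=0.5):
    """Largest in-degree (from community members) that still fails test (9)."""
    num = (math.log(s) + n_c * math.log(1 - beta) + math.log(1 - gamma)
           - math.log(1 - s) - n_c * math.log(1 - alpha) - math.log(gamma))
    den = math.log(alpha * (1 - beta)) - math.log((1 - alpha) * beta)
    return math.floor(num / den)


# With alpha = 0.1, beta = 0.001, gamma = 0.05, N_c = 200 and s = 0.5:
print(d_min(0.1, 0.001, 0.05, 200))  # -> 5
```

A node is therefore accepted in this setting once at least 6 community members link to it, and the threshold grows with the community size.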


Figure: (top) values of $p_j^{in}$ with respect to $d_j^{in}$ with α = 0.1, β = 0.001, γ = 0.05 and $N_c$ = 200; (bottom) evolution of $d_{min}$ with respect to the community size $N_c$ with α = 0.1, β = 0.001, γ = 0.05 and s = 0.5.


Community extraction using the noise cluster model

Online learning [ZAM08]

Classification likelihood

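In the case of a full adjacency matrix, the classification log-likelihood is defined as:

$$\begin{aligned}
L_c(X, Z, \theta) = {} & \sum_i z_i \log(\gamma) + \sum_i (1 - z_i) \log(1 - \gamma) \\
& + \sum_{i,j : i \neq j} z_i \times z_j \times x_{ij} \log(\alpha) + \sum_{i,j : i \neq j} z_i \times z_j \times (1 - x_{ij}) \log(1 - \alpha) \\
& + \sum_{i,j : i \neq j} (1 - z_i \times z_j) \times x_{ij} \log(\beta) + \sum_{i,j : i \neq j} (1 - z_i \times z_j) \times (1 - x_{ij}) \log(1 - \beta)
\end{aligned}$$

with $Z = \{z_1, \dots, z_N\}$, $X = \{x_{ij} : i \neq j, \; i, j \in \{1, \dots, N\}\}$, and $\theta = (\gamma, \alpha, \beta)$ the parameter vector.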
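For a small graph held in memory, the classification log-likelihood can be computed directly from its definition. The sketch below is only an illustration: the function name `classification_log_likelihood` is hypothetical, `x` is assumed to be a list of lists of 0/1 values, `z` a list of 0/1 values, and the parameters are assumed to lie strictly between 0 and 1.

```python
import math


def classification_log_likelihood(x, z, gamma, alpha, beta):
    """L_c(X, Z, theta) for a full adjacency matrix x and a partition z."""
    n = len(z)
    # Membership terms: z_i log(gamma) + (1 - z_i) log(1 - gamma).
    ll = sum(math.log(gamma) if zi == 1 else math.log(1 - gamma) for zi in z)
    # Link terms over ordered pairs i != j.
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # alpha applies when both nodes are in the community, beta otherwise.
            p = alpha if z[i] * z[j] == 1 else beta
            ll += math.log(p) if x[i][j] == 1 else math.log(1 - p)
    return ll


x = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
z = [1, 1, 0]
print(classification_log_likelihood(x, z, gamma=0.5, alpha=0.9, beta=0.1))
```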


Maximization for a known partition

If the partition $Z = \{z_1, \dots, z_N\}$ is known, and with a square adjacency matrix of size $N \times N$, the parameter vector maximizing the classification likelihood is given by:

$$\gamma = \frac{N_c}{N}, \tag{10}$$

$$\alpha = \frac{1}{N_c^2} \sum_{i,j=1,\; i \neq j}^{N} (z_i \times z_j) x_{ij}, \tag{11}$$

$$\beta = \frac{1}{\bar{N}_c \times (N + N_c)} \sum_{i,j=1,\; i \neq j}^{N} (1 - z_i \times z_j) x_{ij}, \tag{12}$$

with $N_c$ the number of nodes belonging to the community and $\bar{N}_c$ the number of nodes that do not belong to the community.
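Estimates (10) to (12) amount to counting links inside and outside the community. A minimal sketch, with the hypothetical function name `estimate_parameters` and assuming $0 < N_c < N$, is given below. The denominators follow equations (11) and (12); counting ordered pairs with $i \neq j$ exactly would give $N_c (N_c - 1)$ and $\bar{N}_c (N + N_c - 1)$ instead, a difference that becomes negligible for large communities.

```python
def estimate_parameters(x, z):
    """Return (gamma, alpha, beta) for adjacency matrix x and known partition z."""
    n = len(z)
    n_c = sum(z)         # nodes in the community
    n_bar = n - n_c      # nodes outside the community
    inside = 0           # links between two community members
    outside = 0          # every other link
    for i in range(n):
        for j in range(n):
            if i != j and x[i][j] == 1:
                if z[i] * z[j] == 1:
                    inside += 1
                else:
                    outside += 1
    gamma = n_c / n                          # Eq. (10)
    alpha = inside / n_c ** 2                # Eq. (11)
    beta = outside / (n_bar * (n + n_c))     # Eq. (12)
    return gamma, alpha, beta


x = [[0, 1, 1, 0], [1, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
print(estimate_parameters(x, [1, 1, 0, 0]))
```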


Proposed community extraction procedure

Algorithm

Use a breadth-first algorithm to explore the graph starting from the seeds. For each traversed vertex:

1 use the community membership test (9) to decide whether to add it to the community;

2 update the parameters (using (10), (11), (12)), taking the current partition into account;

until no more vertices can be added to the community.
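The procedure above can be sketched in Python as follows. This is a toy version under several assumptions that the slides do not state: the graph is a small in-memory dictionary `out_links` (node to list of out-neighbours, one key per node) rather than a crawl, initial values of (α, β, γ) are supplied by the caller, the estimates are recomputed from the whole toy graph after each addition, they are clamped away from 0 and 1, and candidates are rescanned in discovery order until a full pass adds nothing. All function names are hypothetical.

```python
import math

EPS = 1e-9  # assumption: keep estimates strictly inside (0, 1) so logs stay finite


def clamp(p):
    return min(max(p, EPS), 1.0 - EPS)


def d_min(alpha, beta, gamma, n_c, s):
    """Threshold of test (9); +inf when alpha <= beta, where the test is undefined."""
    den = math.log(alpha * (1 - beta)) - math.log((1 - alpha) * beta)
    if den <= 0:
        return math.inf
    num = (math.log(s) + n_c * math.log(1 - beta) + math.log(1 - gamma)
           - math.log(1 - s) - n_c * math.log(1 - alpha) - math.log(gamma))
    return math.floor(num / den)


def estimate(out_links, community):
    """Estimates (10)-(12) for the current partition of the in-memory toy graph."""
    n, n_c = len(out_links), len(community)
    inside = sum(1 for u in community for v in out_links.get(u, ()) if v in community)
    total = sum(len(vs) for vs in out_links.values())
    alpha = clamp(inside / n_c ** 2)
    beta = clamp((total - inside) / (max(n - n_c, 1) * (n + n_c)))
    gamma = clamp(n_c / n)
    return alpha, beta, gamma


def extract_community(out_links, seeds, alpha, beta, gamma, s=0.5):
    community = set(seeds)
    d_in = {}    # d_in[j]: number of links from community members to j
    order = []   # candidates in breadth-first discovery order

    def add_links(u):
        for v in out_links.get(u, ()):
            if v not in d_in:
                order.append(v)
            d_in[v] = d_in.get(v, 0) + 1

    for u in seeds:
        add_links(u)
    changed = True
    while changed:               # repeat passes until nothing can be added
        changed = False
        i = 0
        while i < len(order):    # `order` may grow while it is scanned
            j = order[i]
            i += 1
            if j in community:
                continue
            # Step 1: membership test (9).
            if d_in[j] > d_min(alpha, beta, gamma, len(community), s):
                community.add(j)
                add_links(j)
                # Step 2: parameter update (10)-(12).
                alpha, beta, gamma = estimate(out_links, community)
                changed = True
    return community, (alpha, beta, gamma)
```

Only the out-links of accepted members are ever read, so in this sketch the work is driven by the size of the extracted community and its immediate neighbourhood; the parameter update is the one step that, as written here, still looks at the whole toy graph.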


Preliminary experiments: blog community extraction

Settings

• multi-threaded web crawler coupled with the proposed community extraction procedure;
• seed URLs taken from Wikio (http://www.wikio.com), which provides several blog rankings for several topics;
• these rankings were used to provide 100 or 50 seeds to the algorithm for 4 test communities.


Blog community extraction

                     Comics (Fr)   Scrapbooking (Fr)   Food (U.S.)   Politics (U.S.)
Number of seeds      100           100                 50            50
N_c                  1,263         1,130               1,681         1,884
Number of edges      20,434        24,248              100,597       74,219
α                    0.01821       0.01899             0.03560       0.02091
β                    0.00093       0.00147             0.00091       0.00065
γ                    0.03048       0.05579             0.03060       0.01808
Biggest S.C.C.       1,251         1,129               1,667         1,877
Max level            3             2                   5             4
Diameter             6             7                   7             8
Radius               4             4                   4             3
Clustering coeff.    0.287         0.265               0.381         0.320
Transitivity         0.198         0.2                 0.290         0.223

Table: Global statistics and model parameters for the 4 communities.


Blog community extraction

      name                             level
 1    www.bouletcorp.com               0
 2    louromano.blogspot.com           2
 3    www.cartoonbrew.com              2
 4    yacinfields.blogspot.com         1
 5    polyminthe.blogspot.com          1
 6    marnette.canalblog.com           1
 7    blackwingdiaries.blogspot.com    2
 8    bastienvives.blogspot.com        1
 9    donshank.blogspot.com            2
10    john-nevarez.blogspot.com        2

Table: Best sites according to local PageRank for the Comics (Fr) community.


Figure: Word cloud for Politics (U.S.). The first 50 words in descending order of their Kullback-Leibler divergence are kept (the divergence is taken between the word document frequency in the community and in a negative class of 10,000 random blogs; texts were first preprocessed using a stop list and stemming). Word sizes are proportional to the word document frequencies in the community.


Figure: Word cloud for Food (U.S.). The first 50 words in descending order of their Kullback-Leibler divergence are kept (the divergence is taken between the word document frequency in the community and in a negative class of 10,000 random blogs; texts were first preprocessed using a stop list and stemming). Word sizes are proportional to the word document frequencies in the community.


Conclusion & future work

Conclusion
• simple, greedy approach;
• complexity scales with the community size, not the graph size;
• blog community extraction was successfully performed with this tool.

Future work

More work is needed to better understand and evaluate the approach:
• test the robustness of the method to noise in the seed set;
• test on other application domains (with ground truth);
• test using graph generation algorithms.


[AL06] R. Andersen and K. Lang. Communities from seed sets. In Proceedings of the 15th International Conference on World Wide Web, pages 223–232. ACM Press, 2006.

[BB05] J.P. Bagrow and E.M. Bollt. A local method for detecting communities. Phys Rev E Stat Nonlin Soft Matter Phys, 72(4):046108, 2005.

[Cla05] A. Clauset. Finding local community structure in networks. Phys Rev E Stat Nonlin Soft Matter Phys, 72(2):026132, 2005.

[DPS08] J. Daudin, F. Picard, and S. Robin. A mixture model for random graph. Statistics and Computing, 18:1–36, 2008.

[SG10] M. Sozio and A. Gionis. The community-search problem and how to plan a successful cocktail party. In Proceedings of the 16th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), pages –, 2010.

[ZAM08] H. Zanghi, C. Ambroise, and V. Miele. Fast online graph clustering via Erdos-Renyi mixture. Pattern Recognition, 41(12):3592–3599, December 2008.


Thanks for your attention!