The Noise Cluster Model: a Greedy Solution to the Network Communities Extraction Problem
Etienne Come & Eustache Diemert
6 October 2010
Come & Diemert (INRETS, BestOfMedia) The Noise Cluster Model 6 octobre 2010 1 / 32
Outline
1 Introduction
2 Existing solutions for the community extraction problem
3 Background on Erdos-Renyi mixture
4 The noise cluster model
5 Community extraction using the noise cluster model
6 Preliminary experiments : Blogs communities extraction
7 Conclusion & future works
Introduction
Motivations
- Extract one community using seed nodes from the community
- On-line algorithm (do not store the full graph)
Solution: Community extraction
- extract one community
- semi-supervised method: some community members are known
Solution: Noise cluster model
- simple generative model
- one community surrounded by noise
Introduction, (toy example)
Introduction, (graph clustering)
Introduction, (seeds)
Introduction, (community extraction)
Useless
Useful
Advantages
- seeds give a focus for processing the graph
- better complexity
- the exploration of the full graph can be avoided
- no problem of balance between community sizes
Existing solutions for the community extraction problem
Bagrow et al. [BB05]
- grow a breadth-first tree outward from one seed node;
- stop when the rate of expansion (i.e. the proportion of edges found at the current level which lead to nodes that are still unknown) falls below an arbitrary threshold.
Problems
- can only deal with one seed
- all nodes of a level are included
Clauset [Cla05]
- greedy optimization of a quantity called local modularity $L_{mod}$;
- boundary $B$: the subset of known nodes that have at least one neighbour in the set of still-unknown nodes;
- local modularity: the number of edges between this set and the set of known nodes $C$, over the total number of edges with one extremity in this set.
$$L_{mod} = \frac{\sum_{i \in C,\, j \in B} B_{ij} + \sum_{i \in B,\, j \in C} B_{ij}}{\sum_{i,j} B_{ij}}, \tag{1}$$
with $B_{ij} = 1$ if $i$ and $j$ are connected and either vertex is in $B$, and $B_{ij} = 0$ otherwise.
Problems
- can only deal with one seed
- the stopping criterion needs tuning
Other solutions
- [AL06] random walks and conductance
- [SG10] combinatorial algorithms
Problems
- complexity scales with the graph size
Background on Erdos-Renyi mixture
Graph clustering
Generative setting (Erdos-Renyi mixture, block-model)
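Variables definition:
- $X_{ij}$ are binary variables encoding the presence or absence of a link from node $i$ to node $j$:
$$x_{ij} = \begin{cases} 1, & \text{if there is a link from } i \text{ to } j \\ 0, & \text{otherwise.} \end{cases} \tag{2}$$
- $Z_{jk}$ are dummy variables encoding cluster membership; they take values $z_{jk}$:
$$z_{jk} = \begin{cases} 1, & \text{if } j \text{ belongs to cluster } k \\ 0, & \text{otherwise.} \end{cases} \tag{3}$$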
Erdos-Renyi mixture
Model definition [DPS08]
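$$Z_j = (Z_{jk})_k \overset{i.i.d.}{\sim} \mathcal{M}(1, \gamma), \quad \forall j \in \{1, \dots, N\} \tag{4}$$
$$X_{ij} \mid Z_{ik} \times Z_{jl} = 1 \overset{i.i.d.}{\sim} \mathcal{B}(\pi_{kl}), \quad \forall i, j \in \{1, \dots, N\} \tag{5}$$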
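The two sampling steps (4) and (5) can be sketched in a few lines of Python. This is only an illustration: the function name, the argument layout (`gamma` as a list of cluster proportions, `pi` as a nested list of link probabilities) and the exclusion of self-links are assumptions, not something the slides specify.

```python
import random


def simulate_er_mixture(n, gamma, pi, seed=0):
    """Draw cluster labels and a directed adjacency matrix following (4)-(5)."""
    rng = random.Random(seed)
    n_clusters = len(gamma)
    # Eq. (4): each node draws exactly one cluster with probabilities gamma.
    labels = rng.choices(range(n_clusters), weights=gamma, k=n)
    # Eq. (5): the link i -> j is Bernoulli(pi[k][l]) for clusters k of i, l of j.
    x = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Assumption of this sketch: no self-links.
            if i != j and rng.random() < pi[labels[i]][labels[j]]:
                x[i][j] = 1
    return labels, x


# Three clusters with dense intra-cluster and sparse inter-cluster links.
labels, x = simulate_er_mixture(
    60,
    gamma=[0.5, 0.3, 0.2],
    pi=[[0.5, 0.02, 0.02],
        [0.02, 0.4, 0.02],
        [0.02, 0.02, 0.6]],
)
```

Sorting the rows and columns of `x` by `labels` should reveal the block structure that the simulated adjacency matrix figure illustrates.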
Figure: Adjacency matrix simulated using an Erdos-Renyi mixture
The noise cluster model
Model definition
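$$Z_i \overset{i.i.d.}{\sim} \mathcal{B}(\gamma), \quad \forall i \in \{1, \dots, N\}, \tag{6}$$
$$X_{ij} \mid Z_i \times Z_j = 1 \overset{i.i.d.}{\sim} \mathcal{B}(\alpha), \quad \forall i, j \in \{1, \dots, N\}, \tag{7}$$
$$X_{ij} \mid Z_i \times Z_j = 0 \overset{i.i.d.}{\sim} \mathcal{B}(\beta), \quad \forall i, j \in \{1, \dots, N\}, \tag{8}$$
with $z_i = 1$ if $i$ belongs to the community and $z_i = 0$ otherwise.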
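As a rough illustration of (6)-(8), the model can be simulated as follows. The function name and the choice to skip self-links are assumptions of this sketch; the parameter values in the example call are the ones used in the figure of the membership test.

```python
import random


def simulate_noise_cluster(n, gamma, alpha, beta, seed=0):
    """Draw memberships z and a directed adjacency matrix x following (6)-(8)."""
    rng = random.Random(seed)
    # Eq. (6): each node belongs to the community with probability gamma.
    z = [1 if rng.random() < gamma else 0 for _ in range(n)]
    x = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # assumption of this sketch: no self-links
            # Eq. (7) when both nodes are in the community, Eq. (8) otherwise.
            p = alpha if z[i] * z[j] == 1 else beta
            if rng.random() < p:
                x[i][j] = 1
    return z, x


z, x = simulate_noise_cluster(300, gamma=0.05, alpha=0.1, beta=0.001)
```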
Figure: Adjacency matrix simulated using the noise cluster model.
Basic quantities
- Community size:
$$N_c = \sum_i z_i$$
- Node degrees:
$$d_j^{in} = \sum_{i : z_i = 1} x_{ij}, \qquad d_j^{out} = \sum_{i : z_i = 1} x_{ji}, \qquad d_j = \sum_{i : z_i = 1} (x_{ij} + x_{ji})$$
- Posterior probabilities:
$$p_j^{in} = P(Z_j = 1 \mid X_{ij} = x_{ij}, Z_i = z_i, \forall i \in \{1, \dots, N\}),$$
$$p_j^{out} = P(Z_j = 1 \mid X_{ji} = x_{ji}, Z_i = z_i, \forall i \in \{1, \dots, N\}),$$
$$p_j^{in,out} = P(Z_j = 1 \mid X_{ij} = x_{ij}, X_{ji} = x_{ji}, Z_i = z_i, \forall i \in \{1, \dots, N\}).$$
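The community size and the three degrees only require sums restricted to community members. A minimal sketch, assuming the adjacency matrix is a nested list `x` and the memberships a 0/1 list `z` (both names are illustrative):

```python
def community_size(z):
    """N_c: number of nodes with z_i = 1."""
    return sum(z)


def community_degrees(x, z, j):
    """Return (d_in, d_out, d) of node j, counting only community members i."""
    members = [i for i, z_i in enumerate(z) if z_i == 1]
    d_in = sum(x[i][j] for i in members)   # links from members to j
    d_out = sum(x[j][i] for i in members)  # links from j to members
    return d_in, d_out, d_in + d_out


# Toy graph: nodes 0 and 1 form the community.
x = [[0, 1, 1, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 0],
     [0, 0, 1, 0]]
z = [1, 1, 0, 0]
n_c = community_size(z)
degrees_of_node_2 = community_degrees(x, z, 2)
```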
Simplifications:
The community membership posterior probabilities are the quantities of interest for deciding whether a node must be added to the community. They depend only on:
- the parameters $(\alpha, \beta, \gamma)$;
- the links with community members ($d_j^{in}$, $d_j^{out}$ and $d_j^{in,out} = d_j$ respectively);
- the community size $N_c$.
Example for $p_j^{in}$:
$$p_j^{in} = \frac{\alpha^{d_j^{in}} (1-\alpha)^{N_c - d_j^{in}} \gamma}{\alpha^{d_j^{in}} (1-\alpha)^{N_c - d_j^{in}} \gamma + \beta^{d_j^{in}} (1-\beta)^{N_c - d_j^{in}} (1-\gamma)}$$
Community membership test
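The community membership test is equivalent to thresholding the number of links shared with community members (for $\alpha > \beta$):
$$\{p_j^{in} > s\} \Leftrightarrow \{d_j^{in} > d_{min}\}, \tag{9}$$
with
$$d_{min} = \left\lfloor \frac{\log\left(s \times (1-\beta)^{N_c} \times (1-\gamma)\right) - \log\left((1-s) \times (1-\alpha)^{N_c} \times \gamma\right)}{\log\left(\alpha \times (1-\beta)\right) - \log\left((1-\alpha) \times \beta\right)} \right\rfloor$$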
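The threshold follows from taking logarithms on both sides of $p_j^{in} > s$ and isolating $d_j^{in}$. A small sketch of the resulting formula, with an illustrative function name and assuming $0 < \beta < \alpha < 1$, $0 < \gamma < 1$ and $0 < s < 1$:

```python
import math


def d_min(s, n_c, alpha, beta, gamma):
    """Threshold of test (9): a node passes when its d_in is strictly larger."""
    numerator = (math.log(s) + n_c * math.log(1 - beta) + math.log(1 - gamma)
                 - math.log(1 - s) - n_c * math.log(1 - alpha) - math.log(gamma))
    # Positive only when alpha > beta, which keeps the inequality direction.
    denominator = (math.log(alpha * (1 - beta))
                   - math.log((1 - alpha) * beta))
    return math.floor(numerator / denominator)


# Threshold for the parameter values used in the slides.
threshold = d_min(0.5, 200, 0.1, 0.001, 0.05)
```

Because the threshold depends on the node only through $d_j^{in}$, it can be computed once per community size and reused for every candidate.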
Figure: (top) values of $p_j^{in}$ with respect to $d_j^{in}$ with $\alpha = 0.1$, $\beta = 0.001$, $\gamma = 0.05$ and $N_c = 200$; (bottom) evolution of $d_{min}$ with respect to the community size $N_c$ with $\alpha = 0.1$, $\beta = 0.001$, $\gamma = 0.05$ and $s = 0.5$.
Community extraction using the noise cluster model
Online learning [ZAM08]
Classification likelihood
In the case of a full adjacency matrix, the classification log-likelihood is defined as:
$$\begin{aligned}
L_c(X, Z, \theta) = {} & \sum_i z_i \log(\gamma) + \sum_i (1 - z_i) \log(1 - \gamma) \\
& + \sum_{i,j : i \neq j} z_i z_j x_{ij} \log(\alpha) + \sum_{i,j : i \neq j} z_i z_j (1 - x_{ij}) \log(1 - \alpha) \\
& + \sum_{i,j : i \neq j} (1 - z_i z_j) x_{ij} \log(\beta) + \sum_{i,j : i \neq j} (1 - z_i z_j)(1 - x_{ij}) \log(1 - \beta)
\end{aligned}$$
with $Z = \{z_1, \dots, z_N\}$, $X = \{x_{ij} : i \neq j,\ i, j \in \{1, \dots, N\}\}$, and $\theta = (\gamma, \alpha, \beta)$ the parameter vector.
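The six sums of the classification log-likelihood can be evaluated directly. The following sketch assumes a nested-list adjacency matrix `x`, a 0/1 membership list `z`, and parameters strictly between 0 and 1; the function name is illustrative.

```python
import math


def classification_log_likelihood(x, z, gamma, alpha, beta):
    """L_c(X, Z, theta) for a full adjacency matrix and a known partition."""
    n = len(z)
    # Membership terms: z_i log(gamma) + (1 - z_i) log(1 - gamma).
    ll = sum(z_i * math.log(gamma) + (1 - z_i) * math.log(1 - gamma)
             for z_i in z)
    # Link terms over ordered pairs i != j.
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # alpha governs pairs inside the community, beta all other pairs.
            p = alpha if z[i] * z[j] == 1 else beta
            ll += x[i][j] * math.log(p) + (1 - x[i][j]) * math.log(1 - p)
    return ll


# Two community nodes with a single link 0 -> 1.
ll_toy = classification_log_likelihood([[0, 1], [0, 0]], [1, 1],
                                       gamma=0.5, alpha=0.25, beta=0.1)
```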
Maximisation for known partition
If the partition $Z = \{z_1, \dots, z_N\}$ is known, with a square adjacency matrix of size $N \times N$, the parameter vector maximizing the classification likelihood is given by:
$$\gamma = \frac{N_c}{N}, \tag{10}$$
$$\alpha = \frac{1}{N_c^2} \sum_{i,j=1,\, i \neq j}^{N} (z_i \times z_j)\, x_{ij}, \tag{11}$$
$$\beta = \frac{1}{\bar{N}_c \times (N + N_c)} \sum_{i,j=1,\, i \neq j}^{N} (1 - z_i \times z_j)\, x_{ij}, \tag{12}$$
with $N_c$ the number of nodes belonging to the community and $\bar{N}_c = N - N_c$ the number of nodes that do not belong to the community.
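Equations (10)-(12) amount to counting links inside and outside the community. The sketch below follows the normalisations exactly as written on the slide; note that $N_c^2$ and $\bar{N}_c (N + N_c) = N^2 - N_c^2$ count ordered pairs with the diagonal included, which differs slightly from the number of pairs with $i \neq j$. The function name is illustrative.

```python
def estimate_parameters(x, z):
    """Return (gamma, alpha, beta) from Eqs. (10)-(12) for a known partition."""
    n = len(z)
    n_c = sum(z)
    n_bar = n - n_c
    if not 0 < n_c < n:
        raise ValueError("need at least one member and one non-member")
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    # Links whose two endpoints are both in the community.
    inside = sum(x[i][j] for i, j in pairs if z[i] * z[j] == 1)
    # All remaining links (at least one endpoint outside the community).
    outside = sum(x[i][j] for i, j in pairs if z[i] * z[j] == 0)
    gamma = n_c / n
    alpha = inside / n_c ** 2
    beta = outside / (n_bar * (n + n_c))
    return gamma, alpha, beta


x = [[0, 1, 1, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 0],
     [0, 0, 1, 0]]
gamma_hat, alpha_hat, beta_hat = estimate_parameters(x, [1, 1, 0, 0])
```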
Proposed community extraction procedure
Algorithm
Use a breadth-first algorithm to explore the graph starting from the seeds; for each traversed vertex:
1. use the community membership test (9) to decide whether to add it to the community;
2. update the parameters (using (10), (11) and (12)), taking the current partition into account;
until no more vertices can be added to the community.
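The procedure above could be sketched as follows. This is a simplified illustration under several assumptions that the slides do not state: the whole graph is held in memory as a dict `succ` mapping every node to the set of nodes it links to (the on-line crawler setting avoids this), $N$ is taken as the number of keys of `succ`, only the in-degree test on $d_j^{in}$ is used, parameters are recomputed from (10)-(12) just before each test, estimates are clamped away from 0 and 1, and the loop stops if $\alpha \le \beta$. All names are hypothetical.

```python
import math
from collections import deque

EPS = 1e-9  # keeps estimates strictly inside (0, 1) so the logs stay finite


def clamp(p):
    return min(max(p, EPS), 1.0 - EPS)


def d_min(s, n_c, alpha, beta, gamma):
    """Threshold of test (9); assumes alpha > beta."""
    num = (math.log(s) + n_c * math.log(1 - beta) + math.log(1 - gamma)
           - math.log(1 - s) - n_c * math.log(1 - alpha) - math.log(gamma))
    den = math.log(alpha * (1 - beta)) - math.log((1 - alpha) * beta)
    return math.floor(num / den)


def extract_community(succ, seeds, s=0.5):
    """Greedy breadth-first extraction on a fully known directed graph."""
    n = len(succ)
    total_links = sum(len(targets) for targets in succ.values())
    community = set()
    d_in = {}        # links received from current community members
    inside = 0       # links with both endpoints in the community
    queue = deque()  # candidates to test, in breadth-first order

    def add(v):
        nonlocal inside
        # New inside links: v -> members, plus members -> v.
        inside += sum(1 for w in succ[v] if w in community) + d_in.get(v, 0)
        community.add(v)
        for w in succ[v]:
            d_in[w] = d_in.get(w, 0) + 1
            if w not in community:
                queue.append(w)  # (re)test w now that its count increased

    for seed in dict.fromkeys(seeds):  # drop duplicate seeds, keep order
        add(seed)

    while queue:
        v = queue.popleft()
        if v in community:
            continue
        n_c = len(community)
        # Eqs. (10)-(12) for the current partition.
        gamma = clamp(n_c / n)
        alpha = clamp(inside / n_c ** 2)
        beta = clamp((total_links - inside) / ((n - n_c) * (n + n_c)))
        if alpha <= beta:
            break  # community no denser than the noise: stop (sketch choice)
        if d_in[v] > d_min(s, n_c, alpha, beta, gamma):
            add(v)
    return community


# Toy graph: nodes 0-5 fully connected, nodes 6-29 form a sparse ring.
succ = {v: set() for v in range(30)}
for u in range(6):
    succ[u] = {w for w in range(6) if w != u}
for u in range(6, 30):
    succ[u].add(6 + (u - 5) % 24)  # ring: u -> u + 1, and 29 -> 6
succ[0].add(6)   # one link from the community into the noise
succ[10].add(1)  # one link from the noise into the community
found = extract_community(succ, seeds=[0, 1, 2])
```

Only nodes reached through out-links of accepted members are ever tested, which is what lets the cost scale with the community rather than with the full graph.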
Preliminary experiments : Blogs communities extraction
Settings
- multi-threaded web crawler coupled with the proposed community extraction procedure;
- seed URLs taken from Wikio (http://www.wikio.com), which provides several rankings of blogs for several topics;
- these rankings were used to provide 100 or 50 seeds to the algorithm for 4 test communities.
| Statistic | Comics (Fr) | Scrapbooking (Fr) | Food (U.S.) | Politics (U.S.) |
|---|---|---|---|---|
| Nb seeds | 100 | 100 | 50 | 50 |
| Nc | 1,263 | 1,130 | 1,681 | 1,884 |
| Nb edges | 20,434 | 24,248 | 100,597 | 74,219 |
| α | 0.01821 | 0.01899 | 0.03560 | 0.02091 |
| β | 0.00093 | 0.00147 | 0.00091 | 0.00065 |
| γ | 0.03048 | 0.05579 | 0.03060 | 0.01808 |
| Biggest S.C.C. | 1,251 | 1,129 | 1,667 | 1,877 |
| Max level | 3 | 2 | 5 | 4 |
| Diameter | 6 | 7 | 7 | 8 |
| Radius | 4 | 4 | 4 | 3 |
| Clustering coeff. | 0.287 | 0.265 | 0.381 | 0.320 |
| Transitivity | 0.198 | 0.2 | 0.290 | 0.223 |

Table: Global statistics and model parameters for 4 communities.
| Rank | Name | Level |
|---|---|---|
| 1 | www.bouletcorp.com | 0 |
| 2 | louromano.blogspot.com | 2 |
| 3 | www.cartoonbrew.com | 2 |
| 4 | yacinfields.blogspot.com | 1 |
| 5 | polyminthe.blogspot.com | 1 |
| 6 | marnette.canalblog.com | 1 |
| 7 | blackwingdiaries.blogspot.com | 2 |
| 8 | bastienvives.blogspot.com | 1 |
| 9 | donshank.blogspot.com | 2 |
| 10 | john-nevarez.blogspot.com | 2 |

Table: Best sites according to local PageRank for the Comics (Fr) community.
Figure: Word cloud for Politics (U.S.). The first 50 words in descending order of their Kullback-Leibler divergence are kept (between the word document frequency in the community and in a negative class of 10,000 random blogs; texts were first preprocessed using a stop list and stemming). Word sizes are proportional to the word document frequencies in the community.
Figure: Word cloud for Food (U.S.). The first 50 words in descending order of their Kullback-Leibler divergence are kept (between the word document frequency in the community and in a negative class of 10,000 random blogs; texts were first preprocessed using a stop list and stemming). Word sizes are proportional to the word document frequencies in the community.
Conclusion & future works
Conclusion
- simple, greedy approach;
- complexity scales with the community size, not the graph size;
- blog community extraction was performed successfully using such a tool.
Future works
More work is needed to better understand and evaluate the approach:
- test the robustness of the method to noise in the seed set;
- test on other application domains (with ground truth);
- test using graph generation algorithms.
References
[AL06] R. Andersen and K. Lang. Communities from seed sets. In Proceedings of the 15th International Conference on World Wide Web, pages 223–232. ACM Press, 2006.
[BB05] J.P. Bagrow and E.M. Bollt. A local method for detecting communities. Phys Rev E Stat Nonlin Soft Matter Phys, 72(4):046108, 2005.
[Cla05] A. Clauset. Finding local community structure in networks. Phys Rev E Stat Nonlin Soft Matter Phys, 72(2):026132, 2005.
[DPS08] J. Daudin, F. Picard, and S. Robin. A mixture model for random graph. Statistics and Computing, 18:1–36, 2008.
[SG10] M. Sozio and A. Gionis. The community-search problem and how to plan a successful cocktail party. In Proceedings of the 16th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), pages –, 2010.
[ZAM08] H. Zanghi, C. Ambroise, and V. Miele. Fast online graph clustering via Erdos-Renyi mixture. Pattern Recognition, 41(12):3592–3599, December 2008.
Thanks for your attention!