Upload
others
View
5
Download
0
Embed Size (px)
Citation preview
CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University
http://cs224w.stanford.edu
2
Network Adjacency matrix
Nodes
Nod
es
Non-overlapping vs. overlapping communities
11/19/2013 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 3
A node belongs to many social “circles’
11/19/2013 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 4
[Palla et al., ‘05]
5
High school Company
Stanford (Squash) Stanford (Basketball)
11/19/2013 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 6
Two nodes belong to the same community if they can be connected through adjacent k-cliques: k-clique: Fully connected
graph on k nodes
Adjacent k-cliques: overlap in k-1 nodes
k-clique community Set of nodes that can
be reached through a sequence of adjacent k-cliques
11/19/2013 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 7
3-clique Adjacent 3-cliques
[Palla et al., ‘05]
Non-adjacent 3-cliques
Two nodes belong to the same community if they can be connected through adjacent k-cliques:
11/19/2013 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 8
4-clique
Adjacent 4-cliques
Communities for k=4
[Palla et al., ‘05]
Non-adjacent 4-cliques
Clique Percolation Method: Find maximal-cliques
(not k-cliques!) Clique overlap graph: Each clique is a node Connect two cliques if they
overlap in at least k-1 nodes Communities: Connected components of
the clique overlap matrix How to set k? Set k so that we get the “richest” (most widely
distributed cluster sizes) community structure
11/19/2013 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 9
A
C D B
A
C D B
Cliques Communities
Set: k=3
11/19/2013 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 10
(1) Graph (2) Clique overlap matrix
(3) Thresholded matrix at 3
(4) Communities (connected components)
Start with graph Find maximal
cliques Create clique
overlap matrix Threshold the
matrix at value k-1 If 𝑎𝑎𝑖𝑖𝑖𝑖 < 𝑘𝑘 − 1 set 0
Communities are the connected components of the thresholded matrix
Cliques
Cliq
ues
Overlap size
11/19/2013 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 11
Communities in a “tiny” part of a phone call network of 4 million users [Palla et al., ‘07]
[Palla et al., ‘07]
No nice way, hard combinatorial problem Maximal clique: Clique that can’t be extended {𝑎𝑎, 𝑏𝑏, 𝑐𝑐} is a clique but not maximal clique {𝑎𝑎, 𝑏𝑏, 𝑐𝑐,𝑑𝑑} is maximal clique
Algorithm: Sketch Start with a seed node Expand the clique around the seed Once the clique cannot be further
expanded we found the maximal clique Note: This will generate the same clique multiple times
11/19/2013 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 13
Start with a seed vertex 𝒂𝒂 Goal: Find the max clique 𝑸𝑸 that 𝒂𝒂 belongs to Observation: If some 𝒙𝒙 belongs to 𝑸𝑸 then it is a neighbor of 𝒂𝒂 Why? If 𝒂𝒂,𝒙𝒙 ∈ 𝑸𝑸 but edge (𝒂𝒂,𝒙𝒙) does not exist, 𝑸𝑸 is not a clique!
Recursive algorithm: 𝑸𝑸 … current clique 𝑹𝑹 … candidate vertices to expand the clique to
Example: Start with 𝒂𝒂 and expand around it
11/19/2013 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 14
Q= {a} {a,b} {a,b,c} bktrack {a,b,d} R= {b,c,d} {b,c,d} {c,d}∩Γ(c)={} {c}∩Γ(d)={} ∩Γ(b)={c,d} Steps of the recursive algorithm Γ(u)…neighbor set of u
Start with a seed vertex 𝒂𝒂 Goal: Find the max clique 𝑸𝑸 that 𝒂𝒂 belongs to Observation: If some 𝒙𝒙 belongs to 𝑸𝑸 then it is a neighbor of 𝒂𝒂 Why? If 𝒂𝒂,𝒙𝒙 ∈ 𝑸𝑸 but edge (𝒂𝒂,𝒙𝒙) does not exist, 𝑸𝑸 is not a clique!
Recursive algorithm: 𝑸𝑸 … current clique 𝑹𝑹 … candidate vertices to expand the clique to
Example: Start with 𝒂𝒂 and expand around it
11/19/2013 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 15
Q= {a} {a,b} {a,b,c} bktrack {a,b,d} R= {b,c,d} {b,c,d} {d}∩Γ(c)={} {c}∩Γ(d)={} ∩Γ(b)={c,d} Steps of the recursive algorithm Γ(u)…neighbor set of u
𝑸𝑸 … current clique 𝑹𝑹 … candidate vertices
Expand(R,Q) while R ≠ {} p = vertex in R
Qp = Q ∪ {p} Rp = R ∩ Γ(p) if Rp ≠ {}: Expand(Rp,Qp) else: output Qp R = R – {p}
11/19/2013 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 16
Start: Expand(V, {}) R={a…f}, Q={} p = {a} Qp = {a} Rp = {b,d} Expand(Rp, Q): R = {b,d}, Q={a} p = {b} Qp = {a,b} Rp = {d} Expand(Rp, Q): R = {d}, Q={a,b} p = {d} Qp = {a,b,d} Rp = {} : output {a,b,d} p = {d} Qp = {a,d} Rp = {b} Expand(Rp, Q): R = {b}, Q={a,d} p = {b} Qp = {a,d} Rp = {} : output {a,d,b}
Announcement: On Thu we will have Lars Backstrom from Facebook Data Science team give a guest lecture!
How should we think about large scale organization of clusters in networks? Finding: There is a lack of (traditional) clustering
11/19/2013 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 20 Nested Core-Periphery
How do we reconcile these two views? (and still do community detection)
11/19/2013 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 21
vs.
How community like is a set of nodes? A good cluster S has Many edges internally Few edges pointing outside
Simplest objective function: Conductance Small conductance corresponds to good clusters (Note |S| < |V|/2)
22
S
S’
∑∈
∉∈∈=
Sssd
SjSiEjiS |},;),{(|)(φ
11/19/2013 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu
Define: Network community profile (NCP) plot
Plot the score of best community of size k
11/19/2013 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 23
Community size, log k
log Φ(k)
k=5 k=7
[WWW ‘08]
k=10
(Note |S| < |V|/2)
11/19/2013 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 24
• Run the favorite clustering method • Each dot represents a cluster • For each size find “best” cluster
Cluster size, log k
Clu
ster
scor
e, lo
g Φ
(k)
Spectral Graclus
Metis
Meshes, grids, dense random graphs:
Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 25
d-dimensional meshes California road network
11/19/2013
[WWW ‘08]
Collaborations between scientists in networks [Newman, 2005]
26
Community size, log k
Cond
ucta
nce,
log Φ
(k)
Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 11/19/2013
[WWW ‘08]
Natural hypothesis about NCP: NCP of real networks slopes
downward Slope of the NCP corresponds
to the “dimensionality“ of the network
11/19/2013 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 27
What about large networks?
[Internet Mathematics ‘09]
Typical example: General Relativity collaborations (n=4,158, m=13,422)
11/19/2013 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 28
[Internet Mathematics ‘09]
11/19/2013 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 29
[Internet Mathematics ‘09]
Φ(k
), (s
core
)
k, (cluster size) 30
Better and better clusters
Clusters get worse and worse
Best cluster has ~100 nodes
11/19/2013 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu
As clusters grow the number of edges inside grows slower that the number crossing
31
Φ=2/10 = 0.2
Each node has twice as many children
Φ=1/7=0.14
Φ=8/20 = 0.4
Φ=64/92 = 0.69
11/19/2013 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu
Empirically we note that best clusters are barely connected to the network
32
NCP plot
⇒ Core-periphery structure 11/19/2013 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu
33
Nothing happens! ⇒ Nestedness of the
core-periphery structure 11/19/2013 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu
Nested Core-Periphery (jellyfish, octopus)
Whiskers are responsible for
good communities
Denser and denser core
of the network
Core contains 60% node and
80% edges
34 11/19/2013 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu
35
vs.
How do we reconcile these two views?
11/19/2013 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu
Many methods for overlapping communities Clique percolation [Palla et al. ‘05] Link clustering [Ahn et al. ‘10] [Evans et al.‘09] Clique expansion [Lee et al. ‘10] Mixed membership stochastic block models
[Airoldi et al. ’08] Bayesian matrix factorization [Psorakis et al. ‘11]
What do these methods assume about community overlaps?
36
Many overlapping community detection methods make an implicit assumption:
Edge probability decreases with the number of shared communities
37
Network Adjacency matrix
Nodes
Nod
es
Is this true?
Basic question: nodes u, v share k communities What’s the edge probability?
38
LiveJournal social network
Amazon product network
Edge density in the overlaps is higher!
39
Communities as “tiles”
“The more different foci (communities) that two individuals share, the more likely it is that they will be tied” - S. Feld, 1981
40 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 11/19/2013
Communities as overlapping tiles Web of affiliations [Simmel ‘64]
41
What does this mean?
Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 11/19/2013
Ganovetter and all non-overlapping
methods
Palla et al., MMSB and other
overlapping methods as well
Many methods fail to detect dense overlaps: Clique percolation, …
42 Clique percolation
Generative model: How is a network generated from community affiliations?
Model parameters: Nodes V, Communities C, Memberships M Each community c has a single probability pc
43
Communities, C
Nodes, V Community Affiliation Network
Model pA pB
Memberships, M
Given parameters (V, C, M, {pc}) Nodes in community c connect to each other by
flipping a coin with probability pc
Nodes that belong to multiple communities have multiple coin flips: Dense community overlaps
44
Communities, C
Nodes, V
Community Affiliation Network
Model pA pB
Memberships, M
45
Model
Network
AGM is flexible and can express variety of network structures: Non-overlapping, Nested, Overlapping
11/19/2013 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 46
47 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 11/19/2013
48 LiveJournal social network
Primary core: Foodwebs, Web-graph, Social Secondary cores: PPI, Products
49
Foodweb PPI