1
Cluster Ranking with an Application to Mining Mailbox Networks
Ziv Bar-Yossef Technion, Google
Ido Guy Technion, IBM
Ronny Lempel IBM
Yoelle Maarek Google
Vova Soroka IBM
2
Clustering
A network: an undirected graph with non-negative edge weights.
w(u,v): “similarity” between u and v.
The weights do not necessarily correspond to a proper metric: the induced distance may violate the triangle inequality.
Examples:
Social networks: w(u,v) = strength of the relationship between u and v.
Biological networks: w(u,v) = genetic similarity between species u and v.
Document networks: w(u,v) = topical similarity between u and v.
Image networks: w(u,v) = color similarity/proximity between u and v.
Clustering: partitioning of the network into regions of similarity.
Communities in social networks.
Species families in biological networks.
Groups of documents on the same topic.
Segments of an image.
3
The cluster abundance problem
Problem:
A clustering algorithm sometimes produces masses of clusters, for example on large networks or under fuzzy/soft clustering.
Needle-in-a-haystack problem: which are the important clusters?
4
Cluster ranking
Goals:
Define a cluster strength measure: assigns a strength score to each subset of nodes.
Design a cluster ranking algorithm: outputs the clusters in the network, ordered by their strength.
5
A simple example
strength(C) = |C|, if C is a clique. strength(C) = 0, if C is not a clique.
Cluster ranking: {a,b,c}, {d,e,f}, {c,g}, {g,f}
[Figure: example network on nodes a, b, c, d, e, f, g]
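The toy measure on this slide can be checked mechanically. The following is a minimal sketch, not the authors' code. The edge list `EDGES` is hypothetical (the figure's edges are not recoverable from the slide) and is chosen so that the cliques match the ranking listed above. The maximality filter is also an assumption, since the slide lists only maximal cliques without saying so.

```python
from itertools import combinations

# Hypothetical edge list, chosen to reproduce the cliques listed on the slide.
EDGES = {("a", "b"), ("a", "c"), ("b", "c"),
         ("d", "e"), ("d", "f"), ("e", "f"),
         ("c", "g"), ("f", "g")}
NODES = sorted({v for edge in EDGES for v in edge})


def adjacent(u, v):
    return (u, v) in EDGES or (v, u) in EDGES


def strength(cluster):
    """|C| if every pair in C is adjacent (C is a clique), else 0."""
    is_clique = all(adjacent(u, v) for u, v in combinations(cluster, 2))
    return len(cluster) if is_clique else 0


def rank_clusters():
    # Keep every subset of at least two nodes that has positive strength.
    scored = [frozenset(c)
              for k in range(2, len(NODES) + 1)
              for c in combinations(NODES, k)
              if strength(c) > 0]
    # Assumed step: drop clusters strictly contained in another scored cluster.
    maximal = [c for c in scored if not any(c < d for d in scored)]
    # Strongest first; ties are broken alphabetically for a stable order.
    return sorted(maximal, key=lambda c: (-len(c), sorted(c)))


ranking = rank_clusters()
for cluster in ranking:
    print(sorted(cluster), strength(cluster))
```

Enumerating all subsets is exponential, so this only illustrates the definition on a seven-node network.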
6
Our contributions
Cluster ranking framework
New cluster strength measure:
Properly captures similarity among cluster members
Applicable to both weighted and unweighted networks
Arbitrary similarity weights
Efficiently computable
Cluster ranking algorithm
Application to mining communities in “personal mailbox networks”
7
Cluster strength measure: Unweighted networks
Which is a stronger cluster?
Cohesion = measure of strength for unweighted clusters.
Cohesive cluster = one that does not “easily” break into pieces.
[Figure: two example clusters, G1 and G2]
8
Edge separators
Edge separator: a subset of the network’s edges whose removal breaks the network into two or more connected components.
All previous work: cohesion(C) = “density” of the “sparsest” edge separator.
Different notions of density for edge separators:
Conductance [KannanVempalaVetta00]
Normalized cut [ShiMalik00]
Relative neighborhoods [FlakeLawrenceGiles00]
Edge betweenness [GirvanNewman02]
Modularity [GirvanNewman04]
9
Edge separators are not good enough
True: sparse edge separator ⇒ noncohesive cluster
False: no sparse edge separator ⇒ cohesive cluster
[Figure: two examples, each built from two cliques of size m; one marks a vertex v, the other marks vertices u and v]
10
Vertex separators
Vertex separator: a subset of the network’s vertices whose removal breaks the network into two or more connected components.
Our strength measure: cohesion(C) = “density” of the “sparsest” vertex separator.
A separator S that splits the remaining vertices into sides A and B is “sparse” if S is small and A, B are “balanced”.
[Figure: separator S between the two sides A and B]
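To make the vertex-separator measure concrete, here is a brute-force sketch for very small graphs. The slides do not state the density formula, so the sketch assumes density(S, A, B) = |S| / (|S| + min(|A|, |B|)), which is small exactly when S is small and the two sides are balanced. It also assumes that a graph with no vertex separator (a clique) has cohesion 1. The function names are illustrative.

```python
from itertools import combinations


def components(nodes, adj):
    """Connected components of the subgraph induced on `nodes`."""
    nodes, seen, comps = set(nodes), set(), []
    for start in nodes:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            v = stack.pop()
            if v in comp:
                continue
            comp.add(v)
            stack.extend((adj[v] & nodes) - comp)
        seen |= comp
        comps.append(comp)
    return comps


def cohesion(adj):
    """Minimum assumed separator density; 1.0 if no vertex separator exists."""
    nodes = list(adj)
    best = 1.0
    for k in range(len(nodes) - 1):
        for sep in combinations(nodes, k):
            comps = components(set(nodes) - set(sep), adj)
            if len(comps) < 2:
                continue  # removing `sep` does not disconnect the graph
            sizes = [len(c) for c in comps]
            total = sum(sizes)
            # Try every split of the components into two non-empty sides A, B.
            for mask in range(1, 2 ** len(sizes) - 1):
                side_a = sum(s for i, s in enumerate(sizes) if mask >> i & 1)
                smaller = min(side_a, total - side_a)
                best = min(best, len(sep) / (len(sep) + smaller))
    return best


# A 4-clique {a,b,c,d} plus a node e attached to c and d.
G = {"a": {"b", "c", "d"}, "b": {"a", "c", "d"},
     "c": {"a", "b", "d", "e"}, "d": {"a", "b", "c", "e"},
     "e": {"c", "d"}}
print(round(cohesion(G), 3))
```

Under the assumed formula, the sparsest separator of `G` is S = {c, d}, with sides {a, b} and {e}.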
11
Vertex separators are better
Sparse edge separator ⇒ sparse vertex separator ⇒ noncohesive cluster
Sparse vertex separator ⇔ noncohesive cluster
[Figure: two examples, each built from two cliques of size m; one marks a vertex v, the other marks vertices u and v]
12
Cluster strength measure: Weighted networks
Which is a stronger cluster?
Cohesion is no longer the sole factor determining cluster strength.
[Figure: G1 with three edges of weight 10; G2 with three edges of weight 1]
13
Thresholding
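Traditional approach for dealing with weighted networks:
transform the weighted network G into an unweighted network G_T by applying a threshold T.
No threshold is suitable.
[Figure: a weighted network G containing G1 and G2, with edge weights 1 and 5, and its thresholded versions G_T for T < 1 and for 1 ≤ T < 5]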
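Thresholding itself is a one-line filter. The sketch below assumes that G_T keeps the edges whose weight is strictly greater than T. That convention is inferred from the deck (T < 1 keeps weight-1 edges, while 1 ≤ T < 5 drops them) and is not stated explicitly. The edge dictionary is a made-up example.

```python
def threshold(weighted_edges, t):
    """Unweighted edge set of G_T: edges whose weight exceeds t (assumed convention)."""
    return {edge for edge, weight in weighted_edges.items() if weight > t}


# Hypothetical network: a strong pair (weight 5) and a weak pair (weight 1).
edges = {("u", "v"): 5, ("x", "y"): 1}

print(threshold(edges, 0.5))  # both edges survive
print(threshold(edges, 1))    # only the weight-5 edge survives
print(threshold(edges, 5))    # nothing survives
```

No single value of `t` treats both pairs fairly, which is the point the slide makes.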
14
Integrated cohesion
Which is a stronger cluster?
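Small T ⇒ G1 is stronger
Large T ⇒ G2 is stronger
Integrated cohesion: the area under the curve of cohesion(G_T) as a function of T.
Strong cluster: sustains high cohesion while the threshold increases.
[Figure: plots of cohesion(G_T) against T for G1 and G2, with the values 1 (G1) and 0.7 (G2) marked]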
15
C-Rank - Cluster Ranking Algorithm
Candidate identification
Ranking by strength score
Elimination of non-maximal clusters
16
Candidate identification: Unweighted networks
Given an unweighted network G:
Find a sparse vertex separator S of G.
The network splits into disconnected components A1,…,Ak.
Clusters = S ∪ A1,…,S ∪ Ak.
Recurse on S ∪ A1,…,S ∪ Ak.
[Figure: separator S surrounded by components A1,…,A5]
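The recursion above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation. In particular, `smallest_separator` is a brute-force stand-in that returns a minimum-size vertex cut, whereas the deck calls for a sparse (small and balanced) separator found via vertex betweenness. Reporting the input network itself as a candidate is also an assumption.

```python
from itertools import combinations


def components(nodes, adj):
    """Connected components of the subgraph induced on `nodes`."""
    nodes, seen, comps = set(nodes), set(), []
    for start in nodes:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            v = stack.pop()
            if v in comp:
                continue
            comp.add(v)
            stack.extend((adj[v] & nodes) - comp)
        seen |= comp
        comps.append(comp)
    return comps


def smallest_separator(nodes, adj):
    """Stand-in for a sparse separator: a minimum-size vertex cut, or None."""
    for k in range(len(nodes) - 1):
        for sep in combinations(sorted(nodes), k):
            if len(components(nodes - set(sep), adj)) > 1:
                return set(sep)
    return None  # cliques and singletons cannot be separated


def candidates(nodes, adj, found=None):
    """Collect the cluster `nodes` and, recursively, every S ∪ A_i below it."""
    found = set() if found is None else found
    cluster = frozenset(nodes)
    if cluster in found:
        return found
    found.add(cluster)
    sep = smallest_separator(set(nodes), adj)
    if sep is not None:
        for comp in components(set(nodes) - sep, adj):
            candidates(comp | sep, adj, found)  # add the separator back
    return found


# A 4-clique {a,b,c,d} plus a node e attached to c and d.
G = {"a": {"b", "c", "d"}, "b": {"a", "c", "d"},
     "c": {"a", "b", "d", "e"}, "d": {"a", "b", "c", "e"},
     "e": {"c", "d"}}
result = candidates(set(G), G)
print(sorted("".join(sorted(c)) for c in result))
```

Each recursive call works on a strictly smaller vertex set, so the recursion terminates; the brute-force separator search is exponential and is only meant for toy inputs.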
17
Candidate identification - Example
Sparse separator: S = {c,d}
Connected components: A1 = {a,b}, A2 = {e}
Add back {c,d} to A1 and A2
[Figure: network on nodes a, b, c, d, e with the components A1 and A2 marked]
18
Candidate identification - Example
Since both S ∪ A1 and S ∪ A2 are cliques, no recursive calls are made.
[Figure: the two resulting clusters, S ∪ A1 = {a,b,c,d} and S ∪ A2 = {c,d,e}]
19
Mailbox networks
Nodes: contacts appearing in the headers of messages in a person’s mailbox, excluding the mailbox owner.
Edges: connect contacts who co-occur in the same message header.
Edge weights: frequency of co-occurrence.
This is an egocentric social network: it reflects the subjective perspective of the mailbox owner.
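The construction described on this slide amounts to counting co-occurrences. A minimal sketch, assuming each message is given as the list of addresses in its header and that the owner is identified by the hypothetical label `"owner"`:

```python
from collections import Counter
from itertools import combinations


def mailbox_network(messages, owner):
    """Edge weights: number of message headers in which two contacts co-occur."""
    weights = Counter()
    for header in messages:
        # The mailbox owner is excluded from the node set.
        contacts = sorted(set(header) - {owner})
        for u, v in combinations(contacts, 2):
            weights[(u, v)] += 1
    return weights


# Two made-up headers: the first involves a, b, c, d; the second c, d, e.
messages = [
    ["a", "b", "c", "d", "owner"],
    ["c", "d", "e", "owner"],
]
network = mailbox_network(messages, "owner")
for edge, weight in sorted(network.items()):
    print(edge, weight)
```

Only the pair (c, d) appears in both headers, so it is the only edge with weight 2.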
20
Mining mailbox networks
Motivation:
Advanced email client features:
Automatic group completion and correction
Automatic group classification (colleagues, friends, spouse, etc.)
Identification of “spam groups” and management of blocked lists
Intelligence & law enforcement: mine mailboxes of suspected terrorists and criminals
Our goal:
Given: a mailbox network G
Output: a ranking of communities in G
21
Ziv Bar-Yossef’s top 10 communities
Rank | Weight | Member IDs | Description
1 | 163 | 1,2 | grad student + co-advisor
2 | 41 | 3-19 | FOCS program committee
3 | 39.2 | 20,21,22,23,24 | old car pool
4 | 28.5 | 20,21,22,23,24,25 | new car pool
5 | 28 | 26,27 | colleagues
6 | 28 | 28,29 | colleagues
7 | 25 | 26,30,31 | colleagues
8 | 19 | 32,33,34 | department committee
9 | 15.9 | 35-53 | jokes forwarding group
10 | 15 | 54-67 | reading group
22
Experiments
Enron Email Dataset (http://www.cs.cmu.edu/~enron/)
Made publicly available during the investigation of the Enron fraud
~150 mailboxes of Enron employees
More than 500,000 messages
Compared with another clustering algorithm:
EB-Rank: an adaptation of the popular edge betweenness algorithm [GirvanNewman02] to our framework
23
Relative recall
[Figure: median recall (y-axis, 0 to 1.2) by decile of networks ordered by size (x-axis, 1 to 10), for Outbox C-Rank, Outbox EB-Rank, Inbox C-Rank, and Inbox EB-Rank]
24
Score comparison
[Figure: median score of the top K communities (y-axis, 0 to 16) against K (x-axis, 5 to 50), for Outbox C-Rank, Outbox EB-Rank, Inbox C-Rank, and Inbox EB-Rank]
25
Conclusions
The cluster ranking problem as a novel framework for clustering
Integrated cohesion as a strength measure for overlapping clusters in weighted networks
C-Rank: a new cluster ranking algorithm
Application: mining mailbox networks
26
Thank You
27
Integrated cohesion
Which is a stronger cluster?
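$$\mathrm{int\_cohesion}(G) = \int_{0}^{\infty} \mathrm{cohesion}(G_T)\, dT$$
Note: to compute the integral, we need G_T only for the values of T that equal the distinct edge weights.
[Figure: plots of cohesion(G_T) against T for G1 and G2, with the values 1 (G1) and 0.7 (G2) marked]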
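Because G_T changes only when T crosses an edge weight, the area under the cohesion curve reduces to a finite sum of rectangles. The sketch below assumes that G_T keeps edges with weight strictly greater than T and that cohesion is 0 once no edges remain. The argument `cohesion_of` is a hypothetical callable that maps an edge set to a cohesion value; the lambda in the example is a stand-in chosen only to make the arithmetic easy to follow, not a real cohesion measure.

```python
def integrated_cohesion(weighted_edges, cohesion_of):
    """Area under T -> cohesion(G_T), with G_T = edges of weight > T."""
    area, lower = 0.0, 0.0
    for upper in sorted(set(weighted_edges.values())):
        # G_T is the same graph for every T in [lower, upper).
        kept = {e for e, w in weighted_edges.items() if w > lower}
        area += cohesion_of(kept) * (upper - lower)
        lower = upper
    return area  # beyond the largest weight no edges remain


# Toy network with three distinct weights.
edges = {("a", "b"): 1, ("b", "c"): 2, ("a", "c"): 4}

# Stand-in "cohesion": the fraction of the three edges that survive.
area = integrated_cohesion(edges, lambda kept: len(kept) / 3)
print(round(area, 3))  # 1*1 + (2/3)*1 + (1/3)*2
```

The loop evaluates the cohesion function once per distinct edge weight, as the note on the slide suggests.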
28
Integrated cohesion - Example
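[Figure: weighted graph G with edge weights 3, 5, 7, 10 and 15, next to a plot of cohesion(G_T) against T. Cohesion = 1 for T below 3, contributing an area of 3.]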
29
Integrated cohesion - Example
[Figure: G after the weight-3 edges are removed, where cohesion = 0.667. The plot of cohesion(G_T) now extends to T = 7, adding an area of 2.333.]
30
Integrated cohesion - Example
[Figure: G with only the edges of weight 10 and 15 remaining, where cohesion = 0.333. The plot of cohesion(G_T) now extends to T = 10, adding an area of 1.]
int_cohesion(G) = 3 + 2.333 + 1 = 6.333
31
Cluster subsumption and maximality
C is maximal iff partitioning any superset of C into clusters leaves C intact.
S = sparsest separator of C; (C1, C2): induced cover of C
S = sparsest separator of D; (D1, D2): induced cover of D
C1 ⊆ D1, C2 ⊆ D2
D subsumes C ⇒ C is not maximal
[Figure: cluster C inside cluster D, with the separator S and the covers (C1, C2) and (D1, D2)]
32
Candidate identification: Weighted networks
Apply a threshold T=0 on G
[Figure: weighted network G on nodes a, b, c, d, e, with edge weights 2 and 5]
33
Candidate identification: Weighted networks
Unweighted candidate identification on G_0
[Figure: the thresholded network G_0 on nodes a, b, c, d, e]
34
Candidate identification: Weighted networks
Recurse on ‘abcd’ and ‘cde’ separately
[Figure: the two clusters ‘abcd’ and ‘cde’]
35
Candidate identification: Weighted networks
Apply threshold T=2 on ‘abcd’
[Figure: cluster ‘abcd’ with edge weights 2 and 5]
36
Candidate identification: Weighted networks
Apply threshold T=2 on ‘abcd’
Recurse on ‘abc’
No recursive call on singleton ‘d’
[Figure: thresholded ‘abcd’, consisting of the cluster ‘abc’ and the singleton ‘d’]
37
Candidate identification: Weighted networks
Apply threshold T=5 on ‘abc’
[Figure: cluster ‘abc’ with all three edge weights equal to 5]
38
Candidate identification: Weighted networks
Apply threshold T=5 on ‘abc’
No recursive call on singletons ‘a’, ‘b’, ‘c’
[Figure: the singletons a, b, c]
39
Candidate identification: Weighted networks
Final candidate list: ‘abcde’, ‘abcd’, ‘abc’, ‘cde’
[Figure: the weighted network G on nodes a, b, c, d, e]
40
Computing sparse vertex separators
Complexity of Sparsest Vertex Separator:
NP-hard
Can be approximated in polynomial time via Semi-Definite Programming [FeigeHajiaghayiLee05]
SDP might be inefficient in practice
We find sparse vertex separators via Vertex Betweenness [Freeman77]:
Efficiently computable via dynamic programming
Works well empirically
In the worst case, the approximation can be weak
42
Normalized Vertex Betweenness (NVB) [Freeman77]
Vertex Betweenness (VB) of a node v: number of shortest paths passing through v.
Ex: ~m² for v, 0 for the other vertices.
Normalized Vertex Betweenness (NVB): divide by $\binom{n-1}{2}$ to get values in [0,1].
NVB(G): maximum NVB value over all nodes.
Theorem: cohesion(G) ≥ 1/(1 + |G| · NVB(G))
In practice: cohesion(G) ≈ 1/(1 + |G| · NVB(G))
[Figure: two cliques of size m joined at a vertex v]
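The betweenness-based estimate can be sketched as follows. Betweenness is computed with Brandes' accumulation scheme, which is a substitution on my part: the deck only says "dynamic programming". When a pair of nodes has several shortest paths, each path contributes a fractional share, which is the usual reading of Freeman's definition. The normalizer is taken to be C(n-1, 2), and `estimated_cohesion` is an illustrative name for the slide's "in practice" approximation.

```python
from collections import deque


def vertex_betweenness(adj):
    """Betweenness of every node in an unweighted, undirected graph (Brandes)."""
    bc = dict.fromkeys(adj, 0.0)
    for s in adj:
        order, preds = [], {v: [] for v in adj}
        sigma = dict.fromkeys(adj, 0)  # number of shortest s-v paths
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:  # breadth-first search from s
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(adj, 0.0)
        for w in reversed(order):  # accumulate dependencies back towards s
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each unordered pair was counted once from each endpoint.
    return {v: b / 2 for v, b in bc.items()}


def nvb(adj):
    """Maximum normalized vertex betweenness over all nodes."""
    n = len(adj)
    norm = (n - 1) * (n - 2) / 2  # C(n-1, 2)
    return max(vertex_betweenness(adj).values()) / norm if norm else 0.0


def estimated_cohesion(adj):
    return 1 / (1 + len(adj) * nvb(adj))


path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
print(nvb(path), estimated_cohesion(path))
```

On the three-node path, the middle node lies on the only shortest path between the two endpoints, so its normalized betweenness is 1.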
43
Candidate identification: Weighted networks
Ideal algorithm:
Iterate over all possible thresholds T
Output all clusters in G_T
Somewhat inefficient
Actual algorithm:
1) Apply threshold T = min weight in G
2) Output clusters of G_T
3) For each clique C in G_T, recurse on C
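The weighted procedure can be sketched by combining thresholding with the unweighted recursion. This follows the worked example in the deck rather than a specification, so several choices are assumptions: the top-level call starts at T = 0, G_T keeps edges with weight greater than T, singletons are not reported, and `smallest_separator` is a brute-force minimum vertex cut standing in for a sparse separator. The edge weights in `W` are one assignment consistent with the example figures.

```python
from itertools import combinations


def components(nodes, adj):
    """Connected components of the subgraph induced on `nodes`."""
    nodes, seen, comps = set(nodes), set(), []
    for start in nodes:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            v = stack.pop()
            if v in comp:
                continue
            comp.add(v)
            stack.extend((adj[v] & nodes) - comp)
        seen |= comp
        comps.append(comp)
    return comps


def smallest_separator(nodes, adj):
    """Stand-in for a sparse separator: a minimum-size vertex cut, or None."""
    for k in range(len(nodes) - 1):
        for sep in combinations(sorted(nodes), k):
            if len(components(nodes - set(sep), adj)) > 1:
                return set(sep)
    return None


def unweighted_candidates(nodes, adj, found=None):
    found = set() if found is None else found
    cluster = frozenset(nodes)
    if cluster not in found:
        found.add(cluster)
        sep = smallest_separator(set(nodes), adj)
        if sep is not None:
            for comp in components(set(nodes) - sep, adj):
                unweighted_candidates(comp | sep, adj, found)
    return found


def weighted_candidates(nodes, weights, t=0, found=None):
    """Candidates of a weighted network; G_T keeps edges with weight > t."""
    found = set() if found is None else found
    nodes = set(nodes)
    adj = {v: set() for v in nodes}
    for (u, v), w in weights.items():
        if u in nodes and v in nodes and w > t:
            adj[u].add(v)
            adj[v].add(u)
    for cluster in unweighted_candidates(nodes, adj):
        if len(cluster) < 2:
            continue  # singletons are not reported
        found.add(cluster)
        if all(v in adj[u] for u, v in combinations(cluster, 2)):
            # Clique in G_T: raise the threshold to its minimum weight.
            inside = [w for (u, v), w in weights.items()
                      if u in cluster and v in cluster]
            weighted_candidates(cluster, weights, min(inside), found)
    return found


W = {("a", "b"): 5, ("a", "c"): 5, ("b", "c"): 5, ("a", "d"): 2,
     ("b", "d"): 2, ("c", "d"): 2, ("c", "e"): 2, ("d", "e"): 2}
final = weighted_candidates("abcde", W)
print(sorted("".join(sorted(c)) for c in final))
```

With these weights the sketch returns the four candidates ‘abcde’, ‘abcd’, ‘abc’ and ‘cde’, matching the final candidate list of the worked example.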
44
C-Rank: Analysis
Theorem:
C-Rank is guaranteed to output all the maximal clusters.
Lemma:
C-Rank runs in time polynomial in its output length.
45
Mailbox networks
Messages: a → b, c, d, and owner; c → d, e, and owner
[Figure: the network after the first message, on nodes a, b, c, d; every weight is 1]
An egocentric social network: reflects the subjective perspective of the mailbox owner.
Nodes: contacts appearing in message headers, excluding the mailbox owner.
Edges: connect contacts who co-occur in the same message header.
Edge weights: frequency of co-occurrence.
46
Mailbox networks
Messages: a → b, c, d, and owner; c → d, e, and owner; b → owner
[Figure: the updated network on nodes a, b, c, d, e; weights are 1 or 2]
47
Mailbox networks
Messages: a → b, c, d, and owner; c → d, e, and owner; b → owner
[Figure: the final network on nodes a, b, c, d, e; weights are 1 or 2]
48
Ido Guy’s top 10 communities
Rank | Weight | Member IDs | Description
1 | 184 | 1,2 | project1 core team
2 | 87 | 3 | spouse
3 | 75 | 4 | advisor
4 | 70.3 | 1,5,6,7 | project2 core team
5 | 62 | 8 | former advisor
6 | 48.2 | 1,2,9,10,11,12 | project1 new team
7 | 46.9 | 13-25 | academic course staff
8 | 46.7 | 1,5,6,7,26-30 | project2 extended team (IBM)
9 | 42.3 | 1,2,9,10,31 | project1 old team
10 | 41.3 | 1,5,6,7,26-30,32-35 | project2 extended team (IBM+Lucent)
49
Estimated precision
[Figure: median precision (y-axis, 0 to 1.2) by decile of networks ordered by size (x-axis, 1 to 10), for Outbox C-Rank, Outbox EB-Rank, Inbox C-Rank, and Inbox EB-Rank]