Collaborative Similarity Measure for Intra-Graph Clustering

The 17th International Conference on Database Systems for Advanced Applications, Busan, South Korea.

The 3rd International Workshop on Social Networks and Social Web Mining

Waqas Nawaz, Young-Koo Lee, Sungyoung Lee

Department of Computer Engineering, Kyung Hee University, Korea

Presenter

Waqas Nawaz

Thursday, April 13, 2023

Data and Knowledge Engineering (DKE) Lab, Kyung Hee University, Korea

Data & Knowledge Engineering Lab

Agenda

Motivation

Related Work

Proposed Method (CSM-IGC)

Experiments

Conclusion & Future Directions


Graphs with Multiple Attributes


Coauthor network of the top 200 authors on TEL from DBLP (source: manyeyes.alphaworks.ibm.com)

Attribute of Authors


Related Work

Structure-based clustering

• Normalized cuts [Shi and Malik, TPAMI 2000]

• Modularity [Newman and Girvan, Phys. Rev. 2004]

• SCAN [Xu et al., KDD'07]

• The clusters generated have a rather random distribution of vertex properties within clusters

OLAP-style graph aggregation

• K-SNAP [Tian et al., SIGMOD'08]

• Attribute-compatible grouping

• The clusters generated have a rather loose intra-cluster structure


Example: A Coauthor Network

Figure: a coauthor network of 11 authors labeled by research topic (r1. XML; r2. XML; r3. XML, Skyline; r4. XML; r5. XML; r6. XML; r7. XML; r8. XML; r9. Skyline; r10. Skyline; r11. Skyline), shown in four panels:

• Traditional Coauthor graph

• Structure-based Cluster

• Attribute-based Cluster

• Structural/Attribute Cluster

*https://wiki.engr.illinois.edu/download/attachments/186384385/VLDB09_notes.ppt


Related Work (cont…)

Structure/Attribute-based clustering

SA-Cluster [Yang Zhou et al., VLDB'09]

• Modifies the structure of the original graph

– Adds a dummy vertex for each attribute instance

– Sparse matrix, space inefficient

• Neighborhood random walk: matrix multiplication is performed iteratively

• Fixed edge weights; attribute weights are updated automatically

Scalability issue for medium and large graphs (time complexity)


Two-Fold Objective

A desirable clustering of an attributed graph should achieve a good balance between the following:

Structural cohesiveness: Vertices within one cluster are close to each other in terms of structure, while vertices between clusters are distant from each other

Attribute homogeneity: Vertices within one cluster have similar attribute values, while vertices between clusters have quite different attribute values

It should also be scalable to medium-scale graphs


Different Graph Clustering Approaches

Structure-based Clustering: vertices with heterogeneous attribute values end up in the same cluster

Attribute-based Clustering: loses much of the structure information

Structural/Attribute Clustering: homogeneous vertices along with structure information, at the expense of time complexity

Intra-Graph Clustering: scalable while considering both aspects


Proposed Solution

System Architecture Diagram: INPUT → Processing Phase → OUTPUT


Phase 1

Similarity Estimation (inspired by the Jaccard Index¹)

Interaction of vertices (topology or structure)

• Weighted fraction of shared neighbors

• It is zero for disconnected vertices

• Example: structural similarity among vertices

– SIM(V1, V2) = (1/3)*5 = 1.667

– SIM(V1, V3) = (1/4)*4 = 1.0

– SIM(V2, V3) = (1/4)*3 = 0.75

– SIM(V1, V4) = (1/4)*0 = 0.0

• Transitive property

– SIM(V1, V4) = SIM(V1, V3) * SIM(V3, V4)


¹ P. Jaccard, Étude comparative de la distribution florale dans une portion des Alpes et des Jura, Société Vaudoise des Sciences Naturelles, Vol. 37 (1901)


Transitive Property

Lemma 1 (Transitivity): Let $P = \{v_{p_1}, v_{p_2}, \ldots, v_{p_{q+1}}\}$ be a path from source vertex $v_a$ to target vertex $v_b$. Then

$$sim(v_a, v_b) = \prod_{i=1}^{q} sim(v_{p_i}, v_{p_{i+1}}) \le sim(v_{p_i}, v_{p_{i+1}})$$

for all the intermediate vertices i = 1, 2, …, q.

Proof: It follows from the fact that each similarity value lies in the interval [0, 1].
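The bound in Lemma 1 can be checked numerically. The sketch below is only an illustration: the function name `path_similarity` and the three step values are made up for this example, and it assumes, as the proof does, that every pairwise similarity lies in [0, 1].

```python
import math
from typing import Sequence


def path_similarity(step_sims: Sequence[float]) -> float:
    """Similarity of the path endpoints as the product of per-edge similarities."""
    # The lemma relies on every factor being in [0, 1]; reject anything else.
    if not all(0.0 <= s <= 1.0 for s in step_sims):
        raise ValueError("Lemma 1 assumes every similarity lies in [0, 1]")
    return math.prod(step_sims)


# Hypothetical similarities along a path with q = 3 edges.
steps = [0.9, 0.5, 0.8]
total = path_similarity(steps)

# The product never exceeds any single factor on the path.
bound_holds = all(total <= s for s in steps)
```

Multiplying by a factor in [0, 1] can only shrink the running product, which is why the end-to-end similarity is bounded by every individual step.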


Phase 1 (cont…)

Similarity Estimation (inspired by the Jaccard Index¹)

Context of vertices (attribute regularity)

• Weighted fraction of shared attribute instances

• It is zero for contextually disjoint vertices

• Example: contextual similarity among vertices

– Let Wa1 = 1 and Wa2 = 2; then

– SIM(V1, V3) = (2/2) = 1.0

– SIM(V3, V4) = (1/2) = 0.5

– SIM(V1, V4) = 0.0


¹ P. Jaccard, Étude comparative de la distribution florale dans une portion des Alpes et des Jura, Société Vaudoise des Sciences Naturelles, Vol. 37 (1901)

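$$SIM(v_a, v_b)_{context}^{weighted} = \frac{\prod_{i=1,\; v_a \wedge v_b \leftarrow a_i}^{M} (w_{a_i})}{\prod_{j=1,\; v_a \vee v_b \leftarrow a_j}^{M} (w_{a_j})}, \quad v_a \leftrightarrow v_b \;\vee\; v_a \cdots v_b$$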
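A minimal sketch of the contextual similarity, reading the formula as a ratio of products: weights of attributes held by both vertices over weights of attributes held by either. The function name and the attribute sets for V1, V3 and V4 are assumptions; the slide's figure did not survive, so the sets were chosen only because they reproduce the example values with Wa1 = 1 and Wa2 = 2. The connectivity condition in the formula is not checked here.

```python
import math
from typing import Dict, Set


def contextual_similarity(attrs_a: Set[str], attrs_b: Set[str],
                          weights: Dict[str, float]) -> float:
    """Product of shared-attribute weights over product of all-attribute weights."""
    shared = attrs_a & attrs_b
    if not shared:
        # Contextually disjoint vertices have zero similarity.
        return 0.0
    union = attrs_a | attrs_b
    numerator = math.prod(weights[a] for a in shared)
    denominator = math.prod(weights[a] for a in union)
    return numerator / denominator


# Weights from the slide's example; the attribute sets are hypothetical.
weights = {"a1": 1.0, "a2": 2.0}
v1, v3, v4 = {"a2"}, {"a1", "a2"}, {"a1"}

sim_13 = contextual_similarity(v1, v3, weights)
sim_34 = contextual_similarity(v3, v4, weights)
sim_14 = contextual_similarity(v1, v4, weights)
```

With these assumed sets the three values match the slide's SIM(V1, V3), SIM(V3, V4) and SIM(V1, V4).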


Collaborative Similarity Measure

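Structural

Contextual

$$SIM(v_a, v_b)_{context}^{weighted} = \frac{\prod_{i=1,\; v_a \wedge v_b \leftarrow a_i}^{M} (w_{a_i})}{\prod_{j=1,\; v_a \vee v_b \leftarrow a_j}^{M} (w_{a_j})}, \quad v_a \leftrightarrow v_b \;\vee\; v_a \cdots v_b$$

Collaborative Measure

$$CSim(v_a, v_b) = \begin{cases} \alpha \cdot SIM(v_a, v_b)_{struct} + (1 - \alpha) \cdot SIM(v_a, v_b)_{context}, & v_a \leftrightarrow v_b \\ \prod_{i=1}^{q} CSim(v_{p_i}, v_{p_{i+1}}), & v_a \cdots v_b,\ v_p \text{ is on the path between } v_a \text{ and } v_b \\ (1 - \alpha) \cdot SIM(v_a, v_b)_{context}, & \text{otherwise} \end{cases}$$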
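The three cases of the collaborative measure can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the function `csim`, its callback parameters (`sim_struct`, `sim_context`, `adjacent`) and all toy values are hypothetical, and the caller is assumed to supply the connecting path, for example a shortest path.

```python
import math
from typing import Callable, Hashable, Optional, Sequence

Sim = Callable[[Hashable, Hashable], float]


def csim(a: Hashable, b: Hashable, alpha: float,
         sim_struct: Sim, sim_context: Sim,
         adjacent: Callable[[Hashable, Hashable], bool],
         path: Optional[Sequence[Hashable]] = None) -> float:
    """Collaborative similarity following the three-case definition."""
    if adjacent(a, b):
        # Case 1: directly connected, blend structure and context.
        return alpha * sim_struct(a, b) + (1 - alpha) * sim_context(a, b)
    if path is not None:
        # Case 2: indirectly connected, multiply CSim along the given path.
        return math.prod(
            csim(u, v, alpha, sim_struct, sim_context, adjacent)
            for u, v in zip(path, path[1:])
        )
    # Case 3: no path, only the contextual part contributes.
    return (1 - alpha) * sim_context(a, b)


# Toy graph 1 - 2 - 3 with an isolated vertex 4 (all values hypothetical).
edges = {frozenset((1, 2)), frozenset((2, 3))}
struct = {frozenset((1, 2)): 0.5, frozenset((2, 3)): 0.4}
context = {frozenset((1, 2)): 1.0, frozenset((2, 3)): 0.5, frozenset((1, 4)): 0.3}


def adjacent(a, b):
    return frozenset((a, b)) in edges


def sim_struct(a, b):
    return struct.get(frozenset((a, b)), 0.0)


def sim_context(a, b):
    return context.get(frozenset((a, b)), 0.0)


direct = csim(1, 2, 0.5, sim_struct, sim_context, adjacent)
via_path = csim(1, 3, 0.5, sim_struct, sim_context, adjacent, path=[1, 2, 3])
isolated = csim(1, 4, 0.5, sim_struct, sim_context, adjacent)
```

Passing the similarity functions as callbacks keeps the combination rule independent of how the structural and contextual parts are computed.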


Phase 2

Clustering (K-Medoid Approach)

1. Randomly choose centroids for K clusters

2. Assign vertices to their nearest centroids

3. Evaluate the quality of each cluster

4. Update the centroids by maximizing SIM distances
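The clustering loop above can be sketched on a precomputed similarity matrix. This is a simplified illustration, not the authors' code: the function `k_medoids` is hypothetical, the initial medoids are passed in rather than drawn at random so the run is repeatable, the quality evaluation of step 3 is left out, and each medoid is pinned to its own cluster (an added assumption that keeps clusters non-empty). The matrix is the CSim table from the example slide.

```python
from typing import List, Sequence, Tuple


def k_medoids(sim: Sequence[Sequence[float]], medoids: List[int],
              max_iter: int = 100) -> Tuple[List[int], List[int]]:
    """Cluster vertices around medoids using a pairwise similarity matrix."""
    n = len(sim)
    medoids = list(medoids)
    labels: List[int] = []
    for _ in range(max_iter):
        # Step 2: assign each vertex to its most similar medoid.
        labels = []
        for v in range(n):
            if v in medoids:
                labels.append(medoids.index(v))  # a medoid stays in its own cluster
            else:
                labels.append(max(range(len(medoids)),
                                  key=lambda c: sim[v][medoids[c]]))
        # Step 4: the new medoid is the member most similar to its cluster.
        new_medoids = []
        for c in range(len(medoids)):
            members = [v for v in range(n) if labels[v] == c]
            new_medoids.append(max(members,
                                   key=lambda m: sum(sim[m][u] for u in members)))
        if new_medoids == medoids:
            break  # converged: medoids did not change
        medoids = new_medoids
    return labels, medoids


# CSim values for V1..V6 as listed in Table 2(a).
S = [
    [1.00, 2.67, 1.17, 0.20, 0.18, 0.18],
    [2.67, 1.00, 0.92, 0.15, 0.14, 0.14],
    [1.17, 0.92, 1.00, 0.17, 0.15, 0.15],
    [0.20, 0.15, 0.17, 1.00, 0.92, 0.92],
    [0.18, 0.14, 0.15, 0.92, 1.00, 2.50],
    [0.18, 0.14, 0.15, 0.92, 2.50, 1.00],
]

# Start from V1 and V4 (indices 0 and 3) as the two initial medoids.
labels, final_medoids = k_medoids(S, [0, 3])
```

For K = 2 and these starting medoids the run groups {V1, V2, V3} and {V4, V5, V6}, the same partition the example table reports for K = 2. For the random choice in step 1, `random.sample(range(len(S)), k)` is one way to draw the starting medoids.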


Algorithm Details

Figure: algorithm flow with two stages, Similarity Calculation (Single Pass) and Node Clustering (Iterative)


Example

Fig. 3. Scenarios for similarity between source (green) and destination (red) nodes via intermediate nodes (yellow): (a) no direct path exists, (b) directly connected, (c) indirectly connected (shortest path)

Table 2. (a) Collaborative similarity CSim(v_a, v_b) among the vertices given in Fig. 3(c), computed with the Collaborative Similarity Measure; (b) clustering results for a varying number of clusters (K), with the quality of each result calculated using Density and Entropy

(a)

vertex  V1    V2    V3    V4    V5    V6
V1      1     2.67  1.17  0.20  0.18  0.18
V2      2.67  1     0.92  0.15  0.14  0.14
V3      1.17  0.92  1     0.17  0.15  0.15
V4      0.20  0.15  0.17  1     0.92  0.92
V5      0.18  0.14  0.15  0.92  1     2.50
V6      0.18  0.14  0.15  0.92  2.50  1

(b)

K  Clustered Vertices             Density  Entropy
2  {V1,V2,V3},{V4,V5,V6}          0.42     0.133
3  {V1,V3},{V2},{V4,V5,V6}        0.28     0.084
4  {V5},{V6},{V4},{V1,V2,V3}      0.21     0.084


Experiments

Real Dataset

• Political Blogs Dataset: 1490 vertices, 19090 edges, one attribute (political leaning)

– Liberal

– Conservative

Methods

• K-SNAP: attributes only

• S-Cluster: structure-based clustering

• W-Cluster: weighted random walk strategy

• SA-Cluster: considers both factors (matrix manipulation)

• IGC-CSM: our proposed method


Evaluation Metrics

Density*: intra-cluster structural cohesiveness

Entropy*: intra-cluster attribute homogeneity


*Yang Zhou et al., Graph Clustering Based on Structural/Attribute Similarities, Proceedings of the VLDB Endowment, France (2009)
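The slide's formulas for the two metrics are not in the extracted text, so the sketch below is an assumption about them. It uses a common formulation: density as the fraction of edges whose endpoints share a cluster, and entropy as the cluster-size-weighted Shannon entropy of a single attribute. The function names and the toy graph are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Hashable, List, Sequence, Set, Tuple


def density(edges: Sequence[Tuple[Hashable, Hashable]],
            clusters: List[Set[Hashable]]) -> float:
    """Fraction of all edges that fall inside a cluster."""
    cluster_of = {v: i for i, members in enumerate(clusters) for v in members}
    inside = sum(1 for u, v in edges if cluster_of[u] == cluster_of[v])
    return inside / len(edges)


def entropy(attr: Dict[Hashable, str], clusters: List[Set[Hashable]]) -> float:
    """Size-weighted Shannon entropy (base 2) of one attribute over clusters."""
    n = sum(len(members) for members in clusters)
    total = 0.0
    for members in clusters:
        counts = Counter(attr[v] for v in members)
        # Entropy of the attribute's value distribution inside this cluster.
        h = -sum((k / len(members)) * math.log2(k / len(members))
                 for k in counts.values())
        total += (len(members) / n) * h
    return total


# Toy graph: a triangle 1-2-3, a bridge 3-4, and an edge 4-5.
edges = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5)]
leaning = {1: "liberal", 2: "liberal", 3: "liberal", 4: "conservative", 5: "liberal"}
clusters = [{1, 2, 3}, {4, 5}]

d = density(edges, clusters)
e = entropy(leaning, clusters)
```

Under this reading, higher density means tighter structure inside clusters and lower entropy means more homogeneous attribute values, so a good clustering pushes the first up and the second down.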


Evaluation Metrics (cont…)

F-Measure*: evaluates the collective quality of the formed clusters

where

and

*Tijn Witsenburg et al., Improving the Accuracy of Similarity Measures by Using Link Information, International Symposium on Methodologies for Intelligent Systems Edition 9, Poland (2011)


Results (Time Complexity)

Synthetic Dataset: varying number of nodes (graph size vs. time)

Real Dataset (Political Blog*): number of clusters vs. time

*http://www-personal.umich.edu/mejn/netdata


Results (Quality)

Density Evaluation: clusters vs. density value

Entropy Evaluation: clusters vs. entropy value


Results (Quality)

F-Measure Estimation: clusters vs. F-measure value


Conclusion

We study the problem of graph node clustering based on homogeneous characteristics in terms of context and topology

A collaborative similarity measure reflects the relational model among pairs of vertices

A k-Medoid clustering framework is adopted for grouping similar nodes

The resulting solution is evaluated using state-of-the-art evaluation measures: Density, Entropy, and F-measure

Comparatively scalable to medium-scale graphs without compromising the quality of results


Thanks

Any Questions…?
