The 17th International Conference on Database Systems for Advanced Applications, Busan, South Korea. The 3rd International Workshop on Social Networks and Social Web Mining* Collaborative Similarity Measure for Intra-Graph Clustering* Waqas Nawaz, Young-Koo Lee, Sungyoung Lee Department of Computer Engineering, Kyung Hee University, Korea Presenter Waqas Nawaz 6/27/22 Data and Knowledge Engineering (DKE) Lab, Kyung Hee University Korea

Collaborative Similarity Measure for Intra-Graph Clustering


Page 1: Collaborative Similarity Measure for Intra-Graph Clustering

The 17th International Conference on Database Systems for Advanced Applications, Busan, South Korea.

The 3rd International Workshop on Social Networks and Social Web Mining*

Collaborative Similarity Measure for Intra-Graph Clustering*

Waqas Nawaz, Young-Koo Lee, Sungyoung Lee

Department of Computer Engineering, Kyung Hee University, Korea

Presenter

Waqas Nawaz

Thursday, April 13, 2023

Data and Knowledge Engineering (DKE) Lab, Kyung Hee University Korea

Page 2: Collaborative Similarity Measure for Intra-Graph Clustering

Data & Knowledge Engineering Lab

Agenda

Motivation

Related Work

Proposed Method (CSM-IGC)

Experiments

Conclusion & Future Directions

Page 3: Collaborative Similarity Measure for Intra-Graph Clustering

Graphs with Multiple Attributes

[Figure: Coauthor network of the top 200 authors on TEL from DBLP (from manyeyes.alphaworks.ibm.com), annotated with the attributes of the authors]

Page 4: Collaborative Similarity Measure for Intra-Graph Clustering

Related Work

Structure-based clustering
• Normalized cuts [Shi and Malik, TPAMI 2000]
• Modularity [Newman and Girvan, Phys. Rev. 2004]
• SCAN [Xu et al., KDD'07]
The clusters generated have a rather random distribution of vertex properties within clusters.

OLAP-style graph aggregation
• K-SNAP [Tian et al., SIGMOD'08]
• Attribute-compatible grouping
The clusters generated have a rather loose intra-cluster structure.

Page 5: Collaborative Similarity Measure for Intra-Graph Clustering

Example: A Coauthor Network

[Figure: four panels showing the same coauthor network of 11 authors: Traditional Coauthor Graph, Structure-based Cluster, Attribute-based Cluster, Structural/Attribute Cluster. Author topics: r1, r2, r4, r5, r6, r7, r8: XML; r3: XML, Skyline; r9, r10, r11: Skyline.]

*https://wiki.engr.illinois.edu/download/attachments/186384385/VLDB09_notes.ppt

Page 6: Collaborative Similarity Measure for Intra-Graph Clustering

Related Work (cont…)

Structure/attribute-based clustering
SA-Cluster [Yang Zhou et al., VLDB 2009]

• Modifies the structure of the original graph
– Adds a dummy vertex for each attribute instance
– The resulting matrix is sparse and space-inefficient

• Neighborhood random walk: matrix multiplication is performed iteratively

• Edge weights are fixed; attribute weights are updated automatically

Scalability issue for medium and large graphs (time complexity)

Page 7: Collaborative Similarity Measure for Intra-Graph Clustering


Two-Fold Objective

A desired clustering of attributed graph should achieve a good balance between the following:

Structural cohesiveness: Vertices within one cluster are close to each other in terms of structure, while vertices between clusters are distant from each other

Attribute homogeneity: Vertices within one cluster have similar attribute values, while vertices between clusters have quite different attribute values

It should also be scalable to medium-scale graphs

Page 8: Collaborative Similarity Measure for Intra-Graph Clustering

Different Graph Clustering Approaches

Structure-based clustering
– Vertices with heterogeneous attribute values end up in the same cluster

Attribute-based clustering
– Loses much of the structure information

Structural/attribute clustering
– Homogeneous vertices along with structure information, at the expense of time complexity

Intra-graph clustering
– Scalable while considering both aspects

Page 9: Collaborative Similarity Measure for Intra-Graph Clustering

Proposed Solution

[Figure: system architecture diagram with three stages: INPUT, Processing Phase, OUTPUT]

Page 10: Collaborative Similarity Measure for Intra-Graph Clustering

Phase 1

Similarity Estimation (inspired by the Jaccard Index¹)
Interaction of vertices (topology or structure)

• Weighted fraction of shared neighbors
• It is zero for disconnected vertices
• Example: structural similarity among vertices
– SIM(V1, V2) = (1/3)*5 = 1.667
– SIM(V1, V3) = (1/4)*4 = 1.0
– SIM(V2, V3) = (1/4)*3 = 0.75
– SIM(V1, V4) = (1/4)*0 = 0.0
• Transitive property
– SIM(V1, V4) = SIM(V1, V3) * SIM(V3, V4)

¹ P. Jaccard, Étude comparative de la distribution florale dans une portion des Alpes et des Jura, Société Vaudoise des Sciences Naturelles, Vol. 37 (1901)

Page 11: Collaborative Similarity Measure for Intra-Graph Clustering

Transitive Property

Lemma 1 (Transitivity): Let $P = \{v_{p_1}, v_{p_2}, \dots, v_{p_{q+1}}\}$ be a path from the source vertex $v_a = v_{p_1}$ to the target vertex $v_b = v_{p_{q+1}}$. Then

$$sim(v_a, v_b) = \prod_{i=1}^{q} sim(v_{p_i}, v_{p_{i+1}}) \le sim(v_{p_i}, v_{p_{i+1}})$$

for all the intermediate vertices i = 1, 2, …, q.

Proof: It is based on the fact that the similarity value lies in the interval [0, 1], so multiplying by a further factor can never increase the product.
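The bound in Lemma 1 can be checked numerically. The following is a minimal sketch, assuming the per-edge similarities along the path are already known and lie in [0, 1]; the function name `path_similarity` and the sample values are hypothetical and do not come from the slides.

```python
from math import prod


def path_similarity(edge_sims):
    """Similarity of the two end points of a path: product of the per-edge similarities."""
    if not edge_sims:
        raise ValueError("a path needs at least one edge")
    if any(not 0.0 <= s <= 1.0 for s in edge_sims):
        raise ValueError("Lemma 1 assumes every similarity lies in [0, 1]")
    return prod(edge_sims)


# Hypothetical per-edge similarities along a path v_a -> v_p2 -> v_p3 -> v_b.
edge_sims = [0.9, 0.5, 0.8]
sim_ab = path_similarity(edge_sims)

# The product never exceeds any single factor, which is the bound stated in Lemma 1.
assert all(sim_ab <= s for s in edge_sims)
print(round(sim_ab, 2))  # 0.36
```

The check only holds when every factor is at most 1, which is why the sketch rejects values outside [0, 1].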

Page 12: Collaborative Similarity Measure for Intra-Graph Clustering

Phase 1 (cont…)

Similarity Estimation (inspired by the Jaccard Index¹)
Context of vertices (attribute regularity)

• Weighted fraction of shared attribute instances

$$SIM(v_a, v_b)_{context}^{weighted} = \frac{\prod_{i=1,\; v_a \wedge v_b \leftarrow a_i}^{M} (w_{a_i})}{\prod_{j=1,\; v_a \vee v_b \leftarrow a_j}^{M} (w_{a_j})}, \quad v_a \leftrightarrow v_b \;\vee\; v_a \cdots v_b$$

• It is zero for contextually disjoint vertices
• Example: contextual similarity among vertices
– Let W_a1 = 1 and W_a2 = 2; then
– SIM(V1, V3) = (2/2) = 1.0
– SIM(V3, V4) = (1/2) = 0.5
– SIM(V1, V4) = 0.0
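The contextual similarity can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: it reads the formula as the product of the weights of the attributes shared by both vertices divided by the product of the weights of the attributes held by either vertex, and it returns zero for disjoint vertices as the slide states. The attribute assignment (V1 has a2, V3 has a1 and a2, V4 has a1) is hypothetical; it was chosen because it is consistent with the example values on the slide. The connectivity condition of the formula is not modelled here.

```python
from math import prod


def contextual_similarity(attrs_a, attrs_b, weights):
    """Weighted fraction of shared attribute instances of two vertices."""
    shared = attrs_a & attrs_b
    if not shared:
        # Contextually disjoint vertices have zero similarity.
        return 0.0
    union = attrs_a | attrs_b
    return prod(weights[a] for a in shared) / prod(weights[a] for a in union)


# Attribute weights from the slide example.
weights = {"a1": 1, "a2": 2}

# Hypothetical attribute assignment (an assumption, not given in the slides).
attrs = {"V1": {"a2"}, "V3": {"a1", "a2"}, "V4": {"a1"}}

print(contextual_similarity(attrs["V1"], attrs["V3"], weights))  # 1.0
print(contextual_similarity(attrs["V3"], attrs["V4"], weights))  # 0.5
print(contextual_similarity(attrs["V1"], attrs["V4"], weights))  # 0.0
```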

Page 13: Collaborative Similarity Measure for Intra-Graph Clustering

Collaborative Similarity Measure

Structural

Contextual

$$SIM(v_a, v_b)_{context}^{weighted} = \frac{\prod_{i=1,\; v_a \wedge v_b \leftarrow a_i}^{M} (w_{a_i})}{\prod_{j=1,\; v_a \vee v_b \leftarrow a_j}^{M} (w_{a_j})}, \quad v_a \leftrightarrow v_b \;\vee\; v_a \cdots v_b$$

Collaborative Measure

$$CSim(v_a, v_b) = \begin{cases} \alpha \cdot SIM(v_a, v_b)_{struct} + (1-\alpha) \cdot SIM(v_a, v_b)_{context}, & v_a \leftrightarrow v_b \\ \prod_{i=1}^{q} CSim(v_{p_i}, v_{p_{i+1}}), & v_a \cdots v_b,\ v_p \text{ is on the path between } v_a \text{ and } v_b \\ (1-\alpha) \cdot SIM(v_a, v_b)_{context}, & \text{otherwise} \end{cases}$$
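The three cases of the collaborative measure can be sketched as follows. This is a minimal illustration, assuming an unweighted breadth-first search to find the connecting path (the deck's Fig. 3 mentions a shortest path but does not say how it is computed) and assuming that the structural and contextual similarities are supplied as functions. The graph, the similarity values, and all names (`csim`, `shortest_path`, `sim_struct`, `sim_context`) are hypothetical.

```python
from collections import deque
from math import prod


def shortest_path(adj, src, dst):
    """Breadth-first search; returns the vertices of a shortest path, or None."""
    parents = {src: None}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            path = []
            while u is not None:
                path.append(u)
                u = parents[u]
            return path[::-1]
        for v in adj.get(u, ()):
            if v not in parents:
                parents[v] = u
                queue.append(v)
    return None


def csim(adj, a, b, sim_struct, sim_context, alpha=0.5):
    """Collaborative similarity of a and b, following the three cases of CSim."""
    if b in adj.get(a, ()):
        # Case 1: directly connected vertices.
        return alpha * sim_struct(a, b) + (1 - alpha) * sim_context(a, b)
    path = shortest_path(adj, a, b)
    if path is not None and len(path) > 1:
        # Case 2: indirectly connected, multiply CSim along the path edges.
        return prod(csim(adj, u, v, sim_struct, sim_context, alpha)
                    for u, v in zip(path, path[1:]))
    # Case 3: no connecting path, only the contextual part contributes.
    return (1 - alpha) * sim_context(a, b)


# Hypothetical undirected graph and similarity values (assumptions for illustration).
adj = {"V1": {"V3"}, "V3": {"V1", "V4"}, "V4": {"V3"}, "V5": set()}
struct = {frozenset(("V1", "V3")): 1.0, frozenset(("V3", "V4")): 0.5}
context = {frozenset(("V1", "V3")): 1.0, frozenset(("V3", "V4")): 0.5,
           frozenset(("V1", "V5")): 0.4}


def sim_struct(a, b):
    return struct.get(frozenset((a, b)), 0.0)


def sim_context(a, b):
    return context.get(frozenset((a, b)), 0.0)


print(csim(adj, "V1", "V3", sim_struct, sim_context))  # 1.0
print(csim(adj, "V1", "V4", sim_struct, sim_context))  # 0.5
print(csim(adj, "V1", "V5", sim_struct, sim_context))  # 0.2
```

With α = 0.5 both aspects contribute equally; moving α toward 1 favours structure and toward 0 favours attributes.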

Page 14: Collaborative Similarity Measure for Intra-Graph Clustering

Phase 2

Clustering (K-Medoid Approach)

1. Randomly choose centroids for K clusters

2. Assign vertices to their nearest centroids

3. Evaluate the quality of each cluster

4. Update the centroids by maximizing SIM distances
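The four steps can be sketched as a small k-medoid loop over a precomputed similarity matrix. This is a sketch under assumptions rather than the authors' algorithm: "nearest" is taken to mean the medoid with the highest similarity, the update picks the cluster member with the largest total similarity to the other members, and the quality evaluation of step 3 is left out. The function name `k_medoids` and its parameters are hypothetical. The matrix holds the CSim values listed in Table 2(a) of this deck.

```python
import random


def k_medoids(sim, k, init=None, max_iter=100, seed=0):
    """Cluster vertices 0..n-1 given a symmetric similarity matrix."""
    n = len(sim)
    # Step 1: choose the initial medoids (randomly unless given explicitly).
    medoids = list(init) if init is not None else random.Random(seed).sample(range(n), k)
    for _ in range(max_iter):
        # Step 2: assign every vertex to its most similar medoid.
        clusters = {m: [m] for m in medoids}
        for v in range(n):
            if v in clusters:
                continue
            best = max(medoids, key=lambda m: sim[v][m])
            clusters[best].append(v)
        # Step 4: the new medoid is the member with the largest total similarity.
        new_medoids = [max(members, key=lambda c: sum(sim[c][u] for u in members))
                       for members in clusters.values()]
        if set(new_medoids) == set(medoids):
            break
        medoids = new_medoids
    return [sorted(members) for members in clusters.values()]


# CSim values for V1..V6 as listed in Table 2(a).
csim_matrix = [
    [1.00, 2.67, 1.17, 0.20, 0.18, 0.18],
    [2.67, 1.00, 0.92, 0.15, 0.14, 0.14],
    [1.17, 0.92, 1.00, 0.17, 0.15, 0.15],
    [0.20, 0.15, 0.17, 1.00, 0.92, 0.92],
    [0.18, 0.14, 0.15, 0.92, 1.00, 2.50],
    [0.18, 0.14, 0.15, 0.92, 2.50, 1.00],
]

# Start from V1 and V4 as medoids (indices 0 and 3) to keep the run deterministic.
print(k_medoids(csim_matrix, k=2, init=[0, 3]))  # [[0, 1, 2], [3, 4, 5]]
```

For K = 2 the sketch groups {V1, V2, V3} and {V4, V5, V6}, the same grouping that Table 2(b) lists for K = 2.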

Page 15: Collaborative Similarity Measure for Intra-Graph Clustering

Algorithm Details

[Figure: algorithm listing in two parts: Similarity Calculation (single pass) and Node Clustering (iterative)]

Page 16: Collaborative Similarity Measure for Intra-Graph Clustering

Example

[Figure 3: three panels (a), (b), (c)]

Fig. 3. Scenarios for similarity between source (green) and destination (red) nodes via intermediate nodes (yellow): (a) no direct path exists, (b) directly connected, (c) indirectly connected, shortest path

Table 2. (a) Collaborative similarity among the vertices given in Fig. 3-c, computed with the Collaborative Similarity Measure; (b) clustering results for varying numbers of clusters (K), with the quality of each clustering measured using Density and Entropy

(a) CSim(v_a, v_b)

vertex   V1     V2     V3     V4     V5     V6
V1       1      2.67   1.17   0.20   0.18   0.18
V2       2.67   1      0.92   0.15   0.14   0.14
V3       1.17   0.92   1      0.17   0.15   0.15
V4       0.20   0.15   0.17   1      0.92   0.92
V5       0.18   0.14   0.15   0.92   1      2.50
V6       0.18   0.14   0.15   0.92   2.50   1

(b)

K   Clustered Vertices             Density   Entropy
2   {V1,V2,V3},{V4,V5,V6}          0.42      0.133
3   {V1,V3},{V2},{V4,V5,V6}        0.28      0.084
4   {V5},{V6},{V4},{V1,V2,V3}      0.21      0.084

Page 17: Collaborative Similarity Measure for Intra-Graph Clustering

Experiments

Real dataset
Political Blogs dataset: 1490 vertices, 19090 edges, one attribute (political leaning)
• Liberal
• Conservative

Methods
• K-SNAP: attributes only
• S-Cluster: structure-based clustering
• W-Cluster: weighted random walk strategy
• SA-Cluster: considers both factors (matrix manipulation)
• IGC-CSM: our proposed method

Page 18: Collaborative Similarity Measure for Intra-Graph Clustering

Evaluation Metrics

Density*: intra-cluster structural cohesiveness

Entropy*: intra-cluster attribute homogeneity

*Yang Zhou et al., Graph Clustering Based on Structural/Attribute Similarities, Proceedings of the VLDB Endowment, France (2009)
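The two metrics can be sketched for the single-attribute case. The slide gives no formulas, so this sketch assumes the definitions used in the cited work by Zhou et al.: density is the fraction of all edges whose two end points fall in the same cluster, and entropy is the cluster-size-weighted Shannon entropy (base 2) of the attribute values inside each cluster. The function names, the toy graph, and the labels are hypothetical.

```python
from collections import Counter
from math import log2


def density(edges, clusters):
    """Fraction of edges whose two end points lie in the same cluster."""
    cluster_of = {v: i for i, members in enumerate(clusters) for v in members}
    intra = sum(1 for u, v in edges if cluster_of[u] == cluster_of[v])
    return intra / len(edges)


def entropy(labels, clusters):
    """Size-weighted entropy of one attribute over the clusters (lower is more homogeneous)."""
    n = sum(len(members) for members in clusters)
    total = 0.0
    for members in clusters:
        counts = Counter(labels[v] for v in members)
        h = -sum((c / len(members)) * log2(c / len(members)) for c in counts.values())
        total += len(members) / n * h
    return total


# Hypothetical toy graph: two triangles joined by one edge.
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)]
clusters = [[0, 1, 2], [3, 4, 5]]
labels = {0: "liberal", 1: "liberal", 2: "conservative",
          3: "conservative", 4: "conservative", 5: "conservative"}

print(round(density(edges, clusters), 3))   # 0.857
print(round(entropy(labels, clusters), 3))  # 0.459
```

A good clustering in the sense of the two-fold objective has high density and low entropy at the same time.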

Page 19: Collaborative Similarity Measure for Intra-Graph Clustering

Evaluation Metrics (cont…)

F-Measure*: evaluates the collective quality of the formed clusters

[Equations: definition of the F-measure and of its two component terms]

*Tijn Witsenburg et al., Improving the Accuracy of Similarity Measures by Using Link Information, International Symposium on Methodologies for Intelligent Systems, Poland (2011)

Page 20: Collaborative Similarity Measure for Intra-Graph Clustering

Results (Time Complexity)

[Figure: synthetic dataset, varying number of nodes: graph size vs. time]

[Figure: real dataset (Political Blog*): number of clusters vs. time]

*http://www-personal.umich.edu/mejn/netdata

Page 21: Collaborative Similarity Measure for Intra-Graph Clustering

Results (Quality)

[Figure: density evaluation: number of clusters vs. density value]

[Figure: entropy evaluation: number of clusters vs. entropy value]

Page 22: Collaborative Similarity Measure for Intra-Graph Clustering

Results (Quality)

[Figure: F-measure estimation: number of clusters vs. F-measure value]

Page 23: Collaborative Similarity Measure for Intra-Graph Clustering

Conclusion

• We study the problem of clustering graph nodes based on homogeneous characteristics in terms of both context and topology
• A collaborative similarity measure reflects the relational model among pairs of vertices
• A k-Medoid clustering framework is adopted for grouping similar nodes
• The resulting solution is evaluated using state-of-the-art measures: Density, Entropy, and F-measure
• The approach is comparatively scalable to medium-scale graphs without compromising the quality of results

Page 24: Collaborative Similarity Measure for Intra-Graph Clustering

Thanks

Any questions?