http://linc.ucy.ac.cy Andreas Papadopoulos - andpapad@cs.ucy.ac.cy [DEXA 2015] Clustering Attributed Multi-graphs with Information Ranking. 26th International Conference on Database and Expert Systems Applications, Sep. 1-4, 2015, Valencia, Spain. Andreas Papadopoulos, Dimitrios Rafailidis, George Pallis, Marios D. Dikaiakos


Clustering Attributed Multi-graphs with Information Ranking

26th International Conference on Database and Expert Systems Applications

Sep. 1-4, 2015, Valencia, Spain

Andreas Papadopoulos, Dimitrios Rafailidis, George Pallis, Marios D. Dikaiakos


The Real World: Information Networks

[Figure: an information network whose vertices are connected by two edge types, Friendship and Coauthor]


Challenges

• Identify the importance of each edge-type/attribute property
  • For instance, when clustering a bibliography network:
  • Attribute ‘area of interest’ is important
  • Attributes ‘name’ and ‘gender’ may introduce noise and reduce the clustering accuracy

• Combine the attribute and structural vertex properties
  • Edges and attributes are of different types


Related Work

• Limited attention to the different importance of attributes/edge-types
  • Weights are mainly updated at each iteration

• Ignore the existence of multiple edge-types
  • Increases computational cost and complexity

• Spectral clustering is not used for clustering attributed graphs
  • Used to identify dense clusters in attribute subspaces

Model-Based
• BAGC [SIGMOD ‘12, TKDD ‘14]
• CESNA [ICDM ‘13]

Distance-Based
• SACluster [VLDB ‘09, TKDD ‘11]
• PICS [SDM ‘12]
• HASCOP [WI ‘13]


Proposed Approach: CAMIR

• Clustering Attributed Multi-graphs with Information Ranking: CAMIR

1. Rank edge-type and attribute properties

2. Construct a unified similarity matrix

3. Adopt a spectral clustering technique to generate the final clusters


Presentation Outline
• Motivation
• Problem Definition
• Related Work
• Background
• Proposed Approach: CAMIR
• Evaluation
• Summary


Background: Graph Partitioning

• An edge represents the similarity of the two connected vertices

• Find the minimum cut of a graph
  • Minimizes inter-cluster similarities
  • Identifies an optimal partitioning of the graph

• Identifying a minimum cut is computationally difficult
  • Efficient approximations using linear algebra
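To make the partitioning objective concrete, the cut of a partition can be computed as the total similarity on edges whose endpoints fall in different clusters. The sketch below assumes a symmetric similarity matrix `S` and one cluster label per vertex; the function name `cut_weight` and the four-vertex path graph are illustrative choices, not taken from the slides.

```python
import numpy as np

def cut_weight(S, labels):
    """Total similarity between vertices placed in different clusters."""
    S = np.asarray(S, dtype=float)
    labels = np.asarray(labels)
    # Boolean mask of vertex pairs that are separated by the partition.
    different = labels[:, None] != labels[None, :]
    # S is symmetric, so every separated pair is counted twice.
    return float(S[different].sum() / 2.0)

# Hypothetical path graph 0 - 1 - 2 - 3 with a weak middle edge.
S = np.zeros((4, 4))
S[0, 1] = S[1, 0] = 1.0
S[1, 2] = S[2, 1] = 0.2
S[2, 3] = S[3, 2] = 1.0

balanced_cut = cut_weight(S, [0, 0, 1, 1])   # separates only the weak edge
lopsided_cut = cut_weight(S, [0, 1, 1, 1])   # separates a strong edge
```

Comparing the two values shows why a partition that removes little similarity is preferred. Enumerating all partitions in this way does not scale, which is where the linear-algebra approximations come in.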


Background: Spectral Clustering

• Based on the graph Laplacian, or Laplacian matrix

• Given a similarity matrix, the normalized symmetric Laplacian L is defined as [equation missing from the transcript]

• The eigenvectors corresponding to the top k eigenvalues are the projection of the graph into $\mathbb{R}^{|V| \times k}$
  • Data is then easily separable into clusters, e.g. using k-means
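The following is a minimal sketch of the projection step, assuming the common definition L = I - D^{-1/2} S D^{-1/2}, where D is the diagonal degree matrix of the similarity matrix S. The slide's own formula is not preserved, so this choice is an assumption. With this sign convention the eigenvectors of interest belong to the k smallest eigenvalues of L, which are the k largest eigenvalues of D^{-1/2} S D^{-1/2}. The function names and the two-triangle graph are hypothetical.

```python
import numpy as np

def normalized_laplacian(S):
    """Return I - D^{-1/2} S D^{-1/2} for a symmetric, non-negative S."""
    S = np.asarray(S, dtype=float)
    degrees = S.sum(axis=1)
    # Isolated vertices (degree 0) get a zero scaling factor instead of 1/0.
    safe = np.where(degrees > 0, degrees, 1.0)
    inv_sqrt = np.where(degrees > 0, 1.0 / np.sqrt(safe), 0.0)
    return np.eye(len(S)) - inv_sqrt[:, None] * S * inv_sqrt[None, :]

def spectral_embedding(S, k):
    """Project the |V| vertices into R^{|V| x k}."""
    eigenvalues, eigenvectors = np.linalg.eigh(normalized_laplacian(S))
    # eigh returns eigenvalues in ascending order, so keep the first k columns.
    return eigenvalues[:k], eigenvectors[:, :k]

# Hypothetical graph: two disconnected triangles (vertices 0-2 and 3-5).
S = np.zeros((6, 6))
for block in (range(0, 3), range(3, 6)):
    for i in block:
        for j in block:
            if i != j:
                S[i, j] = 1.0

L = normalized_laplacian(S)
values, U = spectral_embedding(S, 2)
```

For two disconnected components the two smallest eigenvalues are both zero, which is the idealised case in which the embedded vertices separate cleanly.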


Background: Spectral Clustering

[Figure: example graph with 10 vertices, shown with its adjacency matrix and its normalized Laplacian matrix; the matrix layouts are not recoverable from the transcript]

Top 3 eigenvectors

Vertex   U1       U2       U3
1        -0.659   -0.705    0.263
2        -0.620    0.747    0.241
3        -0.595   -0.486   -0.640
4        -0.668    0.711   -0.221
5        -0.723    0.395    0.566
6        -0.669    0.414   -0.617
7        -0.332   -0.486   -0.808
8        -0.668    0.711   -0.221
9        -0.379   -0.491    0.784
10       -0.659   -0.705    0.263


How do we define the similarity matrix for an attributed multi-graph?


Background: Similarity Matrices

[Figure: each vertex property yields its own similarity matrix in $[0,1]^{N \times N}$: vertex labels such as IR, DM and AI, numeric vertex values passed through a Gaussian kernel, and the edges of each edge type]

• Result: #Edge types + #Attributes symmetric, non-negative similarity matrices

How do we efficiently combine the similarity matrices?
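One way to obtain such per-property matrices is sketched below. It assumes exact-match similarity for label attributes, a Gaussian kernel with a hypothetical bandwidth `sigma` for numeric attributes, and a 0/1 adjacency matrix for each edge type. The slides name only the Gaussian kernel, so the other two choices, the function names and the toy values are assumptions.

```python
import numpy as np

def categorical_similarity(labels):
    """1.0 where two vertices share the same label, 0.0 otherwise."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(float)

def gaussian_similarity(values, sigma=1.0):
    """Gaussian kernel exp(-(x_i - x_j)^2 / (2 sigma^2)) on a numeric attribute."""
    values = np.asarray(values, dtype=float)
    diff = values[:, None] - values[None, :]
    return np.exp(-(diff ** 2) / (2.0 * sigma ** 2))

def edge_similarity(n, edges):
    """Symmetric 0/1 matrix for one edge type of an undirected multi-graph."""
    S = np.zeros((n, n))
    for u, v in edges:
        S[u, v] = S[v, u] = 1.0
    return S

# Hypothetical four-vertex network with two attributes and two edge types.
matrices = {
    "area": categorical_similarity(["IR", "DM", "DM", "AI"]),
    "papers": gaussian_similarity([5, 1, 7, 12], sigma=5.0),
    "coauthor": edge_similarity(4, [(0, 1), (1, 2)]),
    "friendship": edge_similarity(4, [(2, 3)]),
}
```

Every matrix in `matrices` is symmetric with entries in [0, 1], so the four of them can be combined entry by entry.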


CAMIR Overview

1. Rank vertex properties and calculate their weights accordingly
   • By considering the agreement among vertex properties

2. Compute a unified similarity matrix
   • By combining all vertex properties based on their ranking

3. Generate the final clusters
   • By adopting a spectral clustering approach


Presentation Outline
• Motivation
• Problem Definition
• Related Work
• Background
• Proposed Approach: CAMIR
  1. Information Ranking
  2. Unified Similarity Matrix
  3. Generate the final clusters
• Evaluation
• Summary


Information Ranking

Rank attribute and edge-type properties: iteratively select the most informative property from the set of unranked properties.

• Most informative property [NIPS ’11]:
  • Has the highest ‘agreement’ with the other properties
  • Properties ‘agree’ when they assign vertices the same cluster labels when used individually


• From the set of properties, the most informative property p is selected [NIPS ‘11]

• The highest rank (the number of properties) is assigned to the most informative property
  • i.e. the one that best separates the vertices

• The lowest rank (1.0) is assigned to the property that is selected last
  • i.e. the one that does not ‘agree’ with the rest of the properties
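The ranking loop can be sketched as follows. The slides do not give the agreement formula, so this sketch makes two assumptions: each property has already been clustered on its own, and agreement between two properties is the fraction of vertex pairs that both labelings treat the same way (the Rand index). The function names, the tie-breaking by property name and the toy labelings are hypothetical.

```python
import numpy as np

def rand_agreement(a, b):
    """Fraction of vertex pairs on which two labelings agree (Rand index)."""
    a, b = np.asarray(a), np.asarray(b)
    same_a = a[:, None] == a[None, :]
    same_b = b[:, None] == b[None, :]
    upper = np.triu_indices(len(a), k=1)  # each unordered pair once
    return float(np.mean(same_a[upper] == same_b[upper]))

def rank_properties(labelings):
    """Map property name -> rank; the most informative property gets rank |P|."""
    names = list(labelings)
    # Assumed score: total agreement with all the other properties.
    score = {
        p: sum(rand_agreement(labelings[p], labelings[q]) for q in names if q != p)
        for p in names
    }
    unranked = set(names)
    ranks = {}
    next_rank = float(len(names))
    while unranked:
        # Select the most informative unranked property (ties broken by name).
        best = max(sorted(unranked), key=lambda p: score[p])
        ranks[best] = next_rank
        unranked.remove(best)
        next_rank -= 1.0
    return ranks

# Hypothetical labelings from clustering each property individually.
labelings = {
    "coauthor": [0, 0, 0, 1, 1, 1],
    "area": [0, 0, 0, 1, 1, 1],
    "gender": [0, 1, 0, 1, 0, 1],
}
ranks = rank_properties(labelings)
```

In this toy input `gender` disagrees with the other two properties, so it is selected last and receives the lowest rank of 1.0.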


Unified Similarity Matrix

• Combines the multiple edge-type and attribute properties with respect to the identified ranking

• Defined as the weighted sum of the individual similarity matrices

• Weights are defined by normalizing the rankings

• Contains all the similarity information about the network under study
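A minimal sketch of the weighted sum, assuming that "normalizing the rankings" means dividing each rank by the sum of all ranks so that the weights add up to one. The slides do not spell out the normalization, and the function name, property names and matrices below are illustrative.

```python
import numpy as np

def unified_similarity(similarities, ranks):
    """Weighted sum of per-property similarity matrices, weights from ranks."""
    names = list(similarities)
    total = float(sum(ranks[p] for p in names))
    # Assumed normalization: weight = rank / sum of ranks.
    weights = {p: ranks[p] / total for p in names}
    unified = sum(weights[p] * np.asarray(similarities[p], dtype=float) for p in names)
    return unified, weights

# Hypothetical three-vertex network with three ranked properties.
similarities = {
    "area": np.array([[1, 1, 0], [1, 1, 0], [0, 0, 1]]),
    "coauthor": np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]]),
    "gender": np.array([[1, 0, 1], [0, 1, 0], [1, 0, 1]]),
}
ranks = {"area": 3.0, "coauthor": 2.0, "gender": 1.0}
unified, weights = unified_similarity(similarities, ranks)
```

Because the weights sum to one and each input matrix lies in [0, 1], the unified matrix stays symmetric with entries in [0, 1], which keeps it a valid input for the Laplacian.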


Generating the Final Clusters

• Calculate the normalized Laplacian of the Unified Similarity Matrix

• Perform eigendecomposition

• Apply k-means to the eigenspace of the top k eigenvectors

• Generate the final clusters
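These four steps can be sketched end to end with NumPy alone. The sketch assumes L = I - D^{-1/2} S D^{-1/2}, so the eigenvectors used are those of the k smallest eigenvalues, and it rescales each embedded row to unit length, as the rows of the example eigenvector table in the background section appear to be (e.g. 0.659² + 0.705² + 0.263² ≈ 1). The plain Lloyd's k-means with farthest-point initialisation is a stand-in chosen to keep the example deterministic, not necessarily the variant used by the authors.

```python
import numpy as np

def spectral_clusters(S, k, n_iter=100):
    """Cluster vertices from a symmetric, non-negative similarity matrix S."""
    S = np.asarray(S, dtype=float)
    degrees = S.sum(axis=1)
    safe = np.where(degrees > 0, degrees, 1.0)
    inv_sqrt = np.where(degrees > 0, 1.0 / np.sqrt(safe), 0.0)
    laplacian = np.eye(len(S)) - inv_sqrt[:, None] * S * inv_sqrt[None, :]

    # Eigendecomposition; eigh sorts eigenvalues in ascending order.
    _, vectors = np.linalg.eigh(laplacian)
    U = vectors[:, :k]
    norms = np.linalg.norm(U, axis=1, keepdims=True)
    U = U / np.where(norms > 0, norms, 1.0)  # unit-length rows

    # Farthest-point initialisation: start at row 0, then add the farthest row.
    centers = [U[0]]
    while len(centers) < k:
        dist = np.min([((U - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(U[int(dist.argmax())])
    centers = np.array(centers)

    # Lloyd's k-means on the embedded rows.
    for _ in range(n_iter):
        dist = ((U[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        updated = np.array([
            U[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(k)
        ])
        if np.allclose(updated, centers):
            break
        centers = updated
    return labels

# Hypothetical unified similarity matrix: two dense blocks, weakly linked.
S = np.full((6, 6), 0.01)
S[:3, :3] = 1.0
S[3:, 3:] = 1.0
np.fill_diagonal(S, 0.0)
labels = spectral_clusters(S, 2)
```

On this input the first three vertices end up in one cluster and the last three in the other.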


CAMIR Clustering Process Diagram

[Diagram: Properties ranking → Unified Similarity Matrix → Generate the final clusters (Cluster 1, Cluster 2, …, Cluster k)]

Step 1. Identify importance of vertex properties: iteratively select the most informative property

Step 2. Efficiently combine vertex properties: normalize rankings and compute the Unified Similarity Matrix

Step 3. Cluster the attributed multi-graph: apply spectral clustering


Evaluation - Datasets

• Real-World Datasets
  • DBLP: Bibliography Networks
  • GoogleSP-23: Google Software Packages
• Synthetic Datasets

Dataset                   DBLP-1K   DBLP-10K   GoogleSP-23   Synthetic (varying size)           Synthetic (varying attributes)
Nodes                     1 000     10 000     1 297         {100, 500, 1 000, 5 000, 10 000}   1 000
Edges                     17 128    65 734     268 956       {1 000 – 1 230 000}                ~ 40 000
Attributes                2         2          5             4                                  {2, 4, 8, 16, 32}
Edge Types                1         1          2             1                                  1
Total Vertex Properties   3         3          7             5                                  {3, 5, 9, 17, 33}


Evaluation Measures

• Entropy
  • Low entropy corresponds to high attribute homogeneity

• Normalized Mutual Information (NMI)
  • High NMI corresponds to high similarity between the resulting clustering and the ground truth
  • An NMI of 1 indicates a perfect match

• Runtime
  • Quad-core i7 2.8 GHz, 8 GB RAM
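Both quality measures can be sketched in a few lines. The slides do not state the exact formulas, so the sketch assumes a size-weighted average of the within-cluster entropy of a single attribute, and the NMI variant normalised by the arithmetic mean of the two label entropies. The function names and toy labels are hypothetical.

```python
from collections import Counter
from math import log2

def entropy_of(labels):
    """Shannon entropy (in bits) of a list of labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def clustering_entropy(clusters, attribute):
    """Size-weighted average entropy of one attribute inside each cluster."""
    n = len(clusters)
    total = 0.0
    for c in set(clusters):
        members = [a for a, k in zip(attribute, clusters) if k == c]
        total += len(members) / n * entropy_of(members)
    return total

def nmi(u, v):
    """Normalized mutual information: 2 I(U;V) / (H(U) + H(V))."""
    n = len(u)
    joint, cu, cv = Counter(zip(u, v)), Counter(u), Counter(v)
    mi = sum((c / n) * log2(c * n / (cu[a] * cv[b])) for (a, b), c in joint.items())
    hu, hv = entropy_of(u), entropy_of(v)
    # Two single-cluster labelings carry no information; treat them as matching.
    return 1.0 if hu + hv == 0 else 2.0 * mi / (hu + hv)

# Hypothetical clustering of four vertices and their 'area' attribute.
clusters = [0, 0, 1, 1]
pure = clustering_entropy(clusters, ["AI", "AI", "DM", "DM"])
mixed = clustering_entropy(clusters, ["AI", "DM", "AI", "DM"])
perfect = nmi(clusters, [1, 1, 0, 0])       # same partition, renamed labels
unrelated = nmi(clusters, [0, 1, 0, 1])
```

The renamed partition still scores an NMI of 1, which illustrates that the measure compares groupings rather than label names.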


State-of-the-Art Competitors

• SACluster [VLDB 2009]
  • Similarity is defined as the random walk distance in the augmented graph

• BAGC [SIGMOD 2012]
  • Uses Bayesian inference to update the parameters of the cluster distributions

• PICS [SDM 2012]
  • Compresses adjacency and attribute matrices

• HASCOP [WI 2013]
  • Heuristic distance-based
  • Applies to attributed multi-graphs


Evaluation - Synthetic Datasets

• CAMIR entropy is always less than 0.5
  • High attribute homogeneity

• CAMIR NMI is at least 0.8 on all experiments
  • High quality results

• Similar behavior as we increase the number of attributes


Evaluation - Synthetic Datasets

• CAMIR is the 2nd fastest algorithm
  • Less than 10 secs for up to 5 000 vertices

• CAMIR on average outperforms almost all its competitors


Evaluation - Real-world Datasets

[Figure panels: DBLP-1K, DBLP-10K]

• CAMIR achieves the lowest entropy among its competitors
  • Efficiently ranks and combines vertex properties

• Identifies clusters of arbitrary shapes and sizes (spectral clustering)


Evaluation - Real-world Datasets

[Figure panels: GoogleSP-23 (two plots)]

• CAMIR achieves low entropy

• CAMIR achieves high NMI
  • Identifies a high percentage of software packages


Evaluation – Runtime and Entropy

Algorithm    DBLP-1K                    DBLP-10K                   GoogleSP-23
             Runtime (secs)   Entropy   Runtime (secs)   Entropy   Runtime (secs)   Entropy
CAMIR        1.20             0.299     520.48           0.255     5.98             0.387
BAGC         0.15             1.448     0.35             1.649     0.81             1.573
SACluster    3.22             0.729     433.228          1.066     30.57            1.513
PICS         4.87             1.280     495.17           1.877     476.49           2.178
HASCOP       882.17           0.838     32957            1.306     4675             0.061

• CAMIR requires:
  • Less than 6 secs for ~1 000 vertices
  • About 8.7 minutes (520.48 secs) for 10 000 vertices

• CAMIR achieves on average 55% time and 60% entropy improvement

• BAGC is the fastest method, but achieved limited clustering quality
• HASCOP achieved slightly better results than CAMIR, but it is the slowest method


Summary

• A new approach for Clustering Attributed Multi-graphs with Information Ranking: CAMIR

• A new mechanism to rank and weigh vertex properties
  • Identifies the importance of each attribute and edge-type property

• A unified similarity matrix for attributed multi-graphs
  • Efficiently combines vertex properties

• Identifies clusters of arbitrary sizes and shapes
  • Effective in terms of clustering accuracy and computational time


Clustering Attributed Multi-graphs with Information Ranking

Andreas Papadopoulos, Dimitrios Rafailidis, George Pallis, Marios D. Dikaiakos

Department of Computer Science, University of Cyprus

Thank You!