Learning Semantics-Preserving Distance Metrics for Clustering Graphical Data
Aparna S. Varde, Elke A. Rundensteiner, Carolina Ruiz, Mohammed Maniruzzaman and Richard D. Sisson Jr.
ACM SIGKDD: MDM-05
Chicago, Illinois, USA
Aug 21, 2005
Introduction
Experimental results in scientific domains are often plotted graphically
Graph: 2-D plot of one experimental parameter versus another
Domain: Heat Treating of Materials; example: Heat Transfer Curve
Domain semantics:
- LF: Leidenfrost point
- BP: Boiling point of cooling medium
- MAX: Maximum heat transfer
Motivating Example
Clustering is used to group graphs in order to compare them
Inferences drawn from the comparison help in decision support
Clustering groups objects based on similarity
Notion of similarity: distance
It is not known which distance metric best preserves semantics in clustering
Need to learn such a metric
Problem when clustering with Euclidean distance: graphs fall in the same cluster although their LF points differ
Proposed Approach: LearnMet
Given a training set with actual clusters of graphs (correct as per the domain)
Compare these with predicted clusters obtained from a clustering algorithm
Process:
- Guess an initial metric for clustering
- Refine the metric using the error between predicted and actual clusters
- Output the metric with error below a threshold as the learned metric
Categories of Distance Metrics
In the literature:
- Position-based, e.g., Euclidean [HK-01]: absolute position of objects
- Statistical, e.g., Maximum distance [PNC-99]: statistical observations
Introduced in our work [VRRMS-03/05]:
- Critical Distance: distance between critical regions on graphs, calculated in a domain-specific manner
[Figures: DMax(A,B) between Graphs A and B; DLF(A,B) and DBP(A,B) between Graphs A and B]
LearnMet Strategy
1. Initial Metric Step: Guess initial metric
2. Clustering Step: Do clustering with metric
3. Evaluation Step: Evaluate cluster accuracy
4. Adjustment Step: Adjust & re-execute / halt
5. Final Metric Step: Output final metric
1. Initial Metric Step
Input from domain experts:
- Distance types applicable to graphs
- Relative importance of each type (optional input)
Initial metric: each distance type forms a component
Initial Weight Heuristic: assign weights to components based on relative importance, or assign random weights
Metric definition: D = w1D1 + w2D2 + ... + wmDm
Example: D = 5D_Euclidean + 3D_Mean + 4D_Critical
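A minimal sketch of this weighted-sum metric, assuming each component distance is supplied as a function of two graphs (represented here as lists of (x, y) points); the toy component functions stand in for D_Euclidean and D_Mean and are not the paper's implementations.

```python
# Hypothetical sketch: D(a, b) = sum_i w_i * D_i(a, b), with each component
# distance supplied as a callable over two graphs.
def weighted_metric(components):
    """components: list of (weight, distance_function) pairs."""
    def D(a, b):
        return sum(w * d(a, b) for w, d in components)
    return D

# Toy component distances (assumed forms, for illustration only):
def d_euclidean(a, b):
    # Average pointwise Euclidean distance between corresponding points
    return sum(((xa - xb)**2 + (ya - yb)**2)**0.5
               for (xa, ya), (xb, yb) in zip(a, b)) / len(a)

def d_mean(a, b):
    # Difference between the mean y-values of the two graphs
    mean_y = lambda g: sum(y for _, y in g) / len(g)
    return abs(mean_y(a) - mean_y(b))

# Mirroring the slide's weights for two of the components:
D = weighted_metric([(5, d_euclidean), (3, d_mean)])
```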
2. Clustering Step
- Clustering algorithm, e.g., k-means
- k = number of actual clusters
- Notion of distance = D
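k-means with an arbitrary metric D is not directly supported by common library implementations, so this sketch substitutes a k-medoids-style loop; the setup (graphs as arbitrary objects, D as a callable) is an assumption for illustration, not the paper's code.

```python
import random

# K-medoids-style clustering under an arbitrary distance function D.
def cluster_with_metric(graphs, k, D, iters=20, seed=0):
    rng = random.Random(seed)
    medoids = rng.sample(range(len(graphs)), k)
    labels = [0] * len(graphs)
    for _ in range(iters):
        # Assign each graph to its nearest medoid under D
        labels = [min(range(k), key=lambda c: D(g, graphs[medoids[c]]))
                  for g in graphs]
        # Re-pick each medoid as the member minimizing total within-cluster distance
        new_medoids = []
        for c in range(k):
            members = [i for i, l in enumerate(labels) if l == c]
            if not members:
                new_medoids.append(medoids[c])
                continue
            new_medoids.append(min(
                members,
                key=lambda i: sum(D(graphs[i], graphs[j]) for j in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return labels
```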
3. Cluster Evaluation Step
For each pair of graphs:
- True Positive (TP): same actual, same predicted cluster, e.g., (g1, g2)
- True Negative (TN): different actual, different predicted clusters, e.g., (g2, g3)
- False Positive (FP): different actual, same predicted cluster, e.g., (g3, g4)
- False Negative (FN): same actual, different predicted clusters, e.g., (g4, g5)
Error measure: Failure Rate "FR"
FR = (FP+FN)/(TP+FP+TN+FN)
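The four pair categories and FR can be computed directly from actual and predicted cluster labels; this is a sketch of that bookkeeping, not the paper's code.

```python
from itertools import combinations

# Pair-based evaluation: each pair of graphs is counted as TP/TN/FP/FN
# depending on whether its actual and predicted cluster labels agree.
def failure_rate(actual, predicted):
    tp = tn = fp = fn = 0
    for i, j in combinations(range(len(actual)), 2):
        same_actual = actual[i] == actual[j]
        same_pred = predicted[i] == predicted[j]
        if same_actual and same_pred:
            tp += 1
        elif not same_actual and not same_pred:
            tn += 1
        elif not same_actual and same_pred:
            fp += 1
        else:
            fn += 1
    return (fp + fn) / (tp + tn + fp + fn)
```

Because the comparison is over pairs, the result is invariant to renaming cluster labels, which is why predicted clusters need no alignment with the actual ones.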
3. Cluster Evaluation (Contd.)
Total number of graphs = G
Number of pairs = C(G,2) = G!/(2!(G-2)!)
E.g., 25 graphs give 300 pairs
Pairs per epoch (ppe): a distinct combination of pairs
- If ppe = 25, total distinct combinations = C(300,25) = 1.95 x 10^36
- Avoids overfitting, reduces time complexity
Error threshold (t): extent of error (FR) allowed
- If (FR < t), clustering is accurate
Example: ppe = 15, t = 0.1; TP = 2, TN = 10, FP = 1 (g3,g4), FN = 2 (g4,g5), (g4,g6)
FR = (1+2)/15 = 0.2; since FR > t, the clustering is not accurate
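The counts above can be checked with `math.comb`:

```python
import math

G = 25
pairs = math.comb(G, 2)          # C(25, 2) = 300 pairs of graphs
samples = math.comb(pairs, 25)   # C(300, 25) distinct epoch samples, ~1.95e36
fr = (1 + 2) / 15                # slide's example: FP=1, FN=2, ppe=15 -> 0.2
```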
3. Cluster Evaluation (Contd.)
Distance between a pair of graphs: D(ga,gb) = w1D1(ga,gb) + ... + wmDm(ga,gb)
D_FN = (1/FN) * sum over FN pairs of D(ga,gb); cause of error: D_FN too high
D_FP = (1/FP) * sum over FP pairs of D(ga,gb); cause of error: D_FP too low
Example: D_FN = [D(g4,g5) + D(g4,g6)] / 2; D_FP = D(g3,g4) / 1
4. Weight Adjustment Step
FN pairs: to reduce error, decrease D_FN
FN Heuristic: decrease weights in proportion to the distance contribution of each component
For each component: wi' = wi - D_FNi/D_FN, where D_FNi = (1/FN) * sum over FN pairs of Di(ga,gb)
Example: D = 5D_Euclidean + 3D_Mean + 4D_Critical
D_FN = 100, D_FN_Euclidean = 80, D_FN_Mean = 1, D_FN_Critical = 19
w_Euclidean' = 5 - 80/100 = 5 - 0.8 = 4.2
w_Mean' = 3 - 1/100 = 3 - 0.01 = 2.99
w_Critical' = 4 - 19/100 = 4 - 0.19 = 3.81
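The FN worked example above can be reproduced in a few lines; the dict layout is an illustrative choice, not the paper's representation.

```python
# Each weight is decreased by its component's share of the average
# false-negative distance D_FN, as in the FN heuristic.
weights = {"Euclidean": 5.0, "Mean": 3.0, "Critical": 4.0}
D_FN = 100.0
D_FN_i = {"Euclidean": 80.0, "Mean": 1.0, "Critical": 19.0}
adjusted = {c: w - D_FN_i[c] / D_FN for c, w in weights.items()}
# matches the slide: Euclidean 4.2, Mean 2.99, Critical 3.81
```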
4. Weight Adjustment (Contd.)
FP pairs: to reduce error, increase D_FP
FP Heuristic: increase weights in proportion to the distance contribution of each component
For each component: wi'' = wi + D_FPi/D_FP, where D_FPi = (1/FP) * sum over FP pairs of Di(ga,gb)
Example: D = 5D_Euclidean + 3D_Mean + 4D_Critical
D_FP = 200, D_FP_Euclidean = 15, D_FP_Mean = 85, D_FP_Critical = 100
w_Euclidean'' = 5 + 15/200 = 5 + 0.075 = 5.075
w_Mean'' = 3 + 85/200 = 3 + 0.425 = 3.425
w_Critical'' = 4 + 100/200 = 4 + 0.5 = 4.5
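The symmetric FP worked example, with the same illustrative layout:

```python
# Each weight is increased by its component's share of the average
# false-positive distance D_FP, as in the FP heuristic.
weights = {"Euclidean": 5.0, "Mean": 3.0, "Critical": 4.0}
D_FP = 200.0
D_FP_i = {"Euclidean": 15.0, "Mean": 85.0, "Critical": 100.0}
adjusted = {c: w + D_FP_i[c] / D_FP for c, w in weights.items()}
# matches the slide: Euclidean 5.075, Mean 3.425, Critical 4.5
```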
4. Weight Adjustment (Contd.)
Combining the two: Weight Adjustment Heuristic
wi''' = max(0, wi - FN*(D_FNi/D_FN) + FP*(D_FPi/D_FP))
D''' = w1'''D1 + w2'''D2 + ... + wm'''Dm
Clustering is done with the new metric D'''
If the clustering is accurate, a confirmatory test runs with this metric for 2 more epochs
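A sketch of the combined heuristic, reading the clamp as max(0, ...) so that no weight goes negative; FN and FP are taken to be the pair counts and D_FN_i, D_FP_i the per-component contributions, which is an interpretation of the slide's formula rather than a confirmed implementation.

```python
# Combined weight adjustment: decrease by the FN term, increase by the
# FP term, and floor each weight at zero.
def adjust_weights(w, FN, D_FN, D_FN_i, FP, D_FP, D_FP_i):
    return [max(0.0, wi - FN * (D_FN_i[i] / D_FN) + FP * (D_FP_i[i] / D_FP))
            for i, wi in enumerate(w)]
```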
5. Final Metric Step
If the error is below the threshold, output the corresponding D as the learned metric
If the maximum number of epochs is reached, output the D with the lowest error over all epochs
Example: D = 4.671D_Euclidean + 5.564D_Mean + 3.074D_Critical
Experimental Evaluation of LearnMet
Evaluated rigorously in the Heat Treating domain
Summary of evaluation:
- Number of graphs in training set: G = 25
- Number of pairs in training set: C(G,2) = 300
- Number of pairs per epoch: ppe = 25
- Number of distinct combinations of pairs: C(300,25) = 1.95 x 10^36
- Error threshold: t = 0.1 = 10%
- Distinct test set with 300 pairs of graphs from 25 different graphs
- Actual clusters over the test set given by experts
Initial metrics:
- DE1, DE2: given by domain experts
- EQU: equal weights for all components
- Several metrics with random weights, e.g., RND1, RND2
Experimental Evaluation (Contd.)
[Table: Initial Metrics in LearnMet Experiments]
[Table: Learned Metrics and Number of Epochs to Learn]
Observations during Training
[Charts: Failure Rate over the training set vs. epochs for experiments DE1 (about 17 epochs), DE2 (about 19), EQU (about 43), RND1 (about 57), and RND2 (about 37); failure rates range between roughly 0.0 and 0.45]
Observations during Testing
Graphs in the test set were clustered with the learned metrics and with Euclidean Distance (ED)
Predicted clusters were compared with actual clusters
Accuracy: Success Rate "SR"
SR = (TP+TN)/(TP+TN+FP+FN)
[Chart: Clustering accuracy (success rate) over the test set for metrics DE1, DE2, EQU, RND1, RND2, and ED]
Test set observation: accuracy with LearnMet metrics is higher
Related Work
- Learning nearest neighbors in high-dimensional spaces [HAK-00]: focus is dimensionality reduction; does not deal with graphs
- Distance metric learning given a basic formula [XNJR-03]: does not deal with graphs or the relative importance of features
- Similarity search in multimedia databases [KB-04]: uses various metrics in different applications; does not learn a single metric
- Fourier transforms [F-55]: do not preserve critical regions in the domain, due to the nature of the transform
- Genetic algorithms [F-58]: if used for feature selection, give lower accuracy; they lack domain knowledge
- Linear regression [A-73]: distance values between pairs of graphs are not known as a training set
- Neural networks [B-96]: poor interpretability; hard to incorporate domain knowledge
Conclusions
LearnMet is proposed to learn semantics-preserving distance metrics for graphs
It minimizes the error between predicted and actual clusters of graphs
Ongoing work: maximizing the accuracy of LearnMet
- Finding a good value for the number of pairs per epoch
- Selecting components without domain expert input
- Defining scaling factors for weight adjustments