Learning Semantics-Preserving Distance Metrics for Clustering Graphical Data
Aparna S. Varde, Elke A. Rundensteiner, Carolina Ruiz, Mohammed Maniruzzaman and Richard D. Sisson Jr.
ACM SIGKDD: MDM-05
Chicago, Illinois, USA
Aug 21, 2005
Introduction
Experimental results in scientific domains are often plotted graphically
Graph: 2-D plot of one experimental parameter versus another
Domain: Heat Treating of Materials; example: Heat Transfer Curve
Domain semantics:
- LF: Leidenfrost point
- BP: Boiling point of cooling medium
- MAX: Maximum heat transfer
Motivating Example
Clustering is used to group graphs in order to compare them
Inferences drawn from the comparison help in decision support
Clustering groups objects based on similarity
Notion of similarity: distance
It is not known which distance metric best preserves semantics in clustering
Need to learn such a metric
Problem when clustering with Euclidean distance: graphs fall in the same cluster although their LF points differ
Proposed Approach: LearnMet
Given a training set with actual clusters of graphs (correct as per the domain)
Compare these with predicted clusters obtained from a clustering algorithm
Process:
- Guess an initial metric for clustering
- Refine the metric using the error between predicted and actual clusters
- Output the metric with error below a threshold as the learned metric
Categories of Distance Metrics
In the literature:
- Position-based, e.g., Euclidean [HK-01]: absolute position of objects
- Statistical, e.g., Maximum distance [PNC-99]: statistical observations
Introduced in our work [VRRMS-03/05]:
- Critical Distance: distance between critical regions on graphs, calculated in a domain-specific manner
[Figures: DMax(A,B) between Graphs A and B; DLF(A,B) and DBP(A,B) between Graphs A and B]
LearnMet Strategy
1. Initial Metric Step: Guess initial metric
2. Clustering Step: Do clustering with metric
3. Evaluation Step: Evaluate cluster accuracy
4. Adjustment Step: Adjust & re-execute / halt
5. Final Metric Step: Output final metric
1. Initial Metric Step
Input from domain experts:
- Distance types applicable to graphs
- Relative importance of each type (optional input)
Initial metric: each distance type forms a component
Initial Weight Heuristic: assign weights to components based on relative importance, or assign random weights
Metric definition: D = w1D1 + w2D2 + ... + wmDm
Example: D = 5D_Euclidean + 3D_Mean + 4D_Critical
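A minimal sketch of this weighted-sum metric, assuming each component distance is supplied as a function of two graphs (represented here as lists of (x, y) points); the toy component functions stand in for D_Euclidean and D_Mean and are not the paper's implementations.

```python
# Hypothetical sketch: D(a, b) = sum_i w_i * D_i(a, b), with each component
# distance supplied as a callable over two graphs.
def weighted_metric(components):
    """components: list of (weight, distance_function) pairs."""
    def D(a, b):
        return sum(w * d(a, b) for w, d in components)
    return D

# Toy component distances (assumed forms, for illustration only):
def d_euclidean(a, b):
    # Average pointwise Euclidean distance between corresponding points
    return sum(((xa - xb)**2 + (ya - yb)**2)**0.5
               for (xa, ya), (xb, yb) in zip(a, b)) / len(a)

def d_mean(a, b):
    # Difference between the mean y-values of the two graphs
    mean_y = lambda g: sum(y for _, y in g) / len(g)
    return abs(mean_y(a) - mean_y(b))

# Mirroring the slide's weights for two of the components:
D = weighted_metric([(5, d_euclidean), (3, d_mean)])
```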
2. Clustering Step
- Clustering algorithm, e.g., k-means
- k = number of actual clusters
- Notion of distance = D
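k-means with an arbitrary metric D is not directly supported by common library implementations, so this sketch substitutes a k-medoids-style loop; the setup (graphs as arbitrary objects, D as a callable) is an assumption for illustration, not the paper's code.

```python
import random

# K-medoids-style clustering under an arbitrary distance function D.
def cluster_with_metric(graphs, k, D, iters=20, seed=0):
    rng = random.Random(seed)
    medoids = rng.sample(range(len(graphs)), k)
    labels = [0] * len(graphs)
    for _ in range(iters):
        # Assign each graph to its nearest medoid under D
        labels = [min(range(k), key=lambda c: D(g, graphs[medoids[c]]))
                  for g in graphs]
        # Re-pick each medoid as the member minimizing total within-cluster distance
        new_medoids = []
        for c in range(k):
            members = [i for i, l in enumerate(labels) if l == c]
            if not members:
                new_medoids.append(medoids[c])
                continue
            new_medoids.append(min(
                members,
                key=lambda i: sum(D(graphs[i], graphs[j]) for j in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return labels
```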
3. Cluster Evaluation Step
For each pair of graphs:
- True Positive (TP): same actual, same predicted cluster, e.g., (g1, g2)
- True Negative (TN): different actual, different predicted clusters, e.g., (g2, g3)
- False Positive (FP): different actual, same predicted cluster, e.g., (g3, g4)
- False Negative (FN): same actual, different predicted clusters, e.g., (g4, g5)
Error measure: Failure Rate "FR"
FR = (FP+FN)/(TP+FP+TN+FN)
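The four pair categories and FR can be computed directly from actual and predicted cluster labels; this is a sketch of that bookkeeping, not the paper's code.

```python
from itertools import combinations

# Pair-based evaluation: each pair of graphs is counted as TP/TN/FP/FN
# depending on whether its actual and predicted cluster labels agree.
def failure_rate(actual, predicted):
    tp = tn = fp = fn = 0
    for i, j in combinations(range(len(actual)), 2):
        same_actual = actual[i] == actual[j]
        same_pred = predicted[i] == predicted[j]
        if same_actual and same_pred:
            tp += 1
        elif not same_actual and not same_pred:
            tn += 1
        elif not same_actual and same_pred:
            fp += 1
        else:
            fn += 1
    return (fp + fn) / (tp + tn + fp + fn)
```

Because the comparison is over pairs, the result is invariant to renaming cluster labels, which is why predicted clusters need no alignment with the actual ones.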
3. Cluster Evaluation (Contd.)
Total number of graphs = G
Number of pairs = C(G,2) = G!/(2!(G-2)!)
E.g., 25 graphs give 300 pairs
Pairs per epoch (ppe): a distinct combination of pairs
- If ppe = 25, total distinct combinations = C(300,25) = 1.95 x 10^36
- Avoids overfitting, reduces time complexity
Error threshold (t): extent of error (FR) allowed
- If (FR < t), clustering is accurate
Example: ppe = 15, t = 0.1; TP = 2, TN = 10, FP = 1 (g3,g4), FN = 2 (g4,g5), (g4,g6)
FR = (1+2)/15 = 0.2; since FR > t, the clustering is not accurate
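The counts above can be checked with `math.comb`:

```python
import math

G = 25
pairs = math.comb(G, 2)          # C(25, 2) = 300 pairs of graphs
samples = math.comb(pairs, 25)   # C(300, 25) distinct epoch samples, ~1.95e36
fr = (1 + 2) / 15                # slide's example: FP=1, FN=2, ppe=15 -> 0.2
```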
3. Cluster Evaluation (Contd.)
Distance between a pair of graphs: D(ga,gb) = w1D1(ga,gb) + ... + wmDm(ga,gb)
D_FN = (1/FN) * sum over FN pairs of D(ga,gb); cause of error: D_FN too high
D_FP = (1/FP) * sum over FP pairs of D(ga,gb); cause of error: D_FP too low
Example: D_FN = [D(g4,g5) + D(g4,g6)] / 2; D_FP = D(g3,g4) / 1
4. Weight Adjustment Step
FN pairs: to reduce error, decrease D_FN
FN Heuristic: decrease weights in proportion to the distance contribution of each component
For each component: wi' = wi - D_FNi/D_FN, where D_FNi = (1/FN) * sum over FN pairs of Di(ga,gb)
Example: D = 5D_Euclidean + 3D_Mean + 4D_Critical
D_FN = 100, D_FN_Euclidean = 80, D_FN_Mean = 1, D_FN_Critical = 19
w_Euclidean' = 5 - 80/100 = 5 - 0.8 = 4.2
w_Mean' = 3 - 1/100 = 3 - 0.01 = 2.99
w_Critical' = 4 - 19/100 = 4 - 0.19 = 3.81
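The FN worked example above can be reproduced in a few lines; the dict layout is an illustrative choice, not the paper's representation.

```python
# Each weight is decreased by its component's share of the average
# false-negative distance D_FN, as in the FN heuristic.
weights = {"Euclidean": 5.0, "Mean": 3.0, "Critical": 4.0}
D_FN = 100.0
D_FN_i = {"Euclidean": 80.0, "Mean": 1.0, "Critical": 19.0}
adjusted = {c: w - D_FN_i[c] / D_FN for c, w in weights.items()}
# matches the slide: Euclidean 4.2, Mean 2.99, Critical 3.81
```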
4. Weight Adjustment (Contd.)
FP pairs: to reduce error, increase D_FP
FP Heuristic: increase weights in proportion to the distance contribution of each component
For each component: wi'' = wi + D_FPi/D_FP, where D_FPi = (1/FP) * sum over FP pairs of Di(ga,gb)
Example: D = 5D_Euclidean + 3D_Mean + 4D_Critical
D_FP = 200, D_FP_Euclidean = 15, D_FP_Mean = 85, D_FP_Critical = 100
w_Euclidean'' = 5 + 15/200 = 5 + 0.075 = 5.075
w_Mean'' = 3 + 85/200 = 3 + 0.425 = 3.425
w_Critical'' = 4 + 100/200 = 4 + 0.5 = 4.5
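The symmetric FP worked example, with the same illustrative layout:

```python
# Each weight is increased by its component's share of the average
# false-positive distance D_FP, as in the FP heuristic.
weights = {"Euclidean": 5.0, "Mean": 3.0, "Critical": 4.0}
D_FP = 200.0
D_FP_i = {"Euclidean": 15.0, "Mean": 85.0, "Critical": 100.0}
adjusted = {c: w + D_FP_i[c] / D_FP for c, w in weights.items()}
# matches the slide: Euclidean 5.075, Mean 3.425, Critical 4.5
```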
4. Weight Adjustment (Contd.)
Combining the two: Weight Adjustment Heuristic
wi''' = max(0, wi - FN*(D_FNi/D_FN) + FP*(D_FPi/D_FP))
D''' = w1'''D1 + w2'''D2 + ... + wm'''Dm
Clustering is done with the new metric D'''
If the clustering is accurate, a confirmatory test runs with this metric for 2 more epochs
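A sketch of the combined heuristic, reading the clamp as max(0, ...) so that no weight goes negative; FN and FP are taken to be the pair counts and D_FN_i, D_FP_i the per-component contributions, which is an interpretation of the slide's formula rather than a confirmed implementation.

```python
# Combined weight adjustment: decrease by the FN term, increase by the
# FP term, and floor each weight at zero.
def adjust_weights(w, FN, D_FN, D_FN_i, FP, D_FP, D_FP_i):
    return [max(0.0, wi - FN * (D_FN_i[i] / D_FN) + FP * (D_FP_i[i] / D_FP))
            for i, wi in enumerate(w)]
```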
5. Final Metric Step
If the error is below the threshold, output the corresponding D as the learned metric
If the maximum number of epochs is reached, output the D with the lowest error over all epochs
Example: D = 4.671D_Euclidean + 5.564D_Mean + 3.074D_Critical
Experimental Evaluation of LearnMet
Evaluated rigorously in the Heat Treating domain
Summary of evaluation:
- Number of graphs in training set: G = 25
- Number of pairs in training set: C(G,2) = 300
- Number of pairs per epoch: ppe = 25
- Number of distinct combinations of pairs: C(300,25) = 1.95 x 10^36
- Error threshold: t = 0.1 = 10%
- Distinct test set with 300 pairs of graphs from 25 different graphs
- Actual clusters over the test set given by experts
Initial metrics:
- DE1, DE2: given by domain experts
- EQU: equal weights for all components
- Several metrics with random weights, e.g., RND1, RND2
Experimental Evaluation (Contd.)
[Table: Initial Metrics in LearnMet Experiments]
[Table: Learned Metrics and Number of Epochs to Learn]
Observations during Training
[Charts: Failure Rate over the training set vs. epochs for experiments DE1 (about 17 epochs), DE2 (about 19), EQU (about 43), RND1 (about 57), and RND2 (about 37); failure rates range between roughly 0.0 and 0.45]
Observations during Testing
Graphs in the test set were clustered with the learned metrics and with Euclidean Distance (ED)
Predicted clusters were compared with actual clusters
Accuracy: Success Rate "SR"
SR = (TP+TN)/(TP+TN+FP+FN)
[Chart: Clustering accuracy (success rate) over the test set for metrics DE1, DE2, EQU, RND1, RND2, and ED]
Test set observation: accuracy with LearnMet metrics is higher
Related Work
- Learning nearest neighbors in high-dimensional spaces [HAK-00]: focus is dimensionality reduction; does not deal with graphs
- Distance metric learning given a basic formula [XNJR-03]: does not deal with graphs or the relative importance of features
- Similarity search in multimedia databases [KB-04]: uses various metrics in different applications; does not learn a single metric
- Fourier transforms [F-55]: do not preserve critical regions in the domain, due to the nature of the transform
- Genetic algorithms [F-58]: if used for feature selection, give lower accuracy; they lack domain knowledge
- Linear regression [A-73]: distance values between pairs of graphs are not known as a training set
- Neural networks [B-96]: poor interpretability; hard to incorporate domain knowledge
Conclusions
LearnMet is proposed to learn semantics-preserving distance metrics for graphs
It minimizes the error between predicted and actual clusters of graphs
Ongoing work: maximizing the accuracy of LearnMet
- Finding a good value for the number of pairs per epoch
- Selecting components without domain expert input
- Defining scaling factors for weight adjustments