Upload
erika-adams
View
233
Download
8
Embed Size (px)
Citation preview
1
Distinguishing Objects with Identical Names in Relational Databases
2
Motivations• Different objects may share the same name
– In AllMusic.com, 72 songs and 3 albums named “Forgotten” or “The Forgotten”
– In DBLP, 141 papers are written by at least 14 “Wei Wang”
– How to distinguish the authors of the 141 papers?
3
Wei Wang, Jiong Yang, Richard Muntz
VLDB 1997
Jinze Liu, Wei Wang ICDM 2004
Haixun Wang, Wei Wang, Jiong Yang, Philip S. Yu
SIGMOD 2002
Jiong Yang, Hwanjo Yu, Wei Wang, Jiawei Han
CSB 2003
Jiong Yang, Jinze Liu, Wei Wang KDD 2004
Wei Wang, Haifeng Jiang, Hongjun Lu,
Jeffrey Yu
VLDB 2004
Hongjun Lu, Yidong Yuan, Wei Wang, Xuemin Lin
ICDE 2005
Wei Wang, Xuemin Lin ADMA 2005
Haixun Wang, Wei Wang, Baile Shi, Peng Wang
ICDM 2005
Yongtai Zhu, Wei Wang, Jian Pei, Baile Shi, Chen Wang
KDD 2004
Aidong Zhang, Yuqing Song, Wei Wang
WWW 2003
Wei Wang, Jian Pei, Jiawei Han
CIKM 2002
Jian Pei, Daxin Jiang, Aidong Zhang
ICDE 2005
Jian Pei, Jiawei Han, Hongjun Lu, et al.
ICDM 2001
(1) Wei Wang at UNC (2) Wei Wang at UNSW, Australia(3) Wei Wang at Fudan Univ., China (4) Wei Wang at SUNY Buffalo
(1)
(3)
(2)
(4)
4
Challenges of Object Distinction• Related to duplicate detection, but
– Textual similarity cannot be used– Different references appear in different contexts
(e.g., different papers), and thus seldom share common attributes
– Each reference is associated with limited information
• We need to carefully design an approach and use all information we have
5
Overview of DISTINCT• Measure similarity between references
– Linkages between references• Based on our empirical study, references to the same
object are more likely to be connected
– Neighbor tuples of each reference• Can indicate similarity between their contexts
• References clustering– Group references according to their similarities
6
Similarity 1: Link-based Similarity• Indicate the overall strength of connections
between two references– We use random walk probability between the two
tuples containing the references– Random walk probabilities along different join
paths are handled separately• Because different join paths have different semantic
meanings• Only consider join paths of length at most 2L (L is the
number of steps of propagating probabilities)
7
Example of Random Walk
Wei Wangvldb/wangym97
vldb/wangym97STING: A Statistical Information Grid
Approach to Spatial Data Mining vldb/vldb97
vldb/vldb97 Very Large Data Bases 1997 Athens, Greece
Jiong Yangvldb/wangym97
Richard Muntzvldb/wangym97
Jiong Yang
Richard Muntz
AuthorsPublish
Publications
Proceedings
1.0
1.0
0.5
0.5
0.5
0.5
1.0
8
Path Decomposition• It is very expensive to propagate IDs and
probabilities along a long join path• Divide each join path into two parts with equal (or
almost equal) length– Compute the probability from each target object to each
tuple, and from each tuple to each target object
0.1/0.25
0.5/0.5
0.4/0.5
0/0
R2
0.2/0.5
0.2/1
0.2/1
0.2/0.5
0.2/0.5
1/1
R1Rt0/0
0/0
Origin of probability propagation
9
Path Decomposition (cont.)• When computing probability of walking from
one target object to another– Combine the forward and backward probabilities
R1
R2
t1
t2
t3
t4
… …
… …
… …… …
… …
… …
… …… …
R1: 0.1/0.25 R2: 0.2/0.1
Probability of walking from O1 to O2 through t1 is 0.1*0.1 = 0.01
10
Similarity 2: Neighborhood Similarity• Find the neighbor tuples of each reference
– Neighbor tuples within L joins
• Weights of neighbor tuples– Different neighbor tuples have different
connection to the reference– Assign each neighbor tuple a weight, which is
the probability of walking from the reference to this tuple
• Similarity: Set resemblance between two sets of neighbor tuples
11
Example of Neighbor Objects
Wei Wangvldb/wangym97Jiong Yang
Richard Muntz
Authors
Publish
Jinze Liu
Haixun Wang
……
0.2
0.2
0.1
0.05
12
Automatically Build A Training Set• Find objects with unique names
– For persons’ names• A person’s name = first name + last name (+middle name)• A rare first name + a rare last name → this name is likely to be
unique• For other types of objects (paper titles, product names, …),
consider the IDF of each term in object names
• Use two references to an object with unique name as a positive example
• Use two references to different objects as a negative example
Johannes Gehrke
13
Selecting Important Join Paths
• Each pair of training objects are connected with certain probability via each join path
• Construct training set: Join path → Feature• Use SVM to learn the weight of each join path
A1 A2
B1 B2
C1 C2
conference
year
coauthor
Join Paths
……
……
……
……
……
……
14
Clustering References• We choose agglomerative hierarchical
clustering because– We do not know number of clusters (real entities)– We only know similarity between references – Equivalent references can be merged into a
cluster, which represents a single entity
15
How to measure similarity between clusters?
• Single-link (highest similarity between points in two clusters) ?– No, because references to different objects can be
connected.
• Complete-link (minimum similarity between them)?– No, because references to the same object may be
weakly connected.
• Average-link (average similarity between points in two clusters)?– A better measure
16
Problem with Average-link
• C2 is close to C1, but their average similarity is low• We use collective random walk probability: Probability of
walking from one cluster to another• Final measure:
Average neighborhood similarity and Collective random walk probability
C1 C2
C3
17
Clustering Procedure• Procedure
– Initialization: Use each reference as a cluster– Keep finding and merging the most similar pair of
clusters– Until no pair of clusters is similar enough
18
Efficient Computation• In agglomerative hierarchical clustering, one needs
to repeatedly compute similarity between clusters– When merging clusters C1 and C2 into C3, we need to
compute the similarity between C3 and any other cluster
– Very expensive when clusters are large
• We invent methods to compute similarity incrementally– Neighborhood similarity
– Random walk probability
19
Experimental Results• Distinguishing references to authors in DBLP
– Accuracy of reference clustering• True positive: Number of pairs of references to same
author in same cluster• False positive: Different authors, in same cluster• False negative: Same author, different clusters• True negative: Different authors, different clusters
Precision = TP/(TP+FP)
Recall = TP/(TP+FN)
f-measure = 2*precision*recall / (precision+recall)
Accuracy = TP/(TP+FP+FN)
20
Accuracy on Synthetic Tests• Select 1000 authors with at least 5 papers
• Merge G (G=2 to 10) authors into one group
• Use DISTINCT to distinguish each group of references
0.3
0.4
0.5
0.6
0.7
1
0.00
5
0.00
1
0.00
02
5.00
E-0
5
1.00
E-0
5
1E-1
0
min-sim
Acc
urac
y
G2
G4
G6
G8
G10
0
0.2
0.4
0.6
0.8
1
0 0.2 0.4 0.6 0.8 1
Recall
Pre
cisi
on
G2
G4
G6
G8
G10
21
0
0.2
0.4
0.6
0.8
2 4 6 8 10
Group size
Max
acc
urac
y
DISTINCTSet resemblance - unsuper.Random walk - unsuper.Combined
0
0.2
0.4
0.6
0.8
2 4 6 8 10
Group size
Max
f-m
easu
reDISTINCTSet resemblance - unsuper.Random walk - unsuper.Combined
Compare with “Existing Approaches”• Random walk and neighborhood similarity have
been used in duplicate detection• We combine them with our clustering approaches
for comparison
22
Real CasesName #author #ref accuracy precision recall f-measure
Hui Fang 3 9 1.0 1.0 1.0 1.0
Ajay Gupta 4 16 1.0 1.0 1.0 1.0
Joseph Hellerstein 2 151 0.81 1.0 0.81 0.895
Rakesh Kumar 2 36 1.0 1.0 1.0 1.0
Michael Wagner 5 29 0.395 1.0 0.395 0.566
Bing Liu 6 89 0.825 1.0 0.825 0.904
Jim Smith 3 19 0.829 0.888 0.926 0.906
Lei Wang 13 55 0.863 0.92 0.932 0.926
Wei Wang 14 141 0.716 0.855 0.814 0.834
Bin Yu 5 44 0.658 1.0 0.658 0.794
average 0.81 0.966 0.836 0.883
23
Real Cases: Comparison
0.4
0.5
0.6
0.7
0.8
0.9
1
accuracy f-measure
DISTINCT
Supervised setresemblance
Supervised randomwalkUnsupervisedcombined measure
Unsupervised setresemblance
Unsupervisedrandom walk
24
Distinguishing Different “Wei Wang”s
UNC-CH (57)
Fudan U, China(31)
UNSW, Australia(19)
SUNY Buffalo
(5)
Beijing Polytech
(3)
NU Singapore
(5)
Harbin UChina
(5)
Zhejiang UChina
(3)
Najing NormalChina
(3)
Ningbo TechChina
(2)
Purdue(2)
Beijing U ComChina(2)
Chongqing UChina
(2)
SUNYBinghamton
(2)
5
6
2
25
0
4
8
12
16
0 50 100 150 200 250 300 350
#references
Tim
e (s
econ
d)
Scalability• Agglomerative hierarchical clustering takes
quadratic time– Because it requires computing pair-wise similarity
26
Thank you!
27
Self-loop Phenomenon• In a relational database, each object connects to
other objects via different join paths– In DBLP: author → paper → conference → …
• An object usually connects to itself more frequently than most other objects of same type– E.g. an author in DBLP usually connects to himself
more frequently than to other authors
Charu Aggarwal TKDE vol16
TKDE vol15
KDD
TKDEassociate
editor
KDD04
Self-loops: connect to herself via proceedings, conferences, areas, …
28
An Experiment• Consider a relational database as a graph
– Use DBLP as an example• Randomly select an author x
– Perform random walks starting from x• With 8 steps
– For each author y, calculate probability of reaching y from x
• Not considering paths like “x→a→ … →a→x”– Rank all authors by the probability
• Perform the above procedure for 500 authors (with at least 10 papers) who are randomly selected from 31608 authors
29
Results• For each selected author x, rank all authors
by random walk probability from x
• Rank of x himselfRank of x within: Portion of authors with such “self-rank”
in the 500 authors
Top 0.1% 62%
Top 1% 86%
Top 10% 99%
Top 50% 100%
30
Results (curves)
0%
20%
40%
60%
80%
100%
0.01% 0.05% 0.1% 0.5% 1% 5% 10% 50%
Self-rank
Port
ion
of tu
ples
|
#path
Random walk
0%
20%
40%
60%
80%
100%
1% 2% 5% 10% 20% 50% 100%
Self-rank
Port
ion
of tu
ples
|
#path
Random walk
DBLP Database CS Dept Database