30
1 Distinguishing Objects with Identical Names in Relational Databases

1 Distinguishing Objects with Identical Names in Relational Databases

Embed Size (px)

Citation preview

Page 1: 1 Distinguishing Objects with Identical Names in Relational Databases

1

Distinguishing Objects with Identical Names in Relational Databases

Page 2: 1 Distinguishing Objects with Identical Names in Relational Databases

2

Motivations• Different objects may share the same name

– In AllMusic.com, 72 songs and 3 albums named “Forgotten” or “The Forgotten”

– In DBLP, 141 papers are written by at least 14 “Wei Wang”

– How to distinguish the authors of the 141 papers?

Page 3: 1 Distinguishing Objects with Identical Names in Relational Databases

3

Wei Wang, Jiong Yang, Richard Muntz

VLDB 1997

Jinze Liu, Wei Wang ICDM 2004

Haixun Wang, Wei Wang, Jiong Yang, Philip S. Yu

SIGMOD 2002

Jiong Yang, Hwanjo Yu, Wei Wang, Jiawei Han

CSB 2003

Jiong Yang, Jinze Liu, Wei Wang KDD 2004

Wei Wang, Haifeng Jiang, Hongjun Lu,

Jeffrey Yu

VLDB 2004

Hongjun Lu, Yidong Yuan, Wei Wang, Xuemin Lin

ICDE 2005

Wei Wang, Xuemin Lin ADMA 2005

Haixun Wang, Wei Wang, Baile Shi, Peng Wang

ICDM 2005

Yongtai Zhu, Wei Wang, Jian Pei, Baile Shi, Chen Wang

KDD 2004

Aidong Zhang, Yuqing Song, Wei Wang

WWW 2003

Wei Wang, Jian Pei, Jiawei Han

CIKM 2002

Jian Pei, Daxin Jiang, Aidong Zhang

ICDE 2005

Jian Pei, Jiawei Han, Hongjun Lu, et al.

ICDM 2001

(1) Wei Wang at UNC (2) Wei Wang at UNSW, Australia(3) Wei Wang at Fudan Univ., China (4) Wei Wang at SUNY Buffalo

(1)

(3)

(2)

(4)

Page 4: 1 Distinguishing Objects with Identical Names in Relational Databases

4

Challenges of Object Distinction• Related to duplicate detection, but

– Textual similarity cannot be used– Different references appear in different contexts

(e.g., different papers), and thus seldom share common attributes

– Each reference is associated with limited information

• We need to carefully design an approach and use all information we have

Page 5: 1 Distinguishing Objects with Identical Names in Relational Databases

5

Overview of DISTINCT• Measure similarity between references

– Linkages between references• Based on our empirical study, references to the same

object are more likely to be connected

– Neighbor tuples of each reference• Can indicate similarity between their contexts

• References clustering– Group references according to their similarities

Page 6: 1 Distinguishing Objects with Identical Names in Relational Databases

6

Similarity 1: Link-based Similarity• Indicate the overall strength of connections

between two references– We use random walk probability between the two

tuples containing the references– Random walk probabilities along different join

paths are handled separately• Because different join paths have different semantic

meanings• Only consider join paths of length at most 2L (L is the

number of steps of propagating probabilities)

Page 7: 1 Distinguishing Objects with Identical Names in Relational Databases

7

Example of Random Walk

Wei Wangvldb/wangym97

vldb/wangym97STING: A Statistical Information Grid

Approach to Spatial Data Mining vldb/vldb97

vldb/vldb97 Very Large Data Bases 1997 Athens, Greece

Jiong Yangvldb/wangym97

Richard Muntzvldb/wangym97

Jiong Yang

Richard Muntz

AuthorsPublish

Publications

Proceedings

1.0

1.0

0.5

0.5

0.5

0.5

1.0

Page 8: 1 Distinguishing Objects with Identical Names in Relational Databases

8

Path Decomposition• It is very expensive to propagate IDs and

probabilities along a long join path• Divide each join path into two parts with equal (or

almost equal) length– Compute the probability from each target object to each

tuple, and from each tuple to each target object

0.1/0.25

0.5/0.5

0.4/0.5

0/0

R2

0.2/0.5

0.2/1

0.2/1

0.2/0.5

0.2/0.5

1/1

R1Rt0/0

0/0

Origin of probability propagation

Page 9: 1 Distinguishing Objects with Identical Names in Relational Databases

9

Path Decomposition (cont.)• When computing probability of walking from

one target object to another– Combine the forward and backward probabilities

R1

R2

t1

t2

t3

t4

… …

… …

… …… …

… …

… …

… …… …

R1: 0.1/0.25 R2: 0.2/0.1

Probability of walking from O1 to O2 through t1 is 0.1*0.1 = 0.01

Page 10: 1 Distinguishing Objects with Identical Names in Relational Databases

10

Similarity 2: Neighborhood Similarity• Find the neighbor tuples of each reference

– Neighbor tuples within L joins

• Weights of neighbor tuples– Different neighbor tuples have different

connection to the reference– Assign each neighbor tuple a weight, which is

the probability of walking from the reference to this tuple

• Similarity: Set resemblance between two sets of neighbor tuples

Page 11: 1 Distinguishing Objects with Identical Names in Relational Databases

11

Example of Neighbor Objects

Wei Wangvldb/wangym97Jiong Yang

Richard Muntz

Authors

Publish

Jinze Liu

Haixun Wang

……

0.2

0.2

0.1

0.05

Page 12: 1 Distinguishing Objects with Identical Names in Relational Databases

12

Automatically Build A Training Set• Find objects with unique names

– For persons’ names• A person’s name = first name + last name (+middle name)• A rare first name + a rare last name → this name is likely to be

unique• For other types of objects (paper titles, product names, …),

consider the IDF of each term in object names

• Use two references to an object with unique name as a positive example

• Use two references to different objects as a negative example

Johannes Gehrke

Page 13: 1 Distinguishing Objects with Identical Names in Relational Databases

13

Selecting Important Join Paths

• Each pair of training objects are connected with certain probability via each join path

• Construct training set: Join path → Feature• Use SVM to learn the weight of each join path

A1 A2

B1 B2

C1 C2

conference

year

coauthor

Join Paths

……

……

……

……

……

……

Page 14: 1 Distinguishing Objects with Identical Names in Relational Databases

14

Clustering References• We choose agglomerative hierarchical

clustering because– We do not know number of clusters (real entities)– We only know similarity between references – Equivalent references can be merged into a

cluster, which represents a single entity

Page 15: 1 Distinguishing Objects with Identical Names in Relational Databases

15

How to measure similarity between clusters?

• Single-link (highest similarity between points in two clusters) ?– No, because references to different objects can be

connected.

• Complete-link (minimum similarity between them)?– No, because references to the same object may be

weakly connected.

• Average-link (average similarity between points in two clusters)?– A better measure

Page 16: 1 Distinguishing Objects with Identical Names in Relational Databases

16

Problem with Average-link

• C2 is close to C1, but their average similarity is low• We use collective random walk probability: Probability of

walking from one cluster to another• Final measure:

Average neighborhood similarity and Collective random walk probability

C1 C2

C3

Page 17: 1 Distinguishing Objects with Identical Names in Relational Databases

17

Clustering Procedure• Procedure

– Initialization: Use each reference as a cluster– Keep finding and merging the most similar pair of

clusters– Until no pair of clusters is similar enough

Page 18: 1 Distinguishing Objects with Identical Names in Relational Databases

18

Efficient Computation• In agglomerative hierarchical clustering, one needs

to repeatedly compute similarity between clusters– When merging clusters C1 and C2 into C3, we need to

compute the similarity between C3 and any other cluster

– Very expensive when clusters are large

• We invent methods to compute similarity incrementally– Neighborhood similarity

– Random walk probability

Page 19: 1 Distinguishing Objects with Identical Names in Relational Databases

19

Experimental Results• Distinguishing references to authors in DBLP

– Accuracy of reference clustering• True positive: Number of pairs of references to same

author in same cluster• False positive: Different authors, in same cluster• False negative: Same author, different clusters• True negative: Different authors, different clusters

Precision = TP/(TP+FP)

Recall = TP/(TP+FN)

f-measure = 2*precision*recall / (precision+recall)

Accuracy = TP/(TP+FP+FN)

Page 20: 1 Distinguishing Objects with Identical Names in Relational Databases

20

Accuracy on Synthetic Tests• Select 1000 authors with at least 5 papers

• Merge G (G=2 to 10) authors into one group

• Use DISTINCT to distinguish each group of references

0.3

0.4

0.5

0.6

0.7

1

0.00

5

0.00

1

0.00

02

5.00

E-0

5

1.00

E-0

5

1E-1

0

min-sim

Acc

urac

y

G2

G4

G6

G8

G10

0

0.2

0.4

0.6

0.8

1

0 0.2 0.4 0.6 0.8 1

Recall

Pre

cisi

on

G2

G4

G6

G8

G10

Page 21: 1 Distinguishing Objects with Identical Names in Relational Databases

21

0

0.2

0.4

0.6

0.8

2 4 6 8 10

Group size

Max

acc

urac

y

DISTINCTSet resemblance - unsuper.Random walk - unsuper.Combined

0

0.2

0.4

0.6

0.8

2 4 6 8 10

Group size

Max

f-m

easu

reDISTINCTSet resemblance - unsuper.Random walk - unsuper.Combined

Compare with “Existing Approaches”• Random walk and neighborhood similarity have

been used in duplicate detection• We combine them with our clustering approaches

for comparison

Page 22: 1 Distinguishing Objects with Identical Names in Relational Databases

22

Real CasesName #author #ref accuracy precision recall f-measure

Hui Fang 3 9 1.0 1.0 1.0 1.0

Ajay Gupta 4 16 1.0 1.0 1.0 1.0

Joseph Hellerstein 2 151 0.81 1.0 0.81 0.895

Rakesh Kumar 2 36 1.0 1.0 1.0 1.0

Michael Wagner 5 29 0.395 1.0 0.395 0.566

Bing Liu 6 89 0.825 1.0 0.825 0.904

Jim Smith 3 19 0.829 0.888 0.926 0.906

Lei Wang 13 55 0.863 0.92 0.932 0.926

Wei Wang 14 141 0.716 0.855 0.814 0.834

Bin Yu 5 44 0.658 1.0 0.658 0.794

average 0.81 0.966 0.836 0.883

Page 23: 1 Distinguishing Objects with Identical Names in Relational Databases

23

Real Cases: Comparison

0.4

0.5

0.6

0.7

0.8

0.9

1

accuracy f-measure

DISTINCT

Supervised setresemblance

Supervised randomwalkUnsupervisedcombined measure

Unsupervised setresemblance

Unsupervisedrandom walk

Page 24: 1 Distinguishing Objects with Identical Names in Relational Databases

24

Distinguishing Different “Wei Wang”s

UNC-CH (57)

Fudan U, China(31)

UNSW, Australia(19)

SUNY Buffalo

(5)

Beijing Polytech

(3)

NU Singapore

(5)

Harbin UChina

(5)

Zhejiang UChina

(3)

Najing NormalChina

(3)

Ningbo TechChina

(2)

Purdue(2)

Beijing U ComChina(2)

Chongqing UChina

(2)

SUNYBinghamton

(2)

5

6

2

Page 25: 1 Distinguishing Objects with Identical Names in Relational Databases

25

0

4

8

12

16

0 50 100 150 200 250 300 350

#references

Tim

e (s

econ

d)

Scalability• Agglomerative hierarchical clustering takes

quadratic time– Because it requires computing pair-wise similarity

Page 26: 1 Distinguishing Objects with Identical Names in Relational Databases

26

Thank you!

Page 27: 1 Distinguishing Objects with Identical Names in Relational Databases

27

Self-loop Phenomenon• In a relational database, each object connects to

other objects via different join paths– In DBLP: author → paper → conference → …

• An object usually connects to itself more frequently than most other objects of same type– E.g. an author in DBLP usually connects to himself

more frequently than to other authors

Charu Aggarwal TKDE vol16

TKDE vol15

KDD

TKDEassociate

editor

KDD04

Self-loops: connect to herself via proceedings, conferences, areas, …

Page 28: 1 Distinguishing Objects with Identical Names in Relational Databases

28

An Experiment• Consider a relational database as a graph

– Use DBLP as an example• Randomly select an author x

– Perform random walks starting from x• With 8 steps

– For each author y, calculate probability of reaching y from x

• Not considering paths like “x→a→ … →a→x”– Rank all authors by the probability

• Perform the above procedure for 500 authors (with at least 10 papers) who are randomly selected from 31608 authors

Page 29: 1 Distinguishing Objects with Identical Names in Relational Databases

29

Results• For each selected author x, rank all authors

by random walk probability from x

• Rank of x himselfRank of x within: Portion of authors with such “self-rank”

in the 500 authors

Top 0.1% 62%

Top 1% 86%

Top 10% 99%

Top 50% 100%

Page 30: 1 Distinguishing Objects with Identical Names in Relational Databases

30

Results (curves)

0%

20%

40%

60%

80%

100%

0.01% 0.05% 0.1% 0.5% 1% 5% 10% 50%

Self-rank

Port

ion

of tu

ples

|

#path

Random walk

0%

20%

40%

60%

80%

100%

1% 2% 5% 10% 20% 50% 100%

Self-rank

Port

ion

of tu

ples

|

#path

Random walk

DBLP Database CS Dept Database