1 Distinguishing Objects with Identical Names in Relational Databases

1

Distinguishing Objects with Identical Names in Relational Databases

2

Motivations• Different objects may share the same name

– In AllMusic.com, 72 songs and 3 albums named “Forgotten” or “The Forgotten”

– In DBLP, 141 papers are written by at least 14 “Wei Wang”

– How to distinguish the authors of the 141 papers?

3

Wei Wang, Jiong Yang, Richard Muntz

VLDB 1997

Jinze Liu, Wei Wang ICDM 2004

Haixun Wang, Wei Wang, Jiong Yang, Philip S. Yu

SIGMOD 2002

Jiong Yang, Hwanjo Yu, Wei Wang, Jiawei Han

CSB 2003

Jiong Yang, Jinze Liu, Wei Wang KDD 2004

Wei Wang, Haifeng Jiang, Hongjun Lu,

Jeffrey Yu

VLDB 2004

Hongjun Lu, Yidong Yuan, Wei Wang, Xuemin Lin

ICDE 2005

Wei Wang, Xuemin Lin ADMA 2005

Haixun Wang, Wei Wang, Baile Shi, Peng Wang

ICDM 2005

Yongtai Zhu, Wei Wang, Jian Pei, Baile Shi, Chen Wang

KDD 2004

Aidong Zhang, Yuqing Song, Wei Wang

WWW 2003

Wei Wang, Jian Pei, Jiawei Han

CIKM 2002

Jian Pei, Daxin Jiang, Aidong Zhang

ICDE 2005

Jian Pei, Jiawei Han, Hongjun Lu, et al.

ICDM 2001

(1) Wei Wang at UNC (2) Wei Wang at UNSW, Australia(3) Wei Wang at Fudan Univ., China (4) Wei Wang at SUNY Buffalo

(1)

(3)

(2)

(4)

4

Challenges of Object Distinction• Related to duplicate detection, but

– Textual similarity cannot be used– Different references appear in different contexts

(e.g., different papers), and thus seldom share common attributes

– Each reference is associated with limited information

• We need to carefully design an approach and use all information we have

5

Overview of DISTINCT• Measure similarity between references

– Linkages between references• Based on our empirical study, references to the same

object are more likely to be connected

– Neighbor tuples of each reference• Can indicate similarity between their contexts

• References clustering– Group references according to their similarities

6

Similarity 1: Link-based Similarity• Indicate the overall strength of connections

between two references– We use random walk probability between the two

tuples containing the references– Random walk probabilities along different join

paths are handled separately• Because different join paths have different semantic

meanings• Only consider join paths of length at most 2L (L is the

number of steps of propagating probabilities)

7

Example of Random Walk

Wei Wangvldb/wangym97

vldb/wangym97STING: A Statistical Information Grid

Approach to Spatial Data Mining vldb/vldb97

vldb/vldb97 Very Large Data Bases 1997 Athens, Greece

Jiong Yangvldb/wangym97

Richard Muntzvldb/wangym97

Jiong Yang

Richard Muntz

AuthorsPublish

Publications

Proceedings

1.0

1.0

0.5

0.5

0.5

0.5

1.0

8

Path Decomposition• It is very expensive to propagate IDs and

probabilities along a long join path• Divide each join path into two parts with equal (or

almost equal) length– Compute the probability from each target object to each

tuple, and from each tuple to each target object

0.1/0.25

0.5/0.5

0.4/0.5

0/0

R2

0.2/0.5

0.2/1

0.2/1

0.2/0.5

0.2/0.5

1/1

R1Rt0/0

0/0

Origin of probability propagation

9

Path Decomposition (cont.)• When computing probability of walking from

one target object to another– Combine the forward and backward probabilities

R1

R2

t1

t2

t3

t4

… …

… …

… …… …

… …

… …

… …… …

R1: 0.1/0.25 R2: 0.2/0.1

Probability of walking from O1 to O2 through t1 is 0.1*0.1 = 0.01

10

Similarity 2: Neighborhood Similarity• Find the neighbor tuples of each reference

– Neighbor tuples within L joins

• Weights of neighbor tuples– Different neighbor tuples have different

connection to the reference– Assign each neighbor tuple a weight, which is

the probability of walking from the reference to this tuple

• Similarity: Set resemblance between two sets of neighbor tuples

11

Example of Neighbor Objects

Wei Wangvldb/wangym97Jiong Yang

Richard Muntz

Authors

Publish

Jinze Liu

Haixun Wang

……

0.2

0.2

0.1

0.05

12

Automatically Build A Training Set• Find objects with unique names

– For persons’ names• A person’s name = first name + last name (+middle name)• A rare first name + a rare last name → this name is likely to be

unique• For other types of objects (paper titles, product names, …),

consider the IDF of each term in object names

• Use two references to an object with unique name as a positive example

• Use two references to different objects as a negative example

Johannes Gehrke

13

Selecting Important Join Paths

• Each pair of training objects are connected with certain probability via each join path

• Construct training set: Join path → Feature• Use SVM to learn the weight of each join path

A1 A2

B1 B2

C1 C2

conference

year

coauthor

Join Paths

……

……

……

……

……

……

14

Clustering References• We choose agglomerative hierarchical

clustering because– We do not know number of clusters (real entities)– We only know similarity between references – Equivalent references can be merged into a

cluster, which represents a single entity

15

How to measure similarity between clusters?

• Single-link (highest similarity between points in two clusters) ?– No, because references to different objects can be

connected.

• Complete-link (minimum similarity between them)?– No, because references to the same object may be

weakly connected.

• Average-link (average similarity between points in two clusters)?– A better measure

16

Problem with Average-link

• C2 is close to C1, but their average similarity is low• We use collective random walk probability: Probability of

walking from one cluster to another• Final measure:

Average neighborhood similarity and Collective random walk probability

C1 C2

C3

17

Clustering Procedure• Procedure

– Initialization: Use each reference as a cluster– Keep finding and merging the most similar pair of

clusters– Until no pair of clusters is similar enough

18

Efficient Computation• In agglomerative hierarchical clustering, one needs

to repeatedly compute similarity between clusters– When merging clusters C1 and C2 into C3, we need to

compute the similarity between C3 and any other cluster

– Very expensive when clusters are large

• We invent methods to compute similarity incrementally– Neighborhood similarity

– Random walk probability

19

Experimental Results• Distinguishing references to authors in DBLP

– Accuracy of reference clustering• True positive: Number of pairs of references to same

author in same cluster• False positive: Different authors, in same cluster• False negative: Same author, different clusters• True negative: Different authors, different clusters

Precision = TP/(TP+FP)

Recall = TP/(TP+FN)

f-measure = 2*precision*recall / (precision+recall)

Accuracy = TP/(TP+FP+FN)

20

Accuracy on Synthetic Tests• Select 1000 authors with at least 5 papers

• Merge G (G=2 to 10) authors into one group

• Use DISTINCT to distinguish each group of references

0.3

0.4

0.5

0.6

0.7

1

0.00

5

0.00

1

0.00

02

5.00

E-0

5

1.00

E-0

5

1E-1

0

min-sim

Acc

urac

y

G2

G4

G6

G8

G10

0

0.2

0.4

0.6

0.8

1

0 0.2 0.4 0.6 0.8 1

Recall

Pre

cisi

on

G2

G4

G6

G8

G10

21

0

0.2

0.4

0.6

0.8

2 4 6 8 10

Group size

Max

acc

urac

y

DISTINCTSet resemblance - unsuper.Random walk - unsuper.Combined

0

0.2

0.4

0.6

0.8

2 4 6 8 10

Group size

Max

f-m

easu

reDISTINCTSet resemblance - unsuper.Random walk - unsuper.Combined

Compare with “Existing Approaches”• Random walk and neighborhood similarity have

been used in duplicate detection• We combine them with our clustering approaches

for comparison

22

Real CasesName #author #ref accuracy precision recall f-measure

Hui Fang 3 9 1.0 1.0 1.0 1.0

Ajay Gupta 4 16 1.0 1.0 1.0 1.0

Joseph Hellerstein 2 151 0.81 1.0 0.81 0.895

Rakesh Kumar 2 36 1.0 1.0 1.0 1.0

Michael Wagner 5 29 0.395 1.0 0.395 0.566

Bing Liu 6 89 0.825 1.0 0.825 0.904

Jim Smith 3 19 0.829 0.888 0.926 0.906

Lei Wang 13 55 0.863 0.92 0.932 0.926

Wei Wang 14 141 0.716 0.855 0.814 0.834

Bin Yu 5 44 0.658 1.0 0.658 0.794

average 0.81 0.966 0.836 0.883

23

Real Cases: Comparison

0.4

0.5

0.6

0.7

0.8

0.9

1

accuracy f-measure

DISTINCT

Supervised setresemblance

Supervised randomwalkUnsupervisedcombined measure

Unsupervised setresemblance

Unsupervisedrandom walk

24

Distinguishing Different “Wei Wang”s

UNC-CH (57)

Fudan U, China(31)

UNSW, Australia(19)

SUNY Buffalo

(5)

Beijing Polytech

(3)

NU Singapore

(5)

Harbin UChina

(5)

Zhejiang UChina

(3)

Najing NormalChina

(3)

Ningbo TechChina

(2)

Purdue(2)

Beijing U ComChina(2)

Chongqing UChina

(2)

SUNYBinghamton

(2)

5

6

2

25

0

4

8

12

16

0 50 100 150 200 250 300 350

#references

Tim

e (s

econ

d)

Scalability• Agglomerative hierarchical clustering takes

quadratic time– Because it requires computing pair-wise similarity

26

Thank you!

27

Self-loop Phenomenon• In a relational database, each object connects to

other objects via different join paths– In DBLP: author → paper → conference → …

• An object usually connects to itself more frequently than most other objects of same type– E.g. an author in DBLP usually connects to himself

more frequently than to other authors

Charu Aggarwal TKDE vol16

TKDE vol15

KDD

TKDEassociate

editor

KDD04

Self-loops: connect to herself via proceedings, conferences, areas, …

28

An Experiment• Consider a relational database as a graph

– Use DBLP as an example• Randomly select an author x

– Perform random walks starting from x• With 8 steps

– For each author y, calculate probability of reaching y from x

• Not considering paths like “x→a→ … →a→x”– Rank all authors by the probability

• Perform the above procedure for 500 authors (with at least 10 papers) who are randomly selected from 31608 authors

29

Results• For each selected author x, rank all authors

by random walk probability from x

• Rank of x himselfRank of x within: Portion of authors with such “self-rank”

in the 500 authors

Top 0.1% 62%

Top 1% 86%

Top 10% 99%

Top 50% 100%

30

Results (curves)

0%

20%

40%

60%

80%

100%

0.01% 0.05% 0.1% 0.5% 1% 5% 10% 50%

Self-rank

Port

ion

of tu

ples

|

#path

Random walk

0%

20%

40%

60%

80%

100%

1% 2% 5% 10% 20% 50% 100%

Self-rank

Port

ion

of tu

ples

|

#path

Random walk

DBLP Database CS Dept Database

Documents

1 Distinguishing Objects with Identical Names in Relational Databases