Record Linkage: A 10-Year Retrospective, Chen Li and Sharad Mehrotra, UC Irvine

Record Linkage: A 10-Year Retrospective

Chen Li and Sharad Mehrotra

UC Irvine

Efficient Record Linkage in Large Data Sets

Liang Jin, Chen Li, Sharad Mehrotra

University of California, Irvine

DASFAA, Kyoto, Japan, March 2003

How was the paper written?

Two faculty members working in different areas, plus a 1st-year PhD student

Chen’s Story: 2001 …

Data Integration Problems?

Talking to medical doctors…

Example

Table R

Name           SSN           Addr
Jack Lemmon    430-871-8294  Maple St
Harrison Ford  292-918-2913  Culver Blvd
Tom Hanks      234-762-1234  Main St
…              …             …

Table S

Name           SSN           Addr
Ton Hanks      234-162-1234  Main Street
Kevin Spacey   928-184-2813  Frost Blvd
Jack Lemon     430-817-8294  Maple Street
…              …             …

Q: Find records from different datasets that could be the same entity

Sharad’s research

Liang’s story

1st-year PhD student at UC Irvine

Challenges

How to define good similarity functions?

How to do matching efficiently?

Nested-loop?

Not desirable for large data sets: 5 hours for 30K strings!
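To make the cost concrete, here is a minimal sketch of a nested-loop join under edit distance. The function names and the threshold k=1 are illustrative assumptions, not taken from the paper; the names come from Tables R and S in the example. Every string in one list is compared with every string in the other, so the number of distance computations grows with the product of the two list sizes.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard dynamic program (two rows)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # delete ca
                            curr[j - 1] + 1,       # insert cb
                            prev[j - 1] + cost))   # substitute or match
        prev = curr
    return prev[-1]


def nested_loop_join(r, s, k):
    """Compare all pairs: len(r) * len(s) edit-distance computations."""
    return [(x, y) for x in r for y in s if edit_distance(x, y) <= k]


# Names from Tables R and S; k=1 is an assumed threshold for illustration.
names_r = ["Jack Lemmon", "Harrison Ford", "Tom Hanks"]
names_s = ["Ton Hanks", "Kevin Spacey", "Jack Lemon"]
matches = nested_loop_join(names_r, names_s, k=1)
print(matches)
```

With two lists of n strings each, this performs n * n distance computations, which is why the approach does not scale.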

Our 2-step approach

Step 1: map strings (in a metric space) to objects in a Euclidean space

Step 2: do a similarity join in the Euclidean space
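The two steps can be sketched as follows. This is an illustration under stated assumptions, not the paper's algorithm: the mapping here is a simple pivot-based embedding (coordinate i is the edit distance to pivot i), the pivots are chosen by hand, and the Euclidean join is a brute-force scan standing in for a real multidimensional join. All function names are hypothetical. Because edit distance is a metric, each coordinate of a true match differs by at most k, so a Euclidean threshold of k * sqrt(d) misses no true pair; candidates are then verified with the real distance.

```python
import math
from itertools import product


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (two-row dynamic program)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def embed(s, pivots):
    """Step 1 (assumed mapping): one coordinate per pivot string."""
    return tuple(edit_distance(s, p) for p in pivots)


def two_step_join(r, s, k, pivots):
    # Triangle inequality: |d(x,p) - d(y,p)| <= d(x,y) for every pivot p,
    # so a pair within edit distance k lies within k * sqrt(d) in the embedding.
    bound = k * math.sqrt(len(pivots))
    pts_r = [(x, embed(x, pivots)) for x in r]
    pts_s = [(y, embed(y, pivots)) for y in s]
    # Step 2: similarity join in the Euclidean space (brute force here).
    candidates = [(x, y) for (x, px), (y, py) in product(pts_r, pts_s)
                  if math.dist(px, py) <= bound]
    # Verify candidates with the original metric to drop false positives.
    return [(x, y) for x, y in candidates if edit_distance(x, y) <= k]


names_r = ["Jack Lemmon", "Harrison Ford", "Tom Hanks"]
names_s = ["Ton Hanks", "Kevin Spacey", "Jack Lemon"]
pivots = ["Tom Hanks", "Kevin Spacey"]  # hand-picked, illustrative only
print(two_step_join(names_r, names_s, k=1, pivots=pivots))
```

The brute-force scan in step 2 still touches every pair; the benefit only appears once that scan is replaced by an indexed join over the Euclidean points.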

Advantages

Applicable to many metric similarity functions
- E.g.: Edit distance

Open to existing algorithms
- Mapping techniques
- Join techniques

Step 1

Map strings into a high-dimensional Euclidean space

(Figure: Metric Space mapped to Euclidean Space)

Can it preserve distances?

Use data set 1 (54K names) as an example

k=2, d=20
- Use k’=5.2 to differentiate similar and dissimilar pairs.

Multi-attribute linkage

Example: title + name + year

Different attributes have different similarity functions and thresholds

Consider merge rules in disjunctive format:
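The slide's actual rule is not reproduced in this transcript. As a purely hypothetical illustration of the idea, a disjunctive merge rule can be written as a list of conjunctions, where each conjunct names an attribute, a distance function, and a threshold, and two records match if any conjunction holds. The rules, thresholds, distance functions, and sample records below are all assumptions made for this sketch.

```python
from difflib import SequenceMatcher


def string_distance(a: str, b: str) -> float:
    """Assumed string measure: 1 minus difflib's similarity ratio."""
    return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()


def year_distance(a: int, b: int) -> int:
    return abs(a - b)


# Hypothetical rule set in disjunctive form:
#   (title close AND name close) OR (title very close AND same year)
RULES = [
    [("title", string_distance, 0.2), ("name", string_distance, 0.2)],
    [("title", string_distance, 0.1), ("year", year_distance, 0)],
]


def same_entity(rec1, rec2, rules=RULES):
    """True if at least one conjunction is satisfied on every attribute it names."""
    return any(
        all(dist(rec1[attr], rec2[attr]) <= threshold
            for attr, dist, threshold in rule)
        for rule in rules
    )


# Illustrative records only.
a = {"title": "Forrest Gump", "name": "Tom Hanks", "year": 1994}
b = {"title": "Forrest Gump", "name": "Ton Hanks", "year": 1995}
c = {"title": "Apollo 13", "name": "Tom Hanks", "year": 1995}
d = {"title": "Forrest Gump", "name": "Hanks, T.", "year": 1994}
print(same_entity(a, b), same_entity(a, c), same_entity(a, d))
```

Here a and b match through the first conjunction, a and d through the second, and a and c through neither.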

Secret of the paper …

Work since then …

Chen: efficiency

Sharad: quality

Chen’s Work on Efficiency

Gram-based algorithms
- Indexing
- Selection algorithms
- Join algorithms
- Variable-length grams
- Selectivity estimation

Trie-based algorithms
- Instant search
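As a rough illustration of the gram-based idea, the sketch below builds an inverted index from q-grams to strings and uses the well-known count filter for a selection query: two strings within edit distance k must share at least max(|s|, |t|) - q + 1 - k*q grams (unpadded, counted as bags), since one edit operation can destroy at most q grams. The function names and the choice q=2 are assumptions for this example and do not describe any particular package's API. Surviving candidates would still need to be verified with the actual edit distance.

```python
from collections import Counter, defaultdict


def qgrams(s, q=2):
    """All overlapping substrings of length q (no padding)."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def build_index(strings, q=2):
    """Inverted index: gram -> list of (string id, occurrences in that string)."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        for gram, cnt in Counter(qgrams(s, q)).items():
            index[gram].append((sid, cnt))
    return index


def candidates(query, strings, index, k, q=2):
    """Strings that pass the count filter for edit-distance threshold k."""
    shared = Counter()
    for gram, qcnt in Counter(qgrams(query, q)).items():
        for sid, cnt in index.get(gram, []):
            shared[sid] += min(qcnt, cnt)  # bag intersection
    result = []
    for sid, s in enumerate(strings):
        # Lower bound on shared grams for any pair within distance k.
        needed = max(len(query), len(s)) - q + 1 - k * q
        if shared[sid] >= needed:
            result.append(s)
    return result


names = ["Jack Lemmon", "Harrison Ford", "Tom Hanks", "Kevin Spacey"]
index = build_index(names)
print(candidates("Ton Hanks", names, index, k=1))
```

The final loop scans every string so that the filter stays correct when the bound drops to zero or below; a practical implementation would avoid that scan by merging the inverted lists directly.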

The Flamingo Package

http://flamingo.ics.uci.edu/

Follow-up work in the community

Significant amount of work on approximate string queries
- Selection
- Join

Make an impact?

UCI People Search

Psearch (2008): 2 stories

Fuzzy search

www.omniplaces.com

Location-based search

Research commercialization

Lesson learned: Hands-on experience is important!
