34
Class Website CX4242: Data Integration Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech

Class Website CX4242 - poloclub.github.io … · Class Website CX4242: Data Integration Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech

  • Upload
    others

  • View
    12

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Class Website CX4242 - poloclub.github.io … · Class Website CX4242: Data Integration Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech

Class Website

CX4242:

Data Integration

Mahdi Roozbahani

Lecturer, Computational Science and

Engineering, Georgia Tech

Page 2: Class Website CX4242 - poloclub.github.io … · Class Website CX4242: Data Integration Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech

What is Data Integration?Combining data from multiple sources to

provide the user with a unified view.

Why is it Important?Think about the apps, websites, and

services that you use every day.

Page 3: Class Website CX4242 - poloclub.github.io … · Class Website CX4242: Data Integration Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech

Businesses derive value

through data integration.

Page 4: Class Website CX4242 - poloclub.github.io … · Class Website CX4242: Data Integration Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech
Page 5: Class Website CX4242 - poloclub.github.io … · Class Website CX4242: Data Integration Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech
Page 6: Class Website CX4242 - poloclub.github.io … · Class Website CX4242: Data Integration Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech
Page 7: Class Website CX4242 - poloclub.github.io … · Class Website CX4242: Data Integration Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech

More Examples?

• Social media (data from users, businesses)

• Facebook: your posts, advertisements, review

• Search engine: Google, Bing, Yahoo, etc.

• Smart assistants: Siri, Cortana, Alexa, Google

• Price comparison: Kayak

• Uber, Lyft: drivers, traffic data, customers

• google maps: users, restaurants, traffic….

8

Page 8: Class Website CX4242 - poloclub.github.io … · Class Website CX4242: Data Integration Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech

How to do data integration?

Page 9: Class Website CX4242 - poloclub.github.io … · Class Website CX4242: Data Integration Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech

“Low” Effort Approaches

1. Use database’s “Join”! (e.g., SQLite)When does this approach work? (Or, when does it NOT work?)

16

id name salary

111 Smith $40k

222 Johnson $60k

333 Lee $50k

id name

111 Smith

222 Johnson

333 Lee

id salary

111 $40k

222 $60k

333 $50k

2. Open Refinehttp://openrefine.org (Video #3 “Reconcile and Match Data”)

Page 10: Class Website CX4242 - poloclub.github.io … · Class Website CX4242: Data Integration Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech

IDs are really important, and

can simplify data integration!

But who creates the IDs?

17

Page 11: Class Website CX4242 - poloclub.github.io … · Class Website CX4242: Data Integration Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech

Crowd-sourcing Approaches: Freebase

18

Freebase intro video: https://youtu.be/TJfrNo3Z-DU

Learn more about Freebase at https://en.wikipedia.org/wiki/Freebase

Page 12: Class Website CX4242 - poloclub.github.io … · Class Website CX4242: Data Integration Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech

Freebase(a graph of entities)

“…a large collaborative knowledge base

consisting of metadata composed mainly

by its community members…”

19

Wikipedia.

Page 13: Class Website CX4242 - poloclub.github.io … · Class Website CX4242: Data Integration Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech

So what? What can you do with the

Freebase knowledge graph?

Hint: Google acquired it in 2010.

20

Page 14: Class Website CX4242 - poloclub.github.io … · Class Website CX4242: Data Integration Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech

Google Knowledge Graph video: https://youtu.be/mmQl6VGvX-c

Page 15: Class Website CX4242 - poloclub.github.io … · Class Website CX4242: Data Integration Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech

Freebase replaced by

Google Knowledge Graph API

23

Example:

What does Google know

about Taylor Swift?

https://developers.google.com/

knowledge-graph/

Page 16: Class Website CX4242 - poloclub.github.io … · Class Website CX4242: Data Integration Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech

24

What does Google know about Taylor Swift?

https://developers.google.com/knowledge-graph/

Page 17: Class Website CX4242 - poloclub.github.io … · Class Website CX4242: Data Integration Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech

What if we don’t have the luxury of having IDs ?

30(Screenshot from FreeBase video)

A common

problem in

academia:

Polo Chau

Duen Horng Chau

Duen Chau

D. Chau

Page 18: Class Website CX4242 - poloclub.github.io … · Class Website CX4242: Data Integration Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech

Entity Resolution(A hard problem in data integration)

31

Then you need to do…

Page 19: Class Website CX4242 - poloclub.github.io … · Class Website CX4242: Data Integration Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech

Why is entity resolution

so difficult?

Let’s understand it through

shopping for an iPhone on

Apple, Amazon and eBay

Page 20: Class Website CX4242 - poloclub.github.io … · Class Website CX4242: Data Integration Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech
Page 21: Class Website CX4242 - poloclub.github.io … · Class Website CX4242: Data Integration Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech
Page 22: Class Website CX4242 - poloclub.github.io … · Class Website CX4242: Data Integration Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech
Page 23: Class Website CX4242 - poloclub.github.io … · Class Website CX4242: Data Integration Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech

D-DupeInteractive Data Deduplication and Integration

TVCG 2008

University of Maryland

Bilgic, Licamele, Getoor, Kang, Shneiderman

36

https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=4035746

https://linqspub.soe.ucsc.edu/basilic/web/Publications/2006/bilgic:vast06/

Page 24: Class Website CX4242 - poloclub.github.io … · Class Website CX4242: Data Integration Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech
Page 25: Class Website CX4242 - poloclub.github.io … · Class Website CX4242: Data Integration Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech

Mahdi

Madhi

Alice

Bob

Carol

Dave

Page 26: Class Website CX4242 - poloclub.github.io … · Class Website CX4242: Data Integration Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech

Core components: Similarity functions

Determine how two entities are similar.

D-Dupe’s approach:

Attribute similarity + relational similarity

39

Similarity score for a pair of entities

Page 27: Class Website CX4242 - poloclub.github.io … · Class Website CX4242: Data Integration Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech

40

Attribute similarity (a weighted sum)

Page 28: Class Website CX4242 - poloclub.github.io … · Class Website CX4242: Data Integration Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech

Properties of Similarity Function

41

Page 29: Class Website CX4242 - poloclub.github.io … · Class Website CX4242: Data Integration Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech

Distance Functions for

Vectors

42

d

d

d

d

d

d

d

Page 30: Class Website CX4242 - poloclub.github.io … · Class Website CX4242: Data Integration Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech

Example

43

Page 31: Class Website CX4242 - poloclub.github.io … · Class Website CX4242: Data Integration Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech

Some problems with Euclidean distance

44

x y

z

d(x,y) and d(x,z) ?

r

Curse of dimensionality

Page 32: Class Website CX4242 - poloclub.github.io … · Class Website CX4242: Data Integration Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech

Numerous similarity functions

• Euclidean distanceEuclidean norm / L2 norm

• TaxiCab/Manhattan distance

• Jaccard Similarity (e.g., used with w-shingles)e.g., overlap of nodes’ #neighbors

• String edit distancee.g., “Mahdi” vs “Madhi”

45

http://infolab.stanford.edu/~ullman/mmds/ch3a.pdfExcellent read:

Page 33: Class Website CX4242 - poloclub.github.io … · Class Website CX4242: Data Integration Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech

46

https://reference.wolfram.com/language/guide/DistanceAndSimilarityMeasures.

html

Page 34: Class Website CX4242 - poloclub.github.io … · Class Website CX4242: Data Integration Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech

Excellent Tutorial on Entity Resolution

http://www.umiacs.umd.edu/~getoor/Tutorials/ER_

KDD2013.pdf

by Lise Getoor and Ashwin Machanavajjhala

47