Beyond Kaggle: Solving Data Science Challenges at Scale

DRAFT

Think Big, Start Smart, Scale Fast

Data Matching and Deduplication using Dato Toolkits

Dato Conference, July 21st, 2015

Guillermo Breto Rangel, PhD

Entity Resolution: Multiple Definitions

Entity Resolution (ER)

Extract, match and disambiguate entity records in data.

Entity Resolution: Real World Entity

Matching real world entities with profiles, mentions...

You:

◆ Facebook account(s)

◆ LinkedIn profile(s)

◆ Tweets

◆ Google searches

Many records → ER → Unique identities

Entity Resolution: Use Cases

◆ Network Analysis

◆ Vocabulary Normalization: different organizations report different names for the same entities

◆ Network Security: Finding user actions/intents

◆ Data Cleaning: removing duplicated records

◆ Metadata enrichment: when records are matched, their metadata is appended to the entity

Entity Resolution: Challenges

◆ Missing Values

◆ Data entry errors

◆ Abbreviations and formatting

◆ Data volume

◆ Variety of raw data sources

o free text, semi-structured, streaming

◆ Data integration from multiple sources

◆ Preprocessing

◆ Normalization

◆ Choosing similarity metrics

Dataset: DBpedia/Amazon-Google Products

◆ Putting a schema to Wikipedia

◆ Crowd-sourced community project

◆ Queries against Wikipedia data

◆ Match data sets on the Web to Wikipedia data

◆ A set of triples → <dbpedia:Luc_Besson> <dbpedia-owl:spouse> <dbpedia:Milla_Jovovich>

Matching Amazon Products and Google Products

Deich Library and

Preprocessing: Steps

1) Extract tokens

2) Clean triplets

3) Pivot table

4) Select relevant features

5) Normalization

6) Choosing similarity metrics
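
The first five steps can be sketched in plain Python. This is an illustration under stated assumptions, not the code used for the talk: the triples other than the Luc_Besson/spouse one are illustrative, the helper names (clean, pivot, normalize, tokenize, preprocess) are hypothetical, and tokenization is applied after normalization for simplicity.

```python
import re
from collections import defaultdict

# Illustrative triples; only the first one appears in the slides.
RAW_TRIPLES = [
    ("<dbpedia:Luc_Besson>", "<dbpedia-owl:spouse>", "<dbpedia:Milla_Jovovich>"),
    ("<dbpedia:Luc_Besson>", "<dbpedia-owl:birthPlace>", "<dbpedia:Paris>"),
    ("<dbpedia:Milla_Jovovich>", "<dbpedia-owl:birthPlace>", "<dbpedia:Kiev>"),
]

def clean(term):
    """Clean a triple term: drop the angle brackets and the namespace prefix."""
    return term.strip("<>").split(":", 1)[-1]

def pivot(triples):
    """Pivot (subject, predicate, object) rows into one record per subject."""
    table = defaultdict(dict)
    for subject, predicate, obj in triples:
        table[clean(subject)][clean(predicate)] = clean(obj)
    return dict(table)

def normalize(value):
    """Lowercase, turn punctuation and underscores into spaces, collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", value.lower()).split())

def tokenize(value):
    """Split a normalized string into a set of word tokens."""
    return set(normalize(value).split())

def preprocess(triples, features):
    """Pivot, keep only the selected features, then normalize and tokenize them."""
    records = {}
    for entity, row in pivot(triples).items():
        # A missing feature becomes an empty token set.
        records[entity] = {f: tokenize(row.get(f, "")) for f in features}
    return records

records = preprocess(RAW_TRIPLES, features=["spouse", "birthPlace"])
for entity, row in records.items():
    print(entity, row)
```

Representing a missing feature as an empty token set keeps every record the same shape, which makes the later distance computation simpler; how missing values should actually be treated depends on the data.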

Algorithm: Nearest Neighbors

● The entity resolution problem is approached as a network problem

○ Nodes: entity records

○ Edges: similarity measures

● Define a distance between entities to find the nearest neighbors. Composite distances can be built from Euclidean, squared Euclidean, Levenshtein, Jaccard, Manhattan, cosine, and dot product distances

● Compute the distance between all entities and find the nearest neighbors

● Duplicates are the connected components of the graph; each component is labeled as one entity

● Some parameters to keep in mind are:

○ Grouping_features

○ k (number of neighbors to compare)

○ Radius (the distance threshold)
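
The approach above can be sketched in plain Python with a brute-force all-pairs comparison. This is a minimal illustration, not the Dato toolkit implementation: the product records, the field names (name, description), the equal weights, and the default k and radius are all assumptions, and the Grouping_features parameter is not modeled.

```python
def jaccard(a, b):
    """Jaccard distance between two token collections (0 = identical, 1 = disjoint)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

def levenshtein(s, t):
    """Edit distance between two strings, computed one row at a time."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (cs != ct)))
        prev = curr
    return prev[-1]

def composite(r1, r2, w_name=0.5, w_desc=0.5):
    """Weighted sum of a length-normalized Levenshtein distance on 'name'
    and a Jaccard distance on the 'description' tokens."""
    longest = max(len(r1["name"]), len(r2["name"]), 1)
    d_name = levenshtein(r1["name"], r2["name"]) / longest
    d_desc = jaccard(r1["description"].split(), r2["description"].split())
    return w_name * d_name + w_desc * d_desc

def deduplicate(records, k=2, radius=0.3):
    """Link each record to its k nearest neighbors within radius,
    then return one connected-component label per record."""
    n = len(records)
    parent = list(range(n))  # union-find forest over record indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        neighbors = sorted(
            (composite(records[i], records[j]), j) for j in range(n) if j != i
        )
        for dist, j in neighbors[:k]:
            if dist <= radius:
                parent[find(i)] = find(j)  # add an edge: merge the two components
    return [find(i) for i in range(n)]

# Hypothetical product records, for illustration only.
records = [
    {"name": "apple ipod nano 8gb", "description": "portable music player silver"},
    {"name": "apple ipod nano 8 gb", "description": "silver portable music player"},
    {"name": "canon powershot camera", "description": "digital camera 10 megapixel"},
]
labels = deduplicate(records)
print(labels)  # records sharing a label are treated as the same entity
```

The all-pairs loop is quadratic in the number of records, so it only suits small inputs; at scale, some form of blocking or an approximate nearest-neighbor index would be needed to limit the comparisons.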

Results:

The benchmark results can be found at:

https://github.com/cubreto/dataDeduplication

Lessons Learned:

◆ Most of the time is spent on preprocessing

◆ Hard to define the distance threshold

◆ Weighting the composite distance

◆ Data volume

◆ Dealing with missing values

◆ Tuning the parameters

◆ Finding exact matches

Some Resources/Bibliography

◆ Ricardo Vasquez Sierra, PhD: Senior Data Scientist from Ooyala

◆ Kevin Glynn, MS: Data Scientist and Khan Academy Instructor

◆ Vince Gonzalez: MapR Software Engineer

◆ Alexey Svyatkovskiy, PhD: Big Data Scientist, Princeton University

◆ Ashwin Machanavajjhala, PhD: Professor of Computer Science, Duke University

◆ Lise Getoor, PhD: Professor of Computer Science, UC Santa Cruz

o KDD Tutorial on Entity Resolution in Big Data. http://www.cs.umd.edu/~getoor/Tutorials/ER_KDD2013.pdf

o Deduplication and Group Detection using Links, Indrajit Bhattacharya and Lise Getoor, The 10th ACM SIGKDD Workshop on Link Analysis and Group Detection (LinkKDD-04). http://linqs.cs.umd.edu/basilic/web/Publications/2004/bhattacharya:kdd04-wkshp/

o Collective Entity Resolution in Relational Data, Indrajit Bhattacharya and Lise Getoor, ACM Transactions on Knowledge Discovery from Data (ACM-TKDD), 2007. http://linqs.cs.umd.edu/basilic/web/Publications/2007/bhattacharya:tkdd07/

◆ The Dato Team

◆ My colleagues at Think Big
