
Theory & Practice of Data Cleaning: Introduction to OpenRefine


Page 1: Theory & Practice of Data Cleaning: Introduction to OpenRefine

Theory & Practice of Data Cleaning

Introduction to OpenRefine

Page 2: Theory & Practice of Data Cleaning: Introduction to OpenRefine

A First Look at OpenRefine

• Creating a New Project
• Basic Normalization
• Different Facets (text, timeline, scatterplot)
• Clustering and Mass Edits
• Operation History: Provenance
• Separate videos:
  – Installing OpenRefine
  – Advanced Operations

Page 3: Theory & Practice of Data Cleaning: Introduction to OpenRefine

OpenRefine Overview

• OpenRefine is a power tool for data “wrangling”, specifically:
  – for getting an overview of the data (exploring and “profiling”)
  – for detecting and cleaning certain data errors
  – for transforming and linking data
• History:
  – Freebase Gridworks … Google Refine … OpenRefine

Page 4: Theory & Practice of Data Cleaning: Introduction to OpenRefine

Dataset Examples

• Working with two datasets:
  – USDA Directory of Farmers Markets
    • smaller, more curated (?) data
  – New York Public Library collection of historic restaurant menus
    • very “messy”, crowd-sourced data

Page 5: Theory & Practice of Data Cleaning: Introduction to OpenRefine

Example: USDA Farmers Market Data


Page 6: Theory & Practice of Data Cleaning: Introduction to OpenRefine


Page 7: Theory & Practice of Data Cleaning: Introduction to OpenRefine


Page 8: Theory & Practice of Data Cleaning: Introduction to OpenRefine

OpenRefine: Create Project


Page 9: Theory & Practice of Data Cleaning: Introduction to OpenRefine

Importing Data …


Page 10: Theory & Practice of Data Cleaning: Introduction to OpenRefine

Voilà! 8664 rows imported ...


Page 11: Theory & Practice of Data Cleaning: Introduction to OpenRefine

The Text Facet “workhorse” …


Now hit “Cluster” …
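What a text facet computes can be sketched in a few lines of Python: it lists each distinct value in a column together with how often it occurs. This is only an illustration of the idea, not OpenRefine's implementation; the function name `text_facet` and the sample market names are hypothetical.

```python
from collections import Counter


def text_facet(values):
    """Return (value, count) pairs for each distinct value, most frequent first."""
    counts = Counter(values)
    # Sort by descending count, then by value, like a facet sorted by count.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical sample values; not taken from the USDA file.
market_names = [
    "Downtown Farmers Market",
    "downtown farmers market",
    "Downtown Farmers Market",
    "Riverside Market",
]

facet = text_facet(market_names)
for value, count in facet:
    print(f"{count:3d}  {value}")
```

Near-duplicates such as the two spellings of the first market show up as separate facet choices, which is exactly what clustering then tries to reconcile.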

Page 12: Theory & Practice of Data Cleaning: Introduction to OpenRefine

… and the magic happens!


Page 13: Theory & Practice of Data Cleaning: Introduction to OpenRefine

… select some (all!?) clusters and merge ...


Page 14: Theory & Practice of Data Cleaning: Introduction to OpenRefine

... resulting in a mass edit …

… also reduced the choices from 8095 to 7846 …

Page 15: Theory & Practice of Data Cleaning: Introduction to OpenRefine

… (in this case): Done with Normalization of MarketName column


Page 16: Theory & Practice of Data Cleaning: Introduction to OpenRefine

Undo/Redo: Operation History (Provenance)
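The idea behind an operation history can be illustrated with a toy Python class that records every mass edit and can roll it back. This is a sketch under simplifying assumptions (full snapshots, a single `mass_edit` operation); the class `Project` and its methods are hypothetical and do not mirror OpenRefine's internals.

```python
class Project:
    """Toy table with an operation history (a sketch, not OpenRefine's data model)."""

    def __init__(self, rows):
        self.rows = [dict(row) for row in rows]
        self.history = []  # list of (description, snapshot taken before the edit)

    def mass_edit(self, column, variants, target):
        """Replace every cell whose value is in `variants` with `target`."""
        snapshot = [dict(row) for row in self.rows]
        changed = 0
        for row in self.rows:
            if row[column] in variants and row[column] != target:
                row[column] = target
                changed += 1
        self.history.append((f"Mass edit {changed} cells in column {column}", snapshot))
        return changed

    def undo(self):
        """Roll back the most recent operation and return its description."""
        description, snapshot = self.history.pop()
        self.rows = snapshot
        return description


# Hypothetical sample rows for illustration.
project = Project([
    {"MarketName": "Main St Market"},
    {"MarketName": "Main St. Market"},
    {"MarketName": "Main St Market"},
])
changed = project.mass_edit("MarketName", {"Main St. Market"}, "Main St Market")
print(changed, [step for step, _ in project.history])
```

Because each step is recorded with a description, the history doubles as provenance: it documents which cleaning steps produced the current state of the data.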


Page 17: Theory & Practice of Data Cleaning: Introduction to OpenRefine

More Data Profiling: Timeline Facet


Page 18: Theory & Practice of Data Cleaning: Introduction to OpenRefine

Timeline facet: hmm ... not working!?


Page 19: Theory & Practice of Data Cleaning: Introduction to OpenRefine

Converting from String to Date!
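A timeline facet needs real date values, not text that merely looks like a date. In OpenRefine this conversion is available as a common cell transform (GREL `value.toDate()`). The Python sketch below shows the same idea; the list of candidate formats and the helper name `to_date` are assumptions, since the slides do not say how the dates in the file are formatted.

```python
from datetime import datetime

# Assumed input formats; the real column may use others.
CANDIDATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%Y")


def to_date(text):
    """Parse a date string, returning None for blanks and unparseable cells."""
    if text is None:
        return None
    cleaned = text.strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt)
        except ValueError:
            continue  # try the next candidate format
    return None


# Hypothetical cell values for illustration.
for cell in ["2012-06-01", " 6/1/2012 ", "2009", "n/a", ""]:
    print(repr(cell), "->", to_date(cell))
```

Cells that return `None` here correspond to the non-time and blank values a timeline facet reports separately, which is how missing data becomes visible.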


Page 20: Theory & Practice of Data Cleaning: Introduction to OpenRefine

.. now we’re in business!


Page 21: Theory & Practice of Data Cleaning: Introduction to OpenRefine

Exploring time slices: missing data …


Page 22: Theory & Practice of Data Cleaning: Introduction to OpenRefine

… and slices with detailed data!


Page 23: Theory & Practice of Data Cleaning: Introduction to OpenRefine

Converting from String to Number …


Page 24: Theory & Practice of Data Cleaning: Introduction to OpenRefine

... from String to Number: Done!
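The string-to-number step (in OpenRefine a common cell transform, GREL `value.toNumber()`) can be sketched in Python as follows. The helper name `to_number` and the sample coordinate strings are hypothetical; the point is only that blanks and non-numeric text must be handled explicitly rather than silently coerced.

```python
def to_number(text):
    """Return a float for numeric text, or None for blanks and non-numeric cells."""
    if text is None:
        return None
    cleaned = text.strip()
    if not cleaned:
        return None
    try:
        return float(cleaned)
    except ValueError:
        return None


# Hypothetical longitude/latitude cells for illustration.
raw_points = [(" -77.0369 ", "38.9072"), ("unknown", "40.7128"), ("", "")]
points = [(to_number(x), to_number(y)) for x, y in raw_points]
# Keep only rows where both coordinates are numeric, as a scatterplot would need.
plottable = [p for p in points if None not in p]
print(plottable)
```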


Page 25: Theory & Practice of Data Cleaning: Introduction to OpenRefine

The x,y (longitude,latitude) data lets us use a “gem”: Scatterplot Facet!


Page 26: Theory & Practice of Data Cleaning: Introduction to OpenRefine

Georeferenced data & (Google) maps!


Page 27: Theory & Practice of Data Cleaning: Introduction to OpenRefine

Dealing with more messy and more complex data issues …

… The NYPL Menus Project!


Page 28: Theory & Practice of Data Cleaning: Introduction to OpenRefine

Example: NYPL “Menu” Collection


Page 29: Theory & Practice of Data Cleaning: Introduction to OpenRefine


Page 30: Theory & Practice of Data Cleaning: Introduction to OpenRefine

Unpacking and selecting Menu.csv …

Page 31: Theory & Practice of Data Cleaning: Introduction to OpenRefine

Complex Data Quality Issues: To fix or not to fix?

Figure: the same restaurant styled differently in 1907 (left) vs. 1916 (right)

• Relying on volunteer transcription will often result in inconsistent data entry
• Even well-transcribed data is subject to challenges due to synonyms and spelling variants, etc.
• Also: entities change over time …
  e.g., Childs’ restaurant, originally launched by brothers Samuel and William Childs in 1889, grew to be one of the first national dining chains and dropped its apostrophe sometime after 1907.

Page 32: Theory & Practice of Data Cleaning: Introduction to OpenRefine

Basic Normalization

Page 33: Theory & Practice of Data Cleaning: Introduction to OpenRefine

Faceting and Clustering

Page 34: Theory & Practice of Data Cleaning: Introduction to OpenRefine

Faceting and Clustering

Page 35: Theory & Practice of Data Cleaning: Introduction to OpenRefine

Kinds of Clustering

• Key collision (fastest, safest)
  – Fingerprint, N-gram Fingerprint = defaults
    • Match normalized strings in different ways
  – Metaphone = English pronunciation
• Nearest Neighbor
  – PPM = Prediction by Partial Matching
  – Levenshtein = edit distance
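Key-collision clustering can be sketched as follows: compute a normalized key for every value and group the values whose keys collide. The `fingerprint` function below is a simplified approximation of the fingerprint method (trim, lowercase, fold accents, drop punctuation, sort and de-duplicate tokens); OpenRefine's own implementation differs in details, and the sample names are hypothetical.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Simplified fingerprint key: normalized, de-duplicated, sorted tokens."""
    text = value.strip().lower()
    # Fold accented characters to their ASCII base letters.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # Drop punctuation, keeping only word characters and whitespace.
    text = re.sub(r"[^\w\s]", "", text)
    return " ".join(sorted(set(text.split())))


def key_collision_clusters(values):
    """Group distinct values that share a key; keep groups with 2+ members."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return [sorted(group) for group in groups.values() if len(group) > 1]


# Hypothetical sample values for illustration.
names = ["Café Martin", "cafe martin", "Martin, Cafe",
         "Delmonico's", "Delmonicos", "Waldorf Astoria"]
clusters = key_collision_clusters(names)
for cluster in clusters:
    print(cluster)
```

Each value is keyed once and grouped through a dictionary, so the cost grows roughly linearly with the number of values. Nearest-neighbor methods such as Levenshtein compare pairs of values instead, which is considerably more expensive.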

Page 36: Theory & Practice of Data Cleaning: Introduction to OpenRefine

But beware: Clustering Caveat!

Hotel Savoy
59th St. & 5th Ave.
New York, New York

Savoy Hotel
Strand
London WC2R 0EU
United Kingdom
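The caveat can be reproduced with a token-sorted key: both names collapse to the same key even though they denote different hotels. The sketch below uses the two names and cities from the slide; the `token_key` helper and the idea of checking a second column before merging are illustrative assumptions, not an OpenRefine feature.

```python
def token_key(name):
    """Order-insensitive key: lowercase, de-duplicated, sorted tokens."""
    return " ".join(sorted(set(name.lower().split())))


records = [
    {"name": "Hotel Savoy", "place": "New York"},
    {"name": "Savoy Hotel", "place": "London"},
]

# The names collide under the key, so a clustering tool would propose a merge.
same_key = token_key(records[0]["name"]) == token_key(records[1]["name"])
# A second attribute shows that they are different entities.
same_place = records[0]["place"] == records[1]["place"]
safe_to_merge = same_key and same_place

print("proposed by clustering:", same_key)
print("safe to merge:", safe_to_merge)
```

Proposed clusters are suggestions to review, not edits to accept wholesale: context from other columns can show that two matching names are different entities.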

Page 37: Theory & Practice of Data Cleaning: Introduction to OpenRefine

Summary: A First Look at OpenRefine

• Creating a New Project
• Basic Normalization
• Different Facets (text, timeline, scatterplot)
• Clustering and Mass Edits
• Operation History: Provenance
• Separate videos:
  – Installing OpenRefine
  – Advanced Operations