Advances in Data Science Fall 2016 TUTORIAL

OpenRefine Class Tutorial


Page 1: OpenRefine Class Tutorial

Advances in Data Science, Fall 2016

TUTORIAL

Page 2: OpenRefine Class Tutorial

INTRODUCTION

FEATURES

INSTALLATION

DEMO

COMPARISON

Page 3: OpenRefine Class Tutorial

WHAT IS …

??

• Formerly known as Google Refine

OpenRefine is a power tool for working with messy data, primarily for

• detecting and fixing inconsistencies

• transforming data from one structure or format to another

• extending it with web services and external data

• connecting names within your data to name registries (databases)

Use OpenRefine when you need something ...

• more powerful than a spreadsheet

• more interactive and visual than scripting

• more provisional / exploratory / experimental / playful than a database

Page 4: OpenRefine Class Tutorial

IMPORTANT FEATURES:

• Import data in various formats (e.g., TSV, CSV, Excel (.xls, .xlsx), XML, RDF as XML, JSON)

• Explore datasets in a matter of seconds

• Apply basic and advanced cell transformations

• Deal with cells that contain multiple values

• Create instantaneous links between datasets

• Filter and partition your data easily with regular expressions

• Use named-entity extraction on full-text fields to automatically identify topics

• Perform advanced data operations with the General Refine Expression Language (GREL)

Page 5: OpenRefine Class Tutorial

LENDING CLUB LOAN STATS DATA

The LendingClub data contains complete loan data for all loans issued through the time period stated, including the current loan status (Current, Late, Fully Paid, etc.) and the latest payment information.

Our aim is to perform exploratory analysis on the given financial data.

Page 6: OpenRefine Class Tutorial

STEPS

• Getting the data

• Looking at the data

• Cleansing

• Transforming

• Creating visualizations

Page 7: OpenRefine Class Tutorial

TUTORIAL

1 – Getting started with OpenRefine

2 – Analyzing and Fixing Data

3 – Advanced Data Operations

4 – Linking Datasets

5 – Regular Expressions and GREL

Page 8: OpenRefine Class Tutorial

HOW TO INSTALL

• Requirements: Java JRE installed

• Download: OpenRefine is a desktop application. Here’s the link: Google OpenRefine

• Unlike most other desktop applications, it runs as a small web server on your own computer.

• You point your web browser at that web server in order to use Refine, so think of Refine as a personal and private web application.
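Because OpenRefine runs as a small web server on your own machine, one quick way to see whether it is up is to request its start page. The sketch below assumes the default local address http://127.0.0.1:3333/ (change the port if your installation is configured differently); the helper names `refine_url` and `is_running` are made up for this example.

```python
import urllib.request


def refine_url(host="127.0.0.1", port=3333):
    """Build the address of the local OpenRefine start page (assumed default port)."""
    return f"http://{host}:{port}/"


def is_running(url=None, timeout=2.0):
    """Return True if something answers HTTP successfully at the given address."""
    try:
        with urllib.request.urlopen(url or refine_url(), timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except OSError:
        # URLError, refused connections and timeouts are all OSError subclasses.
        return False


if __name__ == "__main__":
    state = "is up" if is_running() else "is not responding"
    print(refine_url(), state)
```

If the check reports that nothing is responding, start OpenRefine first and then open the same address in your browser.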

Page 9: OpenRefine Class Tutorial

INSTALLATION: WINDOWS

• Install: Once you have downloaded the .zip file, uncompress it into a folder wherever you want (such as C:\Google-Refine).

• Run: Run the .exe file in that folder. You should see the Command window in which OpenRefine runs. By default, the Command window has a black background and monospace text.

• Shut down: When you need to shut down OpenRefine, switch to that Command window and press Ctrl-C. Wait until a message says the shutdown is complete. That window might close automatically, or you can close it yourself. If you are asked "Terminate all batch processes? Y/N", just press Y.

Page 10: OpenRefine Class Tutorial

INSTALLATION: MAC

• Install: Once you have downloaded the .dmg file, open it and drag the OpenRefine icon into the Applications folder icon (just as you would normally install Mac applications).

• Run: To launch OpenRefine, go to the Applications folder and double-click the OpenRefine app. You'll see the OpenRefine app appear in your dock.

• Shut down: Switch to the OpenRefine app (by clicking its icon in the dock) and invoke its Quit command.

• If you use Yosemite, you will need to install Java for OS X 2014-001 first.

Page 11: OpenRefine Class Tutorial

INSTALLATION: LINUX

• Install / Run: Once you have downloaded the tar.gz file, open a shell and type

• tar xzf google-refine.tar.gz

• cd google-refine

• ./refine

• This will start OpenRefine and open your browser to its starting page.

• Shut down: Press Ctrl-C in the shell.

Page 12: OpenRefine Class Tutorial

RUN OPENREFINE

• To increase memory: refine.bat /m 4096m

Page 13: OpenRefine Class Tutorial

IMPORT DATA

Page 14: OpenRefine Class Tutorial

EXPLORING DATA

Page 15: OpenRefine Class Tutorial

MANIPULATING COLUMNS

Page 16: OpenRefine Class Tutorial

USING THE PROJECT HISTORY

Page 17: OpenRefine Class Tutorial

EXPORTING A PROJECT

Page 18: OpenRefine Class Tutorial

ANALYZING AND FIXING DATA

Page 19: OpenRefine Class Tutorial

WORKING ON THE DATA

• sorting data

• faceting data

• detecting duplicates

• applying a text filter

• using simple cell transformations

• removing matching rows

• splitting data across columns

• adding derived columns
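To make faceting and duplicate detection concrete, here is a minimal Python sketch of the idea behind them: count the distinct values in a column, then flag values that occur more than once. It illustrates the concept only and is not how OpenRefine implements it; the sample rows and column names are invented for the example.

```python
from collections import Counter

# Hypothetical sample rows; the values are made up for illustration.
rows = [
    {"id": "1", "loan_status": "Current"},
    {"id": "2", "loan_status": "Fully Paid"},
    {"id": "3", "loan_status": "current "},
    {"id": "2", "loan_status": "Fully Paid"},
    {"id": "4", "loan_status": "Late"},
]


def text_facet(rows, column):
    """Count each distinct value in a column, like the value list of a text facet."""
    return Counter(row[column] for row in rows)


def duplicate_values(rows, column):
    """Return the values that occur more than once in a column."""
    counts = text_facet(rows, column)
    return sorted(value for value, n in counts.items() if n > 1)


facet = text_facet(rows, "loan_status")
dupes = duplicate_values(rows, "id")

if __name__ == "__main__":
    for value, n in facet.most_common():
        print(repr(value), n)
    print("duplicated ids:", dupes)
```

The facet counts make inconsistencies visible at a glance: "Current" and "current " show up as two separate values, which is the kind of problem a simple cell transformation (trimming whitespace, changing case) would then fix.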

Page 20: OpenRefine Class Tutorial

SPECIAL FEATURE

• Regular Expressions and GREL

• Expressions can also be written in Python (Jython) or Clojure
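As an illustration of a regular-expression cell transformation, the sketch below is plain Python that can be run outside OpenRefine. The function names and the sample cell formats (" 13.5%", " 36 months") are assumptions made for the example; inside OpenRefine the same logic would be applied to the current cell's value in the expression editor.

```python
import re


def parse_percent(value):
    """Turn a string such as ' 13.5%' into the float 13.5; return None if it does not match."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*%\s*", value)
    return float(match.group(1)) if match else None


def parse_term_months(value):
    """Extract the number of months from a string such as ' 36 months'."""
    match = re.search(r"(\d+)\s*months?", value)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    print(parse_percent(" 13.5%"))
    print(parse_term_months(" 36 months"))
```

Returning None for cells that do not match keeps bad values visible, so they can be isolated with a facet instead of being silently converted.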

Page 21: OpenRefine Class Tutorial

ADDING A RECONCILIATION SERVICE AND RECONCILING WITH LINKED DATA

Page 22: OpenRefine Class Tutorial

ADVANCED DATA OPERATIONS

• handling multi-valued cells

• alternating between rows and records mode

• clustering similar cells

• transforming cell values

• adding derived columns

• transposing rows and columns

• installing extensions
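Clustering similar cells can be pictured as grouping values that reduce to the same normalized key. The Python sketch below is a simplified approximation of that key-collision idea, not OpenRefine's actual implementation; the function names and sample values are invented for the example.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Build a normalized key: ASCII-fold, lowercase, strip punctuation, sort unique tokens."""
    ascii_text = (
        unicodedata.normalize("NFKD", value).encode("ascii", "ignore").decode("ascii")
    )
    cleaned = ascii_text.strip().lower().translate(
        str.maketrans("", "", string.punctuation)
    )
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)


def cluster(values):
    """Group distinct values that share a fingerprint; keep groups with 2+ members."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return [sorted(group) for group in groups.values() if len(group) > 1]


if __name__ == "__main__":
    print(cluster(["Fully Paid", "fully paid", "Paid, Fully", "Late"]))
```

Each resulting group is a candidate for merging into a single spelling, which is the decision OpenRefine's clustering dialog asks you to confirm.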

Page 23: OpenRefine Class Tutorial

REFERENCES:

• Documentation: https://github.com/OpenRefine/OpenRefine/wiki

• YouTube Tutorial: https://www.youtube.com/playlist?list=PL737054C67FCC0741