ashwin-dinoriya
Advances in Data Science, Fall 2016
TUTORIAL
INTRODUCTION
FEATURES
INSTALLATION
DEMO
COMPARISON
WHAT IS …
??
• Formerly known as Google Refine
OpenRefine is a power tool for working with messy data, primarily for
• detecting and fixing inconsistencies
• transforming data from one structure or format to another
• extending it with web services and external data
• connecting names within your data to name registries (databases)
Use OpenRefine when you need something ...
• more powerful than a spreadsheet
• more interactive and visual than scripting
• more provisional / exploratory / experimental / playful than a database
• Import data in various formats (e.g. TSV, CSV, Excel (.xls, .xlsx), XML, RDF as XML, JSON)
• Explore datasets in a matter of seconds
• Apply basic and advanced cell transformations
• Deal with cells that contain multiple values
• Create instantaneous links between datasets
• Filter and partition your data easily with regular expressions
• Use named-entity extraction on full-text fields to automatically identify topics
• Perform advanced data operations with the General Refine Expression Language
IMPORTANT FEATURES:
The LendingClub data contains complete loan data for all loans issued during the stated time period, including the current loan status (Current, Late, Fully Paid, etc.) and the latest payment information.
LENDING CLUB LOAN STATS DATA
Our aim is to perform exploratory analysis on given financial data
• Getting the data
• Looking at the data
• Cleansing
• Transforming
• Creating visualizations
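The first four steps above can be sketched outside OpenRefine in a few lines of Python. This is only an illustration under stated assumptions: the inline sample, the column names (`loan_amnt`, `term`, `int_rate`, `loan_status`) and the helper names are hypothetical stand-ins, not the actual LendingClub file layout.

```python
import csv
import io
from collections import Counter

# Hypothetical miniature sample; real column names and values may differ.
SAMPLE_CSV = """loan_amnt,term,int_rate,loan_status
5000, 36 months,10.65%,Fully Paid
2500, 60 months,15.27%,Late
2400, 36 months,15.96%,Current
10000, 36 months,13.49%,Fully Paid
"""

def load_rows(text):
    """Getting the data: parse CSV text into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))

def cleanse(row):
    """Cleansing and transforming: trim whitespace, convert numeric fields."""
    return {
        "loan_amnt": int(row["loan_amnt"]),
        "term_months": int(row["term"].strip().split()[0]),
        "int_rate": float(row["int_rate"].strip().rstrip("%")),
        "loan_status": row["loan_status"].strip(),
    }

rows = [cleanse(r) for r in load_rows(SAMPLE_CSV)]

# Looking at the data: count loans per status, similar to a text facet.
status_counts = Counter(r["loan_status"] for r in rows)
print(rows[0])
print(status_counts.most_common())
```

In OpenRefine the same steps are done interactively; the sketch only shows what each step amounts to on a tiny table.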
STEPS
1 – Getting started with OpenRefine
2 – Analyzing and Fixing Data
3 – Advanced Data Operations
4 – Linking Datasets
5 – Regular Expressions and GREL
TUTORIAL
• Requirements
  • Java JRE installed
• Download
  • OpenRefine is a desktop application. Here’s the link: Google OpenRefine
  • Unlike most other desktop applications, it runs as a small web server on your own computer
  • You point your web browser at that web server in order to use Refine. So, think of Refine as a personal and private web application
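Because OpenRefine runs as a local web server, a quick way to tell whether it is up is to test whether something accepts connections on its address. The sketch below assumes the default address `127.0.0.1` and port `3333`; neither is stated in these slides, so adjust them if your installation differs. The function name `is_listening` is made up for this example.

```python
import socket

def is_listening(host="127.0.0.1", port=3333, timeout=0.5):
    """Return True if a TCP server accepts connections on host:port.

    The default host and port are assumptions about a standard
    OpenRefine setup, not values taken from the slides.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False

if is_listening():
    print("Something is listening; try http://127.0.0.1:3333/ in a browser")
else:
    print("Nothing is listening on that port; OpenRefine may not be running")
```

The check only confirms that some process is listening on the port; it does not verify that the process is OpenRefine.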
HOW TO INSTALL
• Install:
  • Once you have downloaded the .zip file, uncompress it into a folder wherever you want (such as in C:\Google-Refine).
• Run:
  • Run the .exe file in that folder. You should see the Command window in which OpenRefine runs. By default, the Command window has a black background and text in monospace font in it.
• Shut down:
  • When you need to shut down OpenRefine, switch to that Command window, and press Ctrl-C. Wait until there's a message that says the shutdown is complete. That window might close automatically, or you can close it yourself. If you get asked, "Terminate all batch processes? Y/N", just press Y.
INSTALLATION: WINDOWS
• Install:
  • Once you have downloaded the .dmg file, open it, and drag the OpenRefine icon into the Applications folder icon (just like you would normally install Mac applications).
• Run:
  • To launch OpenRefine, go to the Applications folder and double click the OpenRefine app. You'll see the OpenRefine app appear in your dock.
• Shut down:
  • You can switch to the OpenRefine app (clicking on its icon in the dock) and invoke its Quit command.
• If you use Yosemite you will need to install Java for OS X 2014-001 first.
INSTALLATION: MAC
• Install / Run: Once you have downloaded the tar.gz file, open a shell and type
• tar xzf google-refine.tar.gz
• cd google-refine
• ./refine
• This will start OpenRefine and open your browser to its starting page.
• Shut down: Press Ctrl-C in the shell.
INSTALLATION: LINUX
RUN OPENREFINE
• To increase memory: refine.bat /m 4096m
IMPORT DATA
EXPLORING DATA
MANIPULATING COLUMNS
USING THE PROJECT HISTORY
EXPORTING A PROJECT
ANALYZING AND FIXING DATA
WORKING ON THE DATA
• sorting data
• faceting data
• detecting duplicates
• applying a text filter
• using simple cell transformations
• removing matching rows
• splitting data across columns
• adding derived columns
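Several of the operations listed above (faceting, detecting duplicates, applying a text filter, simple cell transformations) have simple equivalents in plain Python, which may help clarify what OpenRefine does behind its menus. The rows, the column names `member_id` and `purpose`, and the helper functions below are all hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical rows; member_id and purpose are illustrative names only.
rows = [
    {"member_id": "A1", "purpose": "car"},
    {"member_id": "B2", "purpose": "credit_card"},
    {"member_id": "A1", "purpose": "car"},
    {"member_id": "C3", "purpose": "Credit_Card "},
]

def text_facet(rows, column):
    """Faceting: count how often each distinct value occurs in a column."""
    return Counter(row[column] for row in rows)

def duplicate_values(rows, column):
    """Detecting duplicates: values that appear more than once."""
    counts = text_facet(rows, column)
    return sorted(value for value, n in counts.items() if n > 1)

def text_filter(rows, column, needle):
    """Text filter: keep rows whose cell contains needle (case-insensitive)."""
    needle = needle.lower()
    return [row for row in rows if needle in row[column].lower()]

def transform(rows, column, func):
    """Simple cell transformation: apply func to every cell in a column."""
    return [{**row, column: func(row[column])} for row in rows]

# Trimming and lower-casing merges "Credit_Card " with "credit_card".
cleaned = transform(rows, "purpose", lambda v: v.strip().lower())
print(text_facet(rows, "purpose"))
print(text_facet(cleaned, "purpose"))
print(duplicate_values(rows, "member_id"))
```

Comparing the facet before and after the transformation shows how an inconsistency such as stray whitespace or mixed case splits one category into two.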
SPECIAL FEATURE
• Regular Expressions and GREL
• Expressions can also be written in Python (Jython) or Clojure
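As a rough sketch of the kind of regular-expression transformation this refers to, the function below pulls the number out of a text cell such as " 36 months". The cell format and the name `parse_term` are assumptions made for this example. When Python is chosen as the expression language in OpenRefine, the cell content is available as `value` and the expression ends with a `return`, so the body of the function is roughly what would go into the expression box.

```python
import re

def parse_term(value):
    """Return the first integer found in a text cell, or None if absent.

    Example input (hypothetical cell format): " 36 months".
    """
    match = re.search(r"\d+", value or "")
    return int(match.group()) if match else None

for cell in [" 36 months", "60 months", "n/a", None]:
    print(repr(cell), "->", parse_term(cell))
```

Returning None for cells without a number keeps bad values visible as blanks instead of raising an error halfway through a column.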
ADDING A RECONCILIATION SERVICE AND RECONCILING WITH LINKED DATA
ADVANCED DATA OPERATIONS
• handling multi-valued cells
• alternating between rows and records mode
• clustering similar cells
• transforming cell values
• adding derived columns
• transposing rows and columns
• installing extensions
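To give an idea of what "clustering similar cells" means, here is a minimal Python sketch of key-collision clustering with a fingerprint key, in the spirit of OpenRefine's fingerprint method. It is a simplified approximation, not OpenRefine's actual implementation, and the sample names are made up.

```python
import re
import unicodedata
from collections import defaultdict

def fingerprint(value):
    """Build a normalized key: trim, lowercase, strip accents and
    punctuation, then sort and de-duplicate the whitespace-separated tokens."""
    text = unicodedata.normalize("NFKD", value.strip().lower())
    text = text.encode("ascii", "ignore").decode("ascii")
    text = re.sub(r"[^\w\s]", "", text)
    tokens = sorted(set(text.split()))
    return " ".join(tokens)

def cluster(values):
    """Group values sharing a fingerprint; keep groups with real variants."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(set(group)) > 1]

# Made-up sample values with typical inconsistencies.
names = ["Lending Club", "lending club ", "Club, Lending", "Prosper"]
clusters = cluster(names)
print(clusters)
```

Each resulting cluster is a candidate for merging into a single spelling; in OpenRefine that choice is left to the user, who reviews the clusters before applying them.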
• Documentation: https://github.com/OpenRefine/OpenRefine/wiki
• YouTube Tutorial: https://www.youtube.com/playlist?list=PL737054C67FCC0741
REFERENCES: