Intro to OpenRefine: An overview & walkthrough to get you started.

TXDHC OpenRefine Training


intro/overview (15 min)
walkthrough (45 min)
intro to advanced (10 min)
q&a (20 min)

http://www.txdhc.org/txdhc-training-webcast-materials/

Jennifer Hecker, Liz Grumbach

“a tool for working with messy data”

Cleaning up data that:
is in a simple tabular format
is inconsistently formatted
has inconsistent terminology

get an overview of a data set
resolve inconsistencies
split data up into more granular parts
match local data up to other data sets
enhance a data set with data from other sources
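The first two tasks above (getting an overview and resolving inconsistencies) can be sketched in a few lines of Python. This is only an illustration of the ideas behind a text facet and key-collision clustering, not OpenRefine's own code; the function names and the sample publisher values are made up for the example.

```python
import re
import unicodedata
from collections import Counter, defaultdict


def facet_counts(values):
    """Count how often each distinct value occurs, much like a text facet."""
    return Counter(values)


def fingerprint(value):
    """Build a normalised key so that near-duplicate spellings collide."""
    # Strip accents and anything else that is not plain ASCII.
    ascii_text = (
        unicodedata.normalize("NFKD", value)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Trim, lowercase, and drop punctuation.
    cleaned = re.sub(r"[^\w\s]", "", ascii_text.strip().lower())
    # Sort and de-duplicate the words so that word order does not matter.
    return " ".join(sorted(set(cleaned.split())))


def cluster(values):
    """Group values that share a fingerprint; keep groups with 2+ spellings."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return [sorted(group) for group in groups.values() if len(group) > 1]


# Hypothetical messy column, invented for this sketch.
publishers = [
    "Oxford Univ. Press",
    "oxford univ press",
    "Press, Oxford Univ",
    "Penguin",
    "Penguin",
]

print(facet_counts(publishers))
print(cluster(publishers))
```

The facet shows which values are most common; the clusters show spellings that probably mean the same thing and could be merged into one.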

https://cms-assets.tutsplus.com/uploads/users/199/posts/20843/image/text-facet-openrefine.png

https://cms-assets.tutsplus.com/uploads/users/199/posts/20843/image/clustering-openrefine.png


Freebase Gridworks = Google Refine = OpenRefine = Refine

…ask some questions about your data set: What type of data is it & what format is it in?

What’s the size of your data set?

What question do you want to ask your data?

What do you need to do to find the answer?

Excel: familiarity, better for data entry, cut-and-paste operations, no paging to navigate

Google Spreadsheets: similar to Excel, can get external data relatively easily, easy to collaborate and share

Google Fusion Tables: if you just want to filter, easy to share

Text editor: a powerful text editor can do many things

Unix tools: more challenging to use, but quick, and some things (finding things, sorting) are easy

Writing code: most sophisticated and most to learn!

<And now Liz attempts the dangerous LIVE DEMO!>

Regular expressions: “wildcards on steroids” that allow for more granular data manipulation (http://www.regular-expressions.info)
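As a small illustration of what a regular expression can do that a simple wildcard cannot, here is a Python sketch that rewrites dates into a single format. The pattern, the function name, and the assumption that the dates are written month/day/year are all invented for this example.

```python
import re

# One or two digits, a separator (/ . or -), one or two digits,
# the same kind of separator, then a four-digit year.
DATE_PATTERN = re.compile(r"^\s*(\d{1,2})[/.-](\d{1,2})[/.-](\d{4})\s*$")


def normalise_date(value):
    """Rewrite month/day/year dates as YYYY-MM-DD; leave other values alone."""
    match = DATE_PATTERN.match(value)
    if match is None:
        return value  # not a date we recognise, so do not touch it
    month, day, year = match.groups()  # assumption: month comes first
    return f"{year}-{int(month):02d}-{int(day):02d}"


for raw in ["3/7/1998", "12-25-2001", "circa 1900"]:
    print(raw, "->", normalise_date(raw))
```

Values that do not match the pattern pass through unchanged, which makes it easy to spot the ones that still need attention.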

Transformations using General Refine Expression Language (GREL): kind of like a formula in Excel
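A transformation is a small expression applied to every cell in a column. The Python sketch below shows the same idea; the comments name the GREL expressions it roughly corresponds to, and the column values are made up for illustration. Python's title-casing may differ from GREL's in edge cases.

```python
def tidy_name(value):
    """Trim surrounding spaces and title-case the text.

    Roughly what the GREL expression value.trim().toTitlecase() does.
    """
    return value.strip().title()


def surname(value):
    """Take the part before the first ", " in a "Surname, Forename" cell.

    Roughly what the GREL expression value.split(", ")[0] does.
    """
    return value.split(", ")[0]


# Hypothetical column values, invented for this sketch.
column = ["  jane AUSTEN ", "charles dickens"]
print([tidy_name(cell) for cell in column])
print(surname("Austen, Jane"))
```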

Retrieve data from online sources: for example, use names to retrieve birth/death dates from the Virtual International Authority File (VIAF)

Match data to external data sources using extensions for RDF, DBpedia, Named-Entity Recognition (NER), etc.

And ‘reconciliation’ services

Use the ‘cross’ function to compare the contents of two Refine projects, or share data between the two projects.
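Conceptually, ‘cross’ is a lookup: take a value from one project, find the rows in another project whose key column holds the same value, and pull data from those rows. The Python sketch below imitates that idea with two small lists standing in for the two projects. The helper names and sample rows are hypothetical, and this is not how OpenRefine implements it.

```python
from collections import defaultdict


def build_index(rows, key_column):
    """Map each value in key_column to the list of rows that contain it."""
    index = defaultdict(list)
    for row in rows:
        index[row[key_column]].append(row)
    return index


def cross(value, index):
    """Return the matching rows from the other project (empty list if none)."""
    return index.get(value, [])


# Two hypothetical "projects", invented for this sketch.
authors = [
    {"name": "Jane Austen", "birth_year": 1775},
    {"name": "Charles Dickens", "birth_year": 1812},
]
books = [
    {"title": "Emma", "author": "Jane Austen"},
    {"title": "Unknown Pamphlet", "author": "Anonymous"},
]

authors_by_name = build_index(authors, "name")
for book in books:
    matches = cross(book["author"], authors_by_name)
    # Copy the birth year across when there is a match, else leave it empty.
    book["author_birth_year"] = matches[0]["birth_year"] if matches else None

print(books)
```

Rows without a match get an empty value rather than an error, so they can be faceted and reviewed afterwards.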

TxDHC blog post on this webinar http://www.txdhc.org/txdhc-training-webcast-materials/

The OpenRefine Wiki https://github.com/OpenRefine/OpenRefine/wiki

OpenRefine User Documentation

https://github.com/OpenRefine/OpenRefine/wiki/Documentation-For-Users

The ‘Free your metadata’ site http://freeyourmetadata.org...

…and book http://book.freeyourmetadata.org

The OpenRefine mailing list and forum

http://groups.google.com/d/forum/openrefine

http://bit.ly/1uGPd0f

Please email us if you have any questions: Jennifer = [email protected]

Liz = [email protected]

credits * acknowledgements * citations

These slides were developed by Jennifer Hecker ([email protected]) and Liz Grumbach ([email protected]) on behalf of University of Texas Libraries, Texas A&M’s Initiative for Digital Humanities, Media and Culture, and the Texas Digital Humanities Consortium using many resources including the wonderful course material developed by Owen Stephens on behalf of the British Library (http://www.meanboyfriend.com/overdue_ideas/2014/11/working-with-data-using-openrefine/).

Unless otherwise stated, all images, audio or video content are separate works with their own license, and should not be assumed to be CC-BY in their own right. This work is licensed under a Creative Commons Attribution 4.0 International License http://creativecommons.org/licenses/by/4.0/. When crediting this work, please include the phrase “Developed by Liz Grumbach and Jennifer Hecker on behalf of the University of Texas, Texas A&M, and the TXDHC.”

Thanks to University of Texas Libraries, Texas A&M’s Initiative for Digital Humanities, and the Texas Digital Humanities Consortium for facilitating this presentation.