Acquiring, Cleaning, and Storing Data


Digital Humanities, Lesson Two. Studia nových médií, 15. 10. 2012


Josef Šlerka, Studia nových médií

ETL (light version)

Extracting data from outside sources

Transforming it to fit operational needs (which can include quality levels)

Loading it into the end target (database, more specifically, operational data store, data mart or data warehouse)

(see Wikipedia)
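The three steps above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the sample CSV text, the `authors` table, and the function names `extract`, `transform`, and `load` are all hypothetical, and an in-memory SQLite database stands in for the end target.

```python
import csv
import io
import sqlite3

# Hypothetical raw input standing in for an outside source.
RAW_CSV = """name,year
 Karel Čapek ,1890
Božena Němcová,1820
"""


def extract(text):
    """Extract: read rows from the outside source (here, a CSV string)."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: trim stray whitespace and convert the year to an integer."""
    return [(row["name"].strip(), int(row["year"])) for row in rows]


def load(records, connection):
    """Load: write the cleaned records into the end target (a SQLite table)."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS authors (name TEXT, year INTEGER)"
    )
    connection.executemany("INSERT INTO authors VALUES (?, ?)", records)
    connection.commit()


connection = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), connection)
stored = connection.execute(
    "SELECT name, year FROM authors ORDER BY year"
).fetchall()
print(stored)
```

Keeping each step in its own function means the source or the target can be swapped later without touching the cleaning logic.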

Real-life ETL cycle according to Wikipedia

1. Cycle initiation

2. Build reference data

3. Extract (from sources)

4. Validate

5. Transform (clean, apply business rules, check for data integrity, create aggregates or disaggregates)

6. Stage (load into staging tables, if used)

7. Audit reports (for example, on compliance with business rules; in case of failure, they also help to diagnose and repair)

8. Publish (to target tables)

9. Archive

10. Clean up
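Several of the steps in this cycle (validate, transform, stage, audit, publish, clean up) can be sketched as follows. This is a rough illustration, not a production pipeline: the sample rows, the `staging` and `visits` tables, and the `is_valid` rule are assumptions made up for the example.

```python
import sqlite3

# Hypothetical extracted rows; the second one is deliberately invalid.
extracted = [
    {"city": "Praha", "visitors": "120"},
    {"city": "Brno", "visitors": "n/a"},
    {"city": "praha ", "visitors": "30"},
]


def is_valid(row):
    """Validate: accept a row only if the visitor count consists of digits."""
    return row["visitors"].isdigit()


valid = [row for row in extracted if is_valid(row)]
rejected = [row for row in extracted if not is_valid(row)]

# Transform: unify the spelling of the city and convert counts to integers.
cleaned = [(row["city"].strip().title(), int(row["visitors"])) for row in valid]

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE staging (city TEXT, visitors INTEGER)")
db.execute("CREATE TABLE visits (city TEXT PRIMARY KEY, visitors INTEGER)")

# Stage: load the cleaned rows into a staging table first.
db.executemany("INSERT INTO staging VALUES (?, ?)", cleaned)

# Audit: record how many rows were extracted, staged, and rejected.
audit = {
    "extracted": len(extracted),
    "staged": len(cleaned),
    "rejected": len(rejected),
}

# Publish: aggregate the staged rows into the target table.
db.execute(
    "INSERT INTO visits SELECT city, SUM(visitors) FROM staging GROUP BY city"
)

# Clean up: empty the staging table once the data is published.
db.execute("DELETE FROM staging")
db.commit()

published = db.execute("SELECT city, visitors FROM visits").fetchall()
print(audit, published)
```

The audit dictionary is the smallest possible "audit report": when the published numbers look wrong, it shows at a glance whether rows were lost during validation.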

Extracting

what will come in handy...

Extract

structured vs. unstructured data

for DH, most often databases vs. the web

web API vs. scraping

you can get by with only a little technical knowledge

static vs. real-time data; these can be tricky, but the problems can be solved
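The difference between a web API and scraping can be sketched with two small inputs that carry the same information. Both the JSON response and the HTML page below are hypothetical samples embedded as strings, so the example runs without network access; the names `titles_from_api`, `TitleScraper`, and `titles_from_html` are made up for illustration.

```python
import json
from html.parser import HTMLParser

# Hypothetical API response: the structure is explicit in the JSON.
API_RESPONSE = '{"articles": [{"title": "First post"}, {"title": "Second post"}]}'

# Hypothetical web page carrying the same information as markup.
HTML_PAGE = """
<html><body>
  <h1>Blog</h1>
  <h2 class="title">First post</h2>
  <p>Some text.</p>
  <h2 class="title">Second post</h2>
</body></html>
"""


def titles_from_api(text):
    """Web API: decode the JSON and read the field we need."""
    return [article["title"] for article in json.loads(text)["articles"]]


class TitleScraper(HTMLParser):
    """Scraping: collect the text of every <h2 class="title"> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs.
        if tag == "h2" and ("class", "title") in attrs:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.titles.append(data.strip())


def titles_from_html(text):
    scraper = TitleScraper()
    scraper.feed(text)
    return scraper.titles


api_titles = titles_from_api(API_RESPONSE)
scraped_titles = titles_from_html(HTML_PAGE)
print(api_titles, scraped_titles)
```

The API version is one line because the provider already structured the data; the scraper has to rediscover that structure from the page layout, so it is more likely to break when the layout changes.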

XPath

XPath, the XML Path Language, is a query language for selecting nodes from an XML document. In addition, XPath may be used to compute values (e.g., strings, numbers, or Boolean values) from the content of an XML document. XPath was defined by the World Wide Web Consortium (W3C).
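A few node-selecting expressions can be tried with Python's standard `xml.etree.ElementTree` module. Note that this module supports only a limited subset of XPath (path steps and simple predicates, no value-computing functions such as `count()`), and the small `library` document below is a hypothetical sample.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document.
XML_DOCUMENT = """
<library>
  <book year="1920"><title>R.U.R.</title><author>Karel Čapek</author></book>
  <book year="1936"><title>Válka s mloky</title><author>Karel Čapek</author></book>
  <book year="1855"><title>Babička</title><author>Božena Němcová</author></book>
</library>
"""

root = ET.fromstring(XML_DOCUMENT.strip())

# Select every <title> that is a child of a <book>.
all_titles = [node.text for node in root.findall("./book/title")]

# Select books whose <author> child has exactly this text.
capek_titles = [
    book.find("title").text
    for book in root.findall("./book[author='Karel Čapek']")
]

# Select by attribute value, then step down to the title.
book_1855 = root.find("./book[@year='1855']/title").text

print(all_titles, capek_titles, book_1855)
```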

Simple tools

Google Docs (mainly static data)

http://drive.google.com

YQL (mainly static data)

http://developer.yahoo.com/yql/console/

Yahoo Pipes (mainly dynamic data)

http://pipes.yahoo.com/pipes/

IFTTT (mainly dynamic data)

https://ifttt.com/

Transforming

Mainly about cleaning and unifying data...

Google Refine

http://code.google.com/p/google-refine/downloads/list?can=1

Google Refine is a standalone desktop application provided by Google for data cleanup and transformation to other formats. It is similar to spreadsheet applications (and can work with spreadsheet file formats), but it behaves more like a database.
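The kind of cleanup and unification such a tool automates can be sketched by hand. The approach below groups spellings that reduce to the same normalized key and replaces each with the most common variant; it is an assumption made for illustration, not a description of how Google Refine works internally, and the function names and sample values are hypothetical.

```python
import string
from collections import defaultdict


def fingerprint(value):
    """Normalized key: trim, lowercase, drop punctuation, sort unique tokens."""
    cleaned = value.strip().lower()
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(cleaned.split())))


def cluster(values):
    """Group values that share the same fingerprint."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return dict(groups)


def unify(values):
    """Replace every value with the most frequent spelling in its group."""
    groups = cluster(values)
    # max() returns the first of several equally frequent spellings.
    canonical = {
        key: max(group, key=group.count) for key, group in groups.items()
    }
    return [canonical[fingerprint(value)] for value in values]


# Hypothetical messy column.
names = [
    "Charles University",
    "charles university ",
    "University, Charles",
    "Charles University",
    "Masaryk University",
]
unified = unify(names)
print(unified)
```

A key like this only catches differences in case, spacing, punctuation, and word order; real typos still need manual review.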

Loading

where to put the data, if not into a traditional database...

Google Fusion Tables

a simple solution for ordinary users

http://www.google.com/fusiontables/Home/

Web service provided by Google for data management. Data is stored in multiple tables that Internet users can view and download. The Web service provides means for visualizing data with pie charts, bar charts, line plots, scatterplots, timelines as well as geographical maps. Data is exported in a comma-separated values file format.
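Since comma-separated values are the common exchange format here, a minimal sketch of writing cleaned records to CSV and reading them back may help. The records, the column names, and the `to_csv` helper are hypothetical; the example uses only Python's standard `csv` module and does not call any Google service.

```python
import csv
import io

# Hypothetical cleaned records ready for loading.
records = [
    {"city": "Praha", "lat": 50.0755, "lon": 14.4378},
    {"city": "Brno", "lat": 49.1951, "lon": 16.6068},
]


def to_csv(rows, fieldnames):
    """Serialize dictionaries to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = to_csv(records, ["city", "lat", "lon"])

# Reading the text back shows that every value comes back as a string.
round_trip = list(csv.DictReader(io.StringIO(csv_text)))
print(csv_text)
```

CSV carries no type information, so numbers come back as text and must be converted again after import.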

And now one more demo...
