Digital Humanities, Lesson Two, Studia nových médií, 15 October 2012
Acquiring, Cleaning, and Storing Data
Digital Humanities, Lesson Two
Josef Šlerka, Studia nových médií, 15 October 2012
ETL (light version)
Extracting data from outside sources
Transforming it to fit operational needs (which can include quality levels)
Loading it into the end target (database, more specifically, operational data store, data mart or data warehouse)
(see Wikipedia)
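The three steps above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the inline CSV text stands in for an outside source, the `people` table and the "a record needs a name" quality rule are made up for the example, and an in-memory SQLite database plays the role of the end target.

```python
import csv
import io
import sqlite3

# Hypothetical raw input standing in for an "outside source".
RAW_CSV = """name,year
 Ada Lovelace ,1843
alan turing,1936
,1950
"""

def extract(text):
    """Extract: read rows from the source into dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: trim whitespace, unify capitalisation, drop rows without a name."""
    cleaned = []
    for row in rows:
        name = row["name"].strip().title()
        if not name:
            continue  # illustrative quality rule: a record needs a name
        cleaned.append((name, int(row["year"])))
    return cleaned

def load(records, connection):
    """Load: write the cleaned records into the target table."""
    connection.execute("CREATE TABLE IF NOT EXISTS people (name TEXT, year INTEGER)")
    connection.executemany("INSERT INTO people VALUES (?, ?)", records)
    connection.commit()

connection = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), connection)
result = connection.execute("SELECT name, year FROM people ORDER BY year").fetchall()
print(result)
```

Keeping each step in its own function means any one of them can be swapped out (a web API instead of CSV, a spreadsheet instead of SQLite) without touching the others.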
Real life, according to Wikipedia
1. Cycle initiation
2. Build reference data
3. Extract (from sources)
4. Validate
5. Transform (clean, apply business rules, check for data integrity, create aggregates or disaggregates)
6. Stage (load into staging tables, if used)
7. Audit reports (for example, on compliance with business rules. Also, in case of failure, helps to diagnose/repair)
8. Publish (to target tables)
9. Archive
10. Clean up
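As a rough illustration of how the validate, stage, audit, and publish steps relate to each other, here is a small sketch. The records and the single "year must be a number" rule are hypothetical; a real cycle would have many rules and real staging tables.

```python
# Hypothetical extracted records; the rule below is an illustrative "business rule".
records = [
    {"id": 1, "year": "2012"},
    {"id": 2, "year": "n/a"},
    {"id": 3, "year": "1999"},
]

def validate(record):
    """Return a list of rule violations for one record (an empty list means valid)."""
    problems = []
    if not record["year"].isdigit():
        problems.append("year is not a number")
    return problems

staged, audit = [], []
for record in records:
    problems = validate(record)
    if problems:
        # Failed records go into the audit report so they can be diagnosed later.
        audit.append((record["id"], problems))
    else:
        # Valid records are transformed (year becomes an integer) and staged.
        staged.append({"id": record["id"], "year": int(record["year"])})

published = list(staged)  # publish only what passed validation
print("published:", published)
print("audit report:", audit)
```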
Extracting
What will come in handy...
Extract
structured vs. unstructured data
for DH, most often databases vs. the web
web API vs. scraping
you can get by with only a little knowledge
static vs. real-time data: they can be treacherous, but this can be dealt with
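To make the web API vs. scraping contrast concrete, the sketch below gets the same list of titles twice: once from a structured JSON response, once from HTML markup. Both inputs are hypothetical inline strings (no network access), and the `TitleScraper` class and the `class="title"` convention are assumptions made for the example.

```python
import json
from html.parser import HTMLParser

# Hypothetical API response: structured, so one call turns it into Python objects.
API_RESPONSE = '{"posts": [{"title": "First post"}, {"title": "Second post"}]}'
api_titles = [post["title"] for post in json.loads(API_RESPONSE)["posts"]]

# Hypothetical web page with the same content: titles must be dug out of the markup.
PAGE = """
<html><body>
  <h2 class="title">First post</h2>
  <p>Some text</p>
  <h2 class="title">Second post</h2>
</body></html>
"""

class TitleScraper(HTMLParser):
    """Collects the text of every <h2 class="title"> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs.
        if tag == "h2" and ("class", "title") in attrs:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.titles.append(data.strip())

scraper = TitleScraper()
scraper.feed(PAGE)
scraped_titles = scraper.titles

print(api_titles)
print(scraped_titles)
```

The API route is one line; the scraping route depends on the page layout and breaks as soon as the markup changes, which is why an API is usually preferable when one exists.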
XPATH
XPath, the XML Path Language, is a query language for selecting nodes from an XML document. In addition, XPath may be used to compute values (e.g., strings, numbers, or Boolean values) from the content of an XML document. XPath was defined by the World Wide Web Consortium (W3C).
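A short sketch of node selection with XPath-style expressions, using Python's standard `xml.etree.ElementTree`. Note the assumptions: the XML document is made up for the example, and ElementTree supports only a limited subset of XPath (paths and simple predicates), so computing values with XPath functions would need a full XPath engine.

```python
import xml.etree.ElementTree as ET

# Hypothetical document used only for illustration.
XML = """
<library>
  <book year="1843"><title>Notes</title><author>Ada Lovelace</author></book>
  <book year="1936"><title>On Computable Numbers</title><author>Alan Turing</author></book>
</library>
"""

root = ET.fromstring(XML)

# Select nodes by path: every <title> that is a child of a <book>.
titles = [node.text for node in root.findall("./book/title")]

# Select with a predicate: the <book> whose year attribute equals 1936.
turing_book = root.find("./book[@year='1936']")
author = turing_book.find("author").text

print(titles)
print(author)
```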
Simple tools
Google Docs (mainly static data)
http://drive.google.com
YQL (mainly static data)
http://developer.yahoo.com/yql/console/
Yahoo Pipes (mainly dynamic data)
http://pipes.yahoo.com/pipes/
IFTTT (mainly dynamic data)
https://ifttt.com/
But powerful...
Twitter Archiving Google Spreadsheet TAGS v3
http://mashe.hawksey.info/2012/01/twitter-archive-tagsv3/
Transforming
Mainly about cleaning and unifying data...
Google Refine
http://code.google.com/p/google-refine/downloads/list?can=1
Google Refine is a standalone desktop application provided by Google for data cleanup and transformation to other formats. It is similar to spreadsheet applications (and can work with spreadsheet file formats), but it behaves more like a database.
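To give a feel for what "cleaning and unifying" means in practice, the sketch below groups spelling variants of the same value by a normalised key. It is loosely modelled on the key-collision ("fingerprint") style of clustering that tools like this offer; the exact normalisation steps and the sample values are assumptions for illustration, not the tool's actual implementation.

```python
import string
import unicodedata
from collections import defaultdict

def fingerprint(value):
    """Normalise a string so that trivially different spellings collide."""
    # Strip accents, lowercase, and trim surrounding whitespace.
    text = unicodedata.normalize("NFKD", value)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.lower().strip()
    # Remove ASCII punctuation, then sort the unique tokens.
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(text.split())))

# Hypothetical messy column values.
values = ["Karel Čapek", "čapek, karel", "KAREL CAPEK ", "Ada Lovelace"]

clusters = defaultdict(list)
for value in values:
    clusters[fingerprint(value)].append(value)

for key, members in clusters.items():
    print(key, "->", members)
```

Values that land in the same cluster are candidates for merging into one canonical spelling; a person should still review them, because an aggressive key can also collide values that are genuinely different.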
Loading
Where to put the data, if not into a traditional database...
Google Fusion Tables
A simple solution for ordinary users
http://www.google.com/fusiontables/Home/
Web service provided by Google for data management. Data is stored in multiple tables that Internet users can view and download. The Web service provides means for visualizing data with pie charts, bar charts, line plots, scatterplots, timelines as well as geographical maps. Data is exported in a comma-separated values file format.
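Since comma-separated values are the common exchange format here, a minimal sketch of producing such a file from cleaned records may be useful. The records and column names are illustrative assumptions, and the example writes to an in-memory buffer rather than a real file.

```python
import csv
import io

# Hypothetical cleaned records ready to be loaded somewhere.
rows = [
    {"city": "Praha", "lat": 50.0755, "lon": 14.4378},
    {"city": "Brno", "lat": 49.1951, "lon": 16.6068},
]

# Write to an in-memory buffer; for a real file, use
# open(path, "w", newline="", encoding="utf-8") instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["city", "lat", "lon"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# Reading it back confirms the round trip (all values come back as strings).
loaded = list(csv.DictReader(io.StringIO(csv_text)))
```

Because CSV carries no type information, numbers and dates come back as plain text and have to be converted again by whatever tool reads the file.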
And now one more demo...