Data discovery and reuse TTF-IP Workshop, 18.3. Monday, March 21, 2011

Slides for a workshop led by Friedrich Lindenberg and Jonathan Gray at the "Use of Information and Data for Enhanced Communication and Advocacy" workshop in Budapest, 17th March 2011.

Data processes 1

•Need: machine-readable, openly licensed.

•Re-publish derived data

Data processes 2

•Goal: reproducible results, ecosystems:

•Tools to regularly extract data, share, transform and load it.

•Catalogues, documentation.

•“Data-the-process”, not “data-the-file”

•no “Excel Afternoons”
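
The "data-the-process" idea above can be sketched as a small extract, transform, load script that is re-run whenever the source changes, rather than edited by hand in a spreadsheet. This is only an illustrative sketch: the sample CSV, its column names and the function names are assumptions made for the example, not material from the workshop.

```python
import csv
import io
import json

# Hypothetical raw input; in practice this would be fetched on a schedule.
RAW_CSV = """country,year,amount
Hungary,2010,"1,200"
Hungary,2011,"1,450"
Austria,2010,980
"""


def extract(text):
    """Parse raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Normalise types: strip thousands separators and cast to int."""
    return [
        {
            "country": row["country"].strip(),
            "year": int(row["year"]),
            "amount": int(row["amount"].replace(",", "")),
        }
        for row in rows
    ]


def load(records):
    """Serialise to JSON so the derived data can be re-published."""
    return json.dumps(records, indent=2, sort_keys=True)


def run_pipeline(text):
    """Run all three steps; the same input always gives the same output."""
    return load(transform(extract(text)))


if __name__ == "__main__":
    print(run_pipeline(RAW_CSV))
```

Because every step is a plain function, the whole run can be repeated, versioned and shared, which is the point of avoiding one-off manual clean-up.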

Data Acquisition

                      Voluntary release    Involuntary release

Active acquisition    FoI                  Scraping

Passive acquisition   PSI/Open Data        Leaks

Basic tools

•Language “convention”: Python

•Screen scraping: ScraperWiki

•Semi-structured storage: MongoDB

•Keeping an overview: CKAN
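
As a rough illustration of what screen scraping in Python involves, the sketch below pulls the rows of an HTML table into dictionaries using only the standard library. It does not use the ScraperWiki API; the sample HTML, the class name `TableScraper` and the function `scrape_table` are hypothetical and exist only for this example.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical page fragment standing in for a fetched web page.
SAMPLE_HTML = """
<table>
  <tr><th>Recipient</th><th>Amount</th></tr>
  <tr><td>Example Org</td><td>1200</td></tr>
  <tr><td>Another Org</td><td>980</td></tr>
</table>
"""


def scrape_table(html):
    """Return one dict per body row, keyed by the header row."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


if __name__ == "__main__":
    for record in scrape_table(SAMPLE_HTML):
        print(record)
```

Records of this shape are schema-free dictionaries, so they could be handed to a semi-structured store without first designing a table layout.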

Textual data

•De-PDF (Acrobat Pro, pdf2text)

•Index & Search (Apache Solr)

•Basic NLP: word counts/frequencies, named-entity extraction (NEE), etc.: nltk

•Publish: Co-ment, AnnotateIt, Scribd

•Soon: DocumentClouds for all!
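
The word-count step listed above can be approximated with the standard library alone, as in the sketch below. The slide names nltk for this; the sketch deliberately avoids it so that it runs without extra installs. The stop-word list, the sample sentence and the function name `word_frequencies` are assumptions made for the example.

```python
import re
from collections import Counter

# A tiny, hypothetical stop-word list; real lists are much longer.
STOPWORDS = {"the", "a", "of", "and", "to", "in", "is"}

# Hypothetical sample text standing in for text extracted from a PDF.
SAMPLE = "Open data is data that anyone can use. Data reuse needs open licences."


def word_frequencies(text, top=3):
    """Lowercase, tokenise on letters/apostrophes, drop stop words, count."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(token for token in tokens if token not in STOPWORDS)
    return counts.most_common(top)


if __name__ == "__main__":
    print(word_frequencies(SAMPLE, top=2))
```

A dedicated NLP library adds proper tokenisation, stemming and entity extraction on top of this basic counting.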

Numeric data I

•De-PDF: ABBYY FineReader

•Munge & Massage: Google Refine

•Share & Extend: Google Spreadsheets

•R/Stata/SPSS: more suited to internal processes.

•BI/Analytics/Aggregation: custom?
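
Much of the "munge & massage" step is about reconciling variant spellings of the same name. The sketch below shows one common approach, key-collision clustering, in which each value is reduced to a normalised fingerprint and values sharing a fingerprint are grouped. It is written in the spirit of that technique and is not the implementation used by any particular tool; the sample names and the function names `fingerprint` and `cluster` are hypothetical.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Fold accents to ASCII, lowercase, drop punctuation, sort unique tokens."""
    ascii_text = (
        unicodedata.normalize("NFKD", value)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    cleaned = re.sub(r"[^\w\s]", "", ascii_text.strip().lower())
    return " ".join(sorted(set(cleaned.split())))


def cluster(values):
    """Group values whose fingerprints collide; keep groups of two or more."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(group) > 1]


# Hypothetical column of messy organisation names.
NAMES = [
    "Ministry of Finance",
    "ministry of finance.",
    "Finance, Ministry of",
    "Ministry of Health",
]

if __name__ == "__main__":
    for group in cluster(NAMES):
        print(group)
```

Each resulting group is a candidate for merging into one canonical spelling, which still needs a human decision.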

Numeric data II

•Visualization, first go: Google Vis Toolkit, IBM ManyEyes

•Visualization, interactive: Protovis, Raphael

•Flash considered harmful :-(

Network data

•Can be derived from other types

•Think about structure: nodes, edges, weights, directions

•Analysis: find central actors, mediators, ...: MCI, NetworkX

•Visualization: Gephi, GraphViz
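
To make the points above concrete, the sketch below derives a network from tabular data (who attended which meeting), stores it as weighted undirected edges, and ranks actors by degree centrality. It uses only the standard library so that it runs as-is; NetworkX offers the same measure and many more. The meeting data and the function names `build_edges` and `degree_centrality` are assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical attendance table: meeting id -> participants.
MEETINGS = {
    "m1": ["Alice", "Bob", "Carol"],
    "m2": ["Bob", "Dave"],
    "m3": ["Bob", "Carol"],
}


def build_edges(groups):
    """Weighted undirected edges; weight = number of shared meetings."""
    weights = defaultdict(int)
    for members in groups.values():
        # Sorting gives each pair one canonical (a, b) orientation.
        for a, b in combinations(sorted(set(members)), 2):
            weights[(a, b)] += 1
    return dict(weights)


def degree_centrality(edges):
    """Share of the other nodes each node is directly connected to."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    n = len(neighbours)
    return {node: len(adj) / (n - 1) for node, adj in neighbours.items()}


if __name__ == "__main__":
    edges = build_edges(MEETINGS)
    centrality = degree_centrality(edges)
    for node in sorted(centrality, key=centrality.get, reverse=True):
        print(node, round(centrality[node], 2))
```

Degree centrality only counts direct ties; finding mediators usually calls for a path-based measure such as betweenness, which is where a dedicated library becomes worthwhile.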

FTS “Afghanistan”, Ronny Patz

Geo-data

•There is more than Google Maps markers :-)

•Talk to your neighborhood OSM crowd.
