© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Unlocking Open Data in the Cloud
Grischa Gundelsweiler
Public Sector Account Manager, DACH
Loft + Lab Munich
11th November 2016
What this session is about
1) Open Data: Concepts, Examples & Trends
2) AWS as a Platform for Open Data
3) Case Study: Provide Open Data on AWS
4) Case Study: Use Open Data on AWS
“Open data is data that can be freely used, shared and built-on by anyone, anywhere, for any purpose.”
Definition by Open Knowledge Foundation, 2013
http://blog.okfn.org/2013/10/03/defining-open-data/
The 8 Open Government Data Principles
1. Complete
2. Primary
3. Timely
4. Accessible
5. Machine processable
6. Non-discriminatory
7. Non-proprietary
8. License-free
https://opengovdata.org/
Why Open Data?
1. Transparency
2. Releasing social and commercial value
3. Participation and engagement
McKinsey report from October 2013
http://www.mckinsey.com/business-functions/digital-mckinsey/our-insights/open-data-unlocking-innovation-and-performance-with-liquid-information
EC study from November 2015: Creating Value through Open Data: Study on the Impact of Re-use of Public Data Resources
https://www.europeandataportal.eu/sites/default/files/edp_creating_value_through_open_data_0.pdf
Why does AWS care about Open Data?
• Many of our commercial sector customers rely on quality open data as much as they rely on our cloud infrastructure services.
• Many of our public sector customers use AWS to make their data available to a global community of researchers, entrepreneurs, students, and fellow government agencies.
Sharing data makes it accessible to a large and growing community of researchers, entrepreneurs, and enterprises.
The cloud allows users from anywhere to take their algorithms to data rather than downloading data to their computing resources.
Data Acquisition in the Cloud
Open data as a platform
[Figure: open data as a platform. Labels: Data Creation; Data Enrichment; Sensemaking; Data at Rest (object storage); Basic APIs; Complex APIs; Data Catalogs; Consumer applications; Algorithmic policy; Data-driven journalism; Focused data dashboards; Predictive modeling; Visualizations; Lower cost of knowledge (efficiency).]
A Rich Set of Programmable Services
Administration and Security: Access Control; Identity Management; Key Management and Storage; Monitoring and Logs; Resource and Usage Auditing

Platform Services
Analytics: Data Pipelines; Data Warehouse; Hadoop; Real-Time Streaming Data
App Services: App Streaming; Queuing and Notifications; Search; Transcoding; Workflow
Developer Tools and Operations: Application Lifecycle Management; Containers; Deployment; DevOps; Event-Driven Computing; Resource Templates
Mobile Services: Identity; Mobile Analytics; Push Notifications; Sync

Core Services: CDN; Compute (VMs, Auto-Scaling and Load Balancing); Databases (Relational, NoSQL, and Caching); Networking (VPC, DX, and DNS); Storage (Object, Block, and Archival)

Infrastructure: Availability Zones; Points of Presence; Regions

Enterprise Applications: Business Email; Sharing and Collaboration; Virtual Desktop

Technical and Business Support: Account Management; Partner Ecosystem; Professional Services; Security and Pricing Reports; Solutions Architects; Support; Training and Certification
Why open data at TfL?
Transparency
Reach
Optimal use of transport network
Economic benefit
Innovation
…
Available Datasets
The API supports all the data requirements of the TfL website. Every data-driven aspect of the website (including maps) is powered by the unified API.
Some of the multi-modal core datasets included and available to developers are:
• Journey Planning (current and future)
• Status (current and future)
• Disruptions (current) and Planned works (future)
• Arrival/departure predictions (instant and websockets)
• Timetables
• Embarkation points and facilities
• Routes and lines (topology and geographical)
• Fares
London
Almost 500 apps produced.
Playground for innovation.
Improving transportation, collaboratively.

Munich
Apps by public transportation authorities: MVV, MVG, DB.
No information on how to access the data; documentation is lacking.
Outcomes
• Customers save time, economic benefits
• New jobs and investments in startup and tech ecosystem
• Usage of data has since doubled
• Data consolidation and quality

Cloud Benefits
• Pay for what you use
• Lower maintenance costs
• Elasticity
• Automation and consistency
• Blue/green deployment (zero downtime)
• Highly secure
MWD Advisors case study: https://d0.awsstatic.com/analyst-reports/MWD_AWS_TFL_Case_Study_Sept_2015.pdf
Solutions for providing Open Data on AWS
Open data platforms
• Catalog
• Publish
• Discover
• Visualize
• Analyze
• Share
• …
Public Data Sets on AWS
Several high-value datasets are available for anyone to access for free on AWS. Examples include:
• Landsat on AWS
• 3K Rice Genome
• NEXRAD on AWS
More available Public Datasets on AWS…
GDELT: Over a quarter-billion records monitoring the world's broadcast, print, and web news from nearly every corner of every country, updated daily.
IRS 990 Filings on AWS: Machine-readable data from certain electronic 990 forms filed with the IRS from 2011 to present.
Common Crawl Corpus: A corpus of web crawl data composed of over 5 billion web pages.
TCGA on AWS: Raw and processed genomic, transcriptomic, and epigenomic data from The Cancer Genome Atlas (TCGA), available to qualified researchers via the Cancer Genomics Cloud.
ICGC on AWS: Whole genome sequence data available to qualified researchers via The International Cancer Genome Consortium (ICGC).
1000 Genomes Project: A detailed map of human genetic variation.
Multimedia Commons: A collection of nearly 100M images and videos with audio and visual features and annotations.
Google Books Ngrams: A dataset containing Google Books n-gram corpuses.
A list of other Public Datasets is available here.
Accessing and processing Landsat data
What is Landsat on AWS?
How to access Landsat on AWS?
How to use Landsat on AWS?
Landsat on AWS
We have committed to make up to 1 petabyte of Landsat imagery readily available as objects on Amazon S3.
All Landsat 8 scenes from 2015 and 2016 are available, along with a selection of cloud-free scenes from 2013 and 2014.
All new Landsat 8 scenes are made available each day (~700 per day), often within hours of production.
Landsat on AWS
Landsat on AWS makes each band of each scene readily available as objects on Amazon S3. Data can be accessed programmatically via HTTP and quickly deployed to any of our products for analysis and processing.
Users do not need to worry about local storage and have access to virtually unlimited computing power on demand.
[Diagram labels: USGS (.tar); s3://landsat-pds (.tiff); Amazon EC2]
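To make "accessed programmatically via HTTP" concrete, here is a minimal sketch that turns an s3:// location in the landsat-pds bucket into an HTTPS URL. The virtual-hosted URL form is standard for Amazon S3, but the per-band key layout in band_url() and the scene identifier in the usage note are assumptions for illustration, not something these slides specify.

```python
from urllib.parse import urlparse


def s3_to_https(s3_uri: str) -> str:
    """Map s3://bucket/key to https://bucket.s3.amazonaws.com/key."""
    parts = urlparse(s3_uri)
    if parts.scheme != "s3" or not parts.netloc:
        raise ValueError(f"not an s3:// URI: {s3_uri!r}")
    return f"https://{parts.netloc}.s3.amazonaws.com{parts.path}"


def band_url(path: int, row: int, scene_id: str, band: int) -> str:
    """Build the URL of one band of one scene.

    ASSUMPTION: objects are keyed as
    L8/<path>/<row>/<scene_id>/<scene_id>_B<band>.TIF
    (hypothetical layout, labeled here because the slides do not state it).
    """
    prefix = s3_to_https(f"s3://landsat-pds/L8/{path:03d}/{row:03d}/")
    return f"{prefix}{scene_id}/{scene_id}_B{band}.TIF"
```

For example, s3_to_https("s3://landsat-pds/L8/072/089/") yields a plain HTTPS prefix, and band_url(72, 89, "SCENE_ID", 4) shows how a single band could be addressed, with "SCENE_ID" standing in for a real scene identifier.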
Undifferentiated heavy lifting
We use GDAL to add “internal tiling” on each Landsat on AWS tiff, which allows developers to use HTTP range gets to access specific portions of each scene.
This allows people to only access the data they need when they need it.
[Figure: a standard tiff object, with blocks 1 to 36 laid out row by row in a 6 x 6 grid, next to an internally tiled tiff object, with the same 36 blocks grouped into tiles.]
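An HTTP range get, as mentioned above, can be sketched with the Python standard library alone. This is an illustration under stated assumptions rather than the tooling behind Landsat on AWS: the function names are made up, and in practice the offset and length of a tile would be read from the tiff's own tile index, not hard-coded.

```python
import urllib.request


def range_request(url: str, offset: int, length: int) -> urllib.request.Request:
    """Build a GET request for `length` bytes starting at `offset`."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    last = offset + length - 1  # HTTP byte ranges are inclusive on both ends
    return urllib.request.Request(url, headers={"Range": f"bytes={offset}-{last}"})


def fetch_range(url: str, offset: int, length: int) -> bytes:
    """Download only the requested byte range of an object (network call)."""
    with urllib.request.urlopen(range_request(url, offset, length)) as resp:
        # A server that honors the range answers 206 Partial Content.
        return resp.read()
```

Calling fetch_range(url, offset, length) for a single tile transfers only that tile's bytes, which is what lets a client skip the rest of a large scene.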
RGB: Visible light
Infrared: Vegetation
Shortwave infrared: Urban areas
Think of URLs instead of copies
Wellington, New Zealand
https://landsat-pds.s3.amazonaws.com/L8/072/089/
Using Landsat on S3
[Diagram labels: user; ArcGIS Server on Amazon EC2; Landsat on Amazon S3; AWS US West (Oregon) Region; reliable, performant data access]
Usage in the first year:
• Over 400,000 scenes available
• Over 1 billion hits globally
Used for new product development by:
Landsat on AWS
Small investment, big impact:
• Public dataset hosted in FRA
• Apps for agriculture, disaster relief, vegetation monitoring, property taxation, …
Used for new product development by:
Sentinel-2 on AWS
Next steps
Depending on your role, your goals
• Use open data in your projects / your organisation
• Provide open data from your organisation
• Build a new business on open data
AWS offers
• Technology platform that constantly evolves
• Enablement through workshops, training, ProServ
• Customer and partner ecosystem to connect and build