What is a Data Commons and Why Should You Care?
Robert Grossman University of Chicago
Open Cloud Consortium
April 22, 2015 NASA IS&T Colloquium
2000: Collect data and distribute files via DAAC and apply data mining
2010–2015: Make data available via open APIs and apply data science
2020: ???
1. Data Commons
We have a problem … The commoditization of sensors is creating an explosive growth of data
It can take weeks to download large geo-spatial datasets
Analyzing the data is more expensive than producing it
There is not enough funding for every researcher to house all the data they need
Data Commons
Data commons co-locate data, storage and computing infrastructure, and commonly used tools for analyzing and sharing data to create a resource for the research community.
Source: Interior of one of Google's data centers, www.google.com/about/datacenters/
The Tragedy of the Commons
Source: Garrett Hardin, The Tragedy of the Commons, Science, Volume 162, Number 3859, pages 1243–1248, 13 December 1968.
Individuals, acting independently in their own self-interest, can deplete a common resource, contrary to the whole group's long-term best interests.
Garrett Hardin
www.opencloudconsortium.org
• U.S.-based not-for-profit corporation.
• Manages cloud computing infrastructure to support scientific research: Open Science Data Cloud, Project Matsu, & OCC NOAA Data Commons.
• Manages cloud computing infrastructure to support medical and health care research: Biomedical Commons Cloud.
• Manages cloud computing testbeds: Open Cloud Testbed.
What Scale?
• New data centers are sometimes divided into "pods," which can be built out as needed.
• A reasonable scale for what is needed for a commons is one of these pods (a "cyberpod").
• Let's use the term "datapod" for the analytic infrastructure that scales to a cyberpod.
• Think of a datapod as the scale-out of a database.
[Figure: experimental science, simulation science, and data science, each paired with an instrument milestone: 1609 (30x), 1670 (250x), 1976 (10x–100x), 2004 (10x–100x).]
Core Data Commons Services
• Digital IDs
• Metadata services
• High performance transport
• Data export
• Pay for compute, with images/containers containing commonly used tools, applications and services, specialized for each research community
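To make these services concrete, here is a minimal client-side sketch, assuming a hypothetical HTTP API; the base URL, endpoint paths, and response fields below are illustrative, not part of any published commons interface.

```python
# Hypothetical sketch of a client using core commons services.
# The endpoint and response shape are assumptions for illustration.
import requests

COMMONS = "https://commons.example.org"  # placeholder base URL

def resolve_digital_id(digital_id):
    """Look up the metadata and storage locations behind a digital ID."""
    r = requests.get(f"{COMMONS}/ids/{digital_id}")
    r.raise_for_status()
    return r.json()  # assumed shape: {"metadata": {...}, "locations": [...]}

def download(digital_id, dest_path):
    """Fetch an object's bytes from its first advertised location."""
    info = resolve_digital_id(digital_id)
    with requests.get(info["locations"][0], stream=True) as r, \
         open(dest_path, "wb") as f:
        for chunk in r.iter_content(chunk_size=1 << 20):
            f.write(chunk)
```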
[Diagram: research projects producing data feed Data Commons 1 and Data Commons 2, which provide data to other commons and to Clouds 1–3; research scientists at research centers A, B, and C download and access the data; the community develops open source software stacks for commons and clouds.]
[Figure: a spectrum of analytic infrastructure, from in-memory databases (GB, W) to datapods and cyberpods (PB, MW); at one end, complex statistical models over small data that are highly manual and updated infrequently; at the other, simpler statistical models over large data that are highly automated and updated frequently.]
Is More Different? Do New Phenomena Emerge at Scale in Biomedical Data?
Source: P. W. Anderson, More is Different, Science, Volume 177, Number 4047, 4 August 1972, pages 393–396.
2. OCC Data Commons
matsu.opensciencedatacloud.org
OCC-NASA Collaboration, 2009 – present
• Public-private data collaborative announced April 21, 2015 by Secretary of Commerce Pritzker.
• AWS, Google, Microsoft and the Open Cloud Consortium will form four collaborations.
OSDC Commons Architecture
• Object storage (permanent)
• Scalable lightweight workflow
• Community data products (data harmonization)
• Data submission portal
• Open APIs for data access and a data access portal
• Co-located "pay for compute"
• Digital ID Service & Metadata Service
• DevOps supporting virtual machines and containers
3. Scanning Queries over Commons and the Matsu Wheel
What is Project Matsu?
Matsu is an open source project for processing satellite imagery to support earth sciences researchers using a data commons.
Matsu is a joint project between the Open Cloud Consortium and NASA's EO-1 Mission (Dan Mandl, Lead)
All available L1G images (2010–now)
NASA’s Matsu Mashup
Flood/Drought Dashboard Examples
GeoSocial API Consumer embedded in Dashboard
Initial crowdsourcing functionality (pictures, GPS features and water edge locations)
GeoSocial API used to discover Radarsat product in area (user can see registration error)
1. The Open Science Data Cloud (OSDC) stores Level 0 data from EO-1 and uses an OpenStack-based cloud to create Level 1 data.
2. The OSDC also provides OpenStack resources for the Namibia Flood Dashboard developed by Dan Mandl's team.
3. Project Matsu uses a Hadoop/Accumulo system to run analytics nightly and to create tiles with an OGC-compliant WMTS.
[Chart: number of queries vs. amount of data retrieved, locating mashups, re-analysis, and the "wheel"; row-oriented vs. column-oriented access; done by staff vs. self-service by the community.]
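The idea behind the wheel, described in the slides that follow, is that many community analytics share a single scan over each batch of new data, rather than each analytic re-reading the archive on its own. A minimal sketch of that pattern (the analytic interface here is my assumption):

```python
# Minimal sketch of the "wheel" scanning pattern: each batch of new
# scenes is read once, and every registered analytic is applied to it.

def run_wheel(new_scenes, analytics):
    """One pass over the data; fan each scene out to all analytics."""
    results = {name: [] for name, _ in analytics}
    for scene in new_scenes:              # single scan of the batch
        for name, analytic in analytics:  # every analytic sees every scene
            results[name].append(analytic(scene))
    return results

# Usage: run_wheel(scenes, [("anomaly", detect_anomalies),
#                           ("land_cover", classify_land_cover)])
```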
Matsu Hadoop Architecture
• Hadoop HDFS stores the Level 0, Level 1 and Level 2 images.
• MapReduce (the Matsu MR-based Tiling Service) is used to process Level n to Level n+1 data and to partition images for different zoom levels.
• A NoSQL database (Accumulo) holds images at different zoom layers suitable for an OGC Web Mapping Server, along with storage for WMTS tiles and derived data products.
• Analytic services: NoSQL-based, streaming, and MR-based.
• Presentation services: the Matsu Web Map Tile Service and a Web Coverage Processing Service (WCPS).
• Workflow services.
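As a rough illustration of the tiling step, here is a toy map/reduce pair standing in for the Hadoop job; the key structure, tile size, and pixel layout are simplifying assumptions, not Matsu's actual job code.

```python
# Toy map/reduce pair for partitioning one image into fixed-size tiles.
TILE = 256  # pixels per tile edge, a common web-mapping convention

def map_pixels(image_id, zoom, pixels):
    """Emit ((image, zoom, tile_x, tile_y), local pixel) pairs."""
    for x, y, value in pixels:
        key = (image_id, zoom, x // TILE, y // TILE)
        yield key, (x % TILE, y % TILE, value)

def reduce_tile(key, pixel_values):
    """Assemble a single tile from all pixels grouped under its key."""
    tile = [[0] * TILE for _ in range(TILE)]
    for tx, ty, value in pixel_values:
        tile[ty][tx] = value
    return key, tile
```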
The Matsu Wheel for analyzing large volumes of hyperspectral image data
Spectral anomaly detected: Barren Island active volcano, February 2014
Spectral anomaly detected: Nishinoshima active volcano, December 2014
Spectral anomaly detected: North Sentinel Island fires, May 2014
Spectral anomaly detected: Colima Volcano, April 14, 2015
Matsu Wheel Spectral Anomaly Detector
§ "Contours and Clusters" – looks for physical contours around spectral clusters.
§ PCA is applied to the set of reflectivity values (spectra) for every pixel, and the top 5 components are extracted for further analysis.
§ Pixels are clustered in the transformed 5-D spectral space using a k-means clustering algorithm.
§ For each image, k = 50 spectral clusters are formed and ranked from most to least extreme using the Mahalanobis distance of the cluster from the spectral center.
§ For each spectral cluster, adjacent pixels are grouped together into contiguous objects.
→ Returns geographic regions of spectral anomalies, scored again as anomalous (0 least, 1000 most) compared to a set of "normal" spectra constructed over a baseline of time.
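For concreteness, a condensed sketch of this pipeline using scikit-learn and NumPy; the array shapes and function names are my assumptions, and the final 0–1000 scoring against baseline spectra is omitted.

```python
# Sketch of the contours-and-clusters steps: PCA to 5 components,
# k-means with k=50, clusters ranked by Mahalanobis distance.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def spectral_clusters(spectra, n_components=5, k=50):
    """spectra: (n_pixels, n_bands) reflectivity values for one image."""
    reduced = PCA(n_components=n_components).fit_transform(spectra)
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(reduced)

    # Rank clusters by Mahalanobis distance of their centroid from
    # the spectral center of the image.
    center = reduced.mean(axis=0)
    cov_inv = np.linalg.pinv(np.cov(reduced, rowvar=False))
    dists = []
    for c in range(k):
        d = reduced[labels == c].mean(axis=0) - center
        dists.append(float(np.sqrt(d @ cov_inv @ d)))
    order = np.argsort(dists)[::-1]  # most to least extreme
    return labels, order
```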
Wheel analytic (beta): SVM-based land cover classifier
Matsu Wheel Land Cover Classifier
§ Support Vector Machine supervised classifier (uses Python's scikit-learn).
§ Training set constructed from a variety of scenes with a range of locations, times of year, and sun angles.
§ Cloud, desert/dry land, water, and vegetation are manually (visually) classified using the RGB image.
→ Returns a classified image.
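A minimal sketch of such a classifier with scikit-learn; the per-pixel feature layout and class encoding are assumptions, not the Matsu code.

```python
# Sketch of the SVM land cover classifier described above.
from sklearn.svm import SVC

CLASSES = ["cloud", "desert/dry land", "water", "vegetation"]

def train_classifier(train_spectra, train_labels):
    """train_spectra: (n_pixels, n_bands); labels index into CLASSES."""
    return SVC(kernel="rbf").fit(train_spectra, train_labels)

def classify_scene(model, scene_spectra):
    """Return a per-pixel array of predicted class indices."""
    return model.predict(scene_spectra)
```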
4. Data Peering
Tier 1 ISPs “Created” the Internet
[Chart: number of queries, amount of data retrieved, and number of sites, contrasting downloading data with data peering.]
[Diagram: Cloud 1 peering with Data Commons 1 and Data Commons 2.]
Data Peering
• Tier 1 Commons exchange data for the research community at no charge.
Three Requirements
Two Research Data Commons with a Tier 1 data peering relationship agree as follows:
1. To transfer research data between them at no cost.
2. To peer with at least two other Tier 1 Research Data Commons at 10 Gbps or higher.
3. To support Digital IDs (of a form to be determined by mutual agreement) so that a researcher using infrastructure associated with one Tier 1 Research Data Commons can access data transparently from any of the Tier 1 Research Data Commons that hold the desired data.
5. Five Challenges for Data Commons
The 5P Challenges
• Permanent objects with Digital IDs
• Cyber Pods with scalable storage and analytics
• Data Peering
• Portable data
• Support for Pay for compute
Challenge 1: Permanent Secure Objects
• How do I assign Digital IDs and key metadata to "controlled access" data objects and collections of data objects to support distributed computation over large datasets by communities of researchers?
  – Metadata may be both public and controlled access
  – Objects must be secure
• Think of this as a "DNS for data."
• The test: one commons serving the earth science community can transfer 1 PB of data files to another commons and no data scientist needs to change their code.
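A toy sketch of the "DNS for data" idea: analysis code references a digital ID, and a resolver maps the ID to public metadata and to replica locations across commons. The registry layout and authorization check below are invented for illustration. Because code references only the ID, the 1 PB of files can move to another commons without any code changing.

```python
# Toy "DNS for data": map a digital ID to public metadata and to the
# replica locations held by different commons (all values hypothetical).
REGISTRY = {
    "id-0001": {
        "public_metadata": {"type": "earth science dataset"},
        "locations": ["https://commons-a.example/objects/id-0001",
                      "https://commons-b.example/objects/id-0001"],
    },
}

def resolve(digital_id, authorized=False):
    """Return public metadata; reveal locations only when authorized."""
    entry = REGISTRY[digital_id]
    return {"metadata": entry["public_metadata"],
            "locations": entry["locations"] if authorized else []}
```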
Challenge 2: Cyber Pods and Datapods
• How can I add a rack of computing/storage/networking equipment to a cyber pod (that has a manifest) so that
  – after attaching to power and
  – after attaching to the network,
  – no other manual configuration is required,
  – the data services can make use of the additional infrastructure, and
  – the compute services can make use of the additional infrastructure?
• In other words, we need an open source software stack that scales to cyberpods and data analysis that scales to datapods (a toy manifest sketch follows).
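One way to picture this is a declarative rack manifest consumed by pod automation; every field and function name below is invented for illustration, not part of any existing stack.

```python
# Hypothetical rack manifest: once the rack is attached to power and
# network, automation reads the manifest and joins each node to the
# right pool with no further manual configuration.
RACK_MANIFEST = {
    "rack_id": "cyberpod-1-rack-12",
    "nodes": [
        {"mac": "aa:bb:cc:00:00:01", "role": "storage", "disks_tb": 48},
        {"mac": "aa:bb:cc:00:00:02", "role": "compute", "cores": 32},
    ],
}

def join_pod(manifest, pools):
    """Register each node with the service pool matching its role."""
    for node in manifest["nodes"]:
        pools[node["role"]].add(node["mac"])
```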
Challenge 3: Data Peering
• How can a critical mass of data commons support data peering so that a researcher at one of the commons can transparently access data managed by any of the other commons?
  – We need to access data independent of where it is stored
  – "Tier 1 data commons" need to pass community managed data at no cost
  – We need to be able to transport large data efficiently "end to end" between commons
Challenge 4: Data Portability
• We need an "Indigo Button" to move our data between two commons that peer.
Challenge 5: Pay for Compute – Low Cost Data Integration
• Commons should support a "free storage for research data, pay for compute" model, perhaps with "chits" available to researchers.
• Today, we by and large integrate data with graduate students and technical staff.
• How can two datasets from two different commons be "joined" at "low cost" (see the sketch below)?
  – Linked data
  – Controlled vocabularies
  – Dataspaces
  – Universal correlation keys
  – Statistical methods
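As a toy illustration of one of these options, here is a join on a shared universal correlation key; the key name and record shapes are hypothetical.

```python
# Toy join of records from two commons on a shared correlation key.
def join_on_key(records_a, records_b, key="correlation_id"):
    """Inner-join two lists of record dicts on a common key."""
    index = {r[key]: r for r in records_b}
    return [{**a, **index[a[key]]} for a in records_a if a[key] in index]
```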
2000: Collect data and distribute files via DAAC and apply data mining
2010–2015: Make data available via open APIs and apply data science
2020: Operate data commons & support data peering
Questions?
For more information: rgrossman.com @bobgrossman