Archiving derived and temporally changing geospatial data in LEAD
Beth Plale
Department of Computer Science
School of Informatics
Indiana University
LEAD (Linked Environments for Atmospheric Discovery): dynamic, adaptive forecasting of mesoscale severe storms
GGF technologies leveraged: service-oriented architecture (moving to WSRF), WS-Notification, service registry, Globus RLS, OGSA-DAI
Beth Plale, IU -- data subsystem architecture, myLEAD personal information space, "VO" catalog
Dennis Gannon, IU -- workflow (GBPEL), portal/science gateway, TeraGrid, XSUL, notification
Oklahoma Univ. -- mesoscale meteorology
Unidata -- IDD, LDM
NCSA -- brokering
UNC (Reed) -- monitoring
UAH -- data mining of atmospheric data
Millersville, Howard University -- 6-12 and undergraduate education
NSF ATM-0331480
LEAD Data Subsystem Architecture
Access interfaces: personal workspace browser, geospatial query GUI, ask-ontology, visualization client (IDV)
Access services: Noesis ontology (concepts and vocabulary); query service (query mediation); THREDDS catalogs (web-browsable metadata); name service (single global naming system); automated metadata generation; stream service (from LDM to user's application)
Resource services: resource catalog (VO data and compute resources); myLEAD user information space; steerable instruments (CASA); grid storage repository; Unidata data dissemination client (LDM); OPeNDAP data server
Petascale data collections are increasingly crucial to research and education in science and engineering.
Current influential technology factors:
- Powerful and affordable sensors, processors, instruments, automated equipment
- Reductions in storage costs make it cost-effective to maintain large data collections
- Existence of the Internet makes it easier to share data
As a result, researchers increasingly conduct research using data originally generated by others: genomics, climate modeling, demographic studies.
Magnitude and breadth of the proliferation of data generation in the US: the same technological advances that produced inexpensive digital cameras have enabled a new generation of high-resolution scientific instruments and sensors.
An increasing amount of valuable content is "born digital" and can only be managed, preserved, and used in digital form. Advances in biomedical research depend on building and preserving complex genomic databases. Research in biodiversity and ecosystems, global climate change, meteorology, and space science depends on the ability to combine vast quantities of digital information with complex models and analytical tools.
Problem Domain: storage, retrieval, access to petascale data collections in science and engineering
Digital data collections* are the foundation for analysis using automated analytical tools
Long-lived data undergoes constant re-analysis for improved algorithms or with alternate use in mind.
Analysis depends not just on sensed or computer-generated data but on the metadata that characterizes the environment and the sensing instrument.
*Data - text, numbers, images, video or movie clips, audio, software, algorithms, equations, models, simulations
*Digital data collections - the data itself, plus the infrastructure and organizations needed to preserve access to the data.
Petascale data sets require a new work style
- Analysis tools are growing more complex; many analysis algorithms are super-linear, often needing N² or N³ time to process N data points
- I/O bandwidth has not kept pace with storage capacity: capacity increased 100-fold while storage bandwidth increased 10-fold
- Too many files (> 1 million) for a local file system to manage; file name and directory hierarchy are not enough
- Can't download the dataset to a laptop to process, analyze, and visualize; instead, move the end-user's program to the data, and communicate only questions and answers
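The capacity/bandwidth gap above can be made concrete with a back-of-envelope calculation. The 100-fold and 10-fold ratios come from the slide; the baseline capacity and bandwidth numbers are purely illustrative:

```python
# Back-of-envelope sketch of the capacity/bandwidth gap. The growth
# ratios (100x capacity, 10x bandwidth) are from the slide; the
# baseline numbers below are illustrative assumptions.

old_capacity_tb = 1.0        # baseline storage capacity (TB), assumed
old_bw_mb_s = 100.0          # baseline sustained bandwidth (MB/s), assumed

new_capacity_tb = old_capacity_tb * 100   # capacity grew 100-fold
new_bw_mb_s = old_bw_mb_s * 10            # bandwidth grew only 10-fold

def full_scan_hours(capacity_tb, bw_mb_s):
    """Time to read the entire store once at sustained bandwidth."""
    return capacity_tb * 1e6 / bw_mb_s / 3600  # TB -> MB, seconds -> hours

t_old = full_scan_hours(old_capacity_tb, old_bw_mb_s)
t_new = full_scan_hours(new_capacity_tb, new_bw_mb_s)
# The ratio is 100/10 = 10: a full scan of the store now takes
# ten times as long, which is why the program must move to the data.
```

The ratio is independent of the assumed baselines; only the 100x/10x growth factors matter.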
Problem statement
The technologies, strategies, methodologies, and resources needed to manage digital information have not kept pace with innovations in the creation and capture of digital information.
Current approaches do not scale to peta-scale data collections.
Typical analysis for mesoscale meteorologists
Compare model results to observational data
Research Domain: Archiving derived data products and temporally changing data products.
Archiving - saving “born-digital” content for future use and reuse
Derived data products - data products that are result of further processing of original raw data
Temporally changing data products - data that changes continuously through regular additions streamed into the archive, through ad hoc actions taken by content creators, or in conjunction with workflow processes.
Approach: general data models, standardized metadata schemas, a standard and highly modular system-level architecture (grid computing), and well-accepted communication protocols
Our current research challenges are in:
- Repository architecture: define the technical architecture; build tools to acquire, use, and store data; predict repository use for provisioning physical infrastructure
- Representation of temporal and procedural relationships: provenance; automated metadata generation; snapshots of temporally changing data products
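As a rough illustration of the automated-metadata-generation challenge, the sketch below maps header fields of a hypothetical data product to catalog attributes. Field and attribute names here are invented for illustration, not the LEAD metadata schema:

```python
# Hypothetical sketch of automated metadata generation: derive catalog
# attributes from a product's header fields. Field and attribute names
# are illustrative, not the actual LEAD schema.

def generate_metadata(product_name, header):
    """Map raw header fields to catalog attributes."""
    attrs = {"productName": product_name}
    if "valid_time" in header:
        attrs["timeOfValidity"] = header["valid_time"]
    if all(k in header for k in ("lat_min", "lat_max", "lon_min", "lon_max")):
        # record the geospatial bounding box for region queries
        attrs["spatialCoverage"] = (header["lat_min"], header["lat_max"],
                                    header["lon_min"], header["lon_max"])
    # a product with recorded source products is a derived product
    attrs["derived"] = header.get("source_products") is not None
    return attrs

meta = generate_metadata("wrf_out_001.nc", {
    "valid_time": "2004-09-15T12:00Z",
    "lat_min": 33.0, "lat_max": 37.0, "lon_min": -100.0, "lon_max": -94.0,
    "source_products": ["metar_obs", "eta_grid"],
})
```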
User access to the personal workspace is through the LEAD portal
Early interface for sharing data
Creating structure in user’s archive that models their investigation steps
[Figure: forecast workflow interacting with myLEAD]
Workflow (12 hrs total): gather data products → run 12-hour forecast (6 hrs to complete) → analyze results → based on analysis, gather other products → analyze results → run 6-hour forecast (3 hrs to complete).
The myLEAD agent forwards product requests, product registrations, and notification messages to the myLEAD server; a decoder service and a notification service participate in the workflow.
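The interaction above can be sketched as a toy workflow that registers each derived product with a catalog stub as steps complete. The API is hypothetical; the real myLEAD agent communicates via service requests and notification messages:

```python
# Toy sketch (hypothetical API) of a workflow registering each derived
# product with a myLEAD-style catalog as steps complete.

class CatalogStub:
    """Stand-in for the myLEAD server: records product registrations."""
    def __init__(self):
        self.registered = []

    def register(self, experiment, step, product):
        self.registered.append((experiment, step, product))

def run_workflow(catalog, experiment):
    # step names and file names are illustrative
    steps = [
        ("gather-data", ["metar_obs.nc", "eta_grid.nc"]),
        ("forecast-12hr", ["wrf_out_001.nc"]),
        ("analyze", ["analysis_001.nc"]),
        ("forecast-6hr", ["wrf_out_002.nc"]),
    ]
    for step, products in steps:
        for p in products:
            # each product is archived as soon as its step emits it
            catalog.register(experiment, step, p)

catalog = CatalogStub()
run_workflow(catalog, "Experim-Dec04")
```

Registering per step, rather than at workflow completion, is what lets the catalog's structure mirror the investigation as it unfolds.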
[Figure: metadata catalog over physical data storage]
Metadata catalog: tables of users, collections, and files; Bob's workspace over time (Dec 04, Feb 05, Mar 05), containing experiments (Experim-Dec04, Experim-Feb05).
Physical data storage: input data sets (Hurricane Ivan, SE OK quadrant, Vortice study 98-00); workflow templates; WRF output files and published results (001.nc … 150.nc, ftp://storageserver.org/file1998o768).
Capturing process in the structure
Archiving derived and temporally changing data products
4 < reads < 100; 4 < writes < 100
Personal archive catalog; runs on TeraGrid HPC machines and on TeraGrid storage servers. (Deepti, Greg, Carolyn)
Challenge: criteria for determining number of versions necessary to preserve meaningful sense of an object’s evolution over time.
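One plausible retention policy for such snapshots, offered only to illustrate the trade-off (this is not LEAD's policy): keep every snapshot from a recent window, and thin older snapshots to one per day:

```python
# Illustrative snapshot-retention policy for a temporally changing data
# product: keep all snapshots from the most recent day, thin older
# snapshots to one (the newest) per calendar day. An assumption, not
# LEAD's actual criteria.

from datetime import datetime, timedelta

def thin_snapshots(timestamps, now, recent_window=timedelta(days=1)):
    """Return the timestamps to retain, oldest first."""
    keep, seen_days = [], set()
    for ts in sorted(timestamps, reverse=True):  # newest first
        if now - ts <= recent_window:
            keep.append(ts)                      # keep everything recent
        elif ts.date() not in seen_days:
            keep.append(ts)                      # newest snapshot per old day
            seen_days.add(ts.date())
    return sorted(keep)

# 11 snapshots, one every 6 hours over 2.5 days
snaps = [datetime(2005, 3, 8, 0) + timedelta(hours=6 * i) for i in range(11)]
kept = thin_snapshots(snaps, now=datetime(2005, 3, 10, 12))
# keeps the 5 snapshots within the last 24 hours, plus one per older day
```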
Archiving derived and temporally changing data products
Estimate size of LEAD’s personal archive repository (for provisioning)
Canonical workflow - single 12 hr forecast (10%)
Educational workflow - simple analysis (50%)
Ensemble workflow - multi-forecast run (5%)
Data access workload - “retrieve all data products for Katrina and store to my personal repository” (35%)
"Job" type              Data products (p) read or        Functional nodes (n)    Fraction of users running     Workload distribution
                        written to repository per node   per workflow            this kind of 'job' at a time  for Users = 500
Canonical workflow      4 ≤ p ≤ 100                      4 ≤ n ≤ 12              .10                           50
Educational workflow    p ≈ 4                            n ≈ 4                   .50                           250
Ensemble workflow       (4 ≤ p ≤ 100) × 100              (4 ≤ n ≤ 12) × 100      .05                           25
Data access workload    p = 30 (15 read, 15 write)       n = 1                   .35                           175
-- done in advance of any real users; estimated number of users: 500
Estimating file usage distribution: base on arrival rates of LEAD observational data sources
Estimated resource needs of archival repository for 500 active users
Total sustained read/write bandwidth = 157.9 Mbps
Storage needs = 21.2 TB
I/O rate = 1,667.6 files read/written per min
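The user-distribution column of the workload table can be recomputed directly from the job mix. The bandwidth, storage, and I/O figures above additionally depend on file sizes and arrival rates not fully listed here, so only this column is reproduced:

```python
# Recompute the "Workload distribution for Users = 500" column of the
# workload table: fraction of active users running each job type.

TOTAL_USERS = 500
job_mix = {                 # fractions from the workload table
    "canonical": 0.10,
    "educational": 0.50,
    "ensemble": 0.05,
    "data_access": 0.35,
}

users_per_job = {job: round(frac * TOTAL_USERS)
                 for job, frac in job_mix.items()}
# -> canonical: 50, educational: 250, ensemble: 25, data_access: 175
```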
Empirical validation of a hypothesis often involves gathering information into a mental model. How can the archiving system help?
Mental models -- ideas, thoughts, concepts, opinions, theories, frames, schema, viewpoints, perspectives, values, beliefs…
Result models -- diagrams, maps, illustrations, visual metaphors, pictures, graphs, matrices, schematics, icons, cartoons…
• When sufficient information is gathered, the scientist synthesizes it into knowledge that allows acceptance or rejection of the hypothesis.
• The archiving system can assemble information for synthesis into knowledge.
Forecast workflow example
Steps:
-- select the geospatial region over which the forecast is to be run
-- use it as a parameter to the model (ARPS, WRF)
-- model generates products
-- products are visualized
Tracking investigation progress
MyLEAD offloads the mundane work of gathering, storing, and tracking data products used during an experimental investigation.
These products provide keys to construction of the mental "results model."
[Diagram: myLEAD service with backing database]
Constructing a ‘result object’
Result object -- a collection of key materials assembled during workflow execution deemed important to decision making. Selected derived data objects are added to the result object; determining what is important and what is not is a research challenge.
Simple example. Suppose:
1. A geospatial region is selected as input to the forecast model.
2. Based on the user's role in evacuation decisions,
3. the system adds links to the result object to display road maps and population density maps based on the geospatial region.
When the forecast model completes and the user visually examines the model results, the LEAD data subsystem simultaneously pops up maps of population density and the transportation network over that area.
This shaves minutes off critical decision-making.
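The rule in this example might look like the following sketch; the role name, link paths, and the rule itself are illustrative, not LEAD's implementation:

```python
# Hedged sketch of the result-object rule above: based on the user's
# role, attach auxiliary map products for the chosen geospatial region.
# Role names, paths, and the rule are illustrative assumptions.

def build_result_object(region, user_role):
    """Assemble the links deemed important for this user's decision making."""
    result = {"region": region, "links": []}
    if user_role == "evacuation-planner":
        # users with evacuation responsibilities also get road and
        # population-density maps for the forecast region
        result["links"].append(("road-map", f"maps/roads/{region}"))
        result["links"].append(("population-density", f"maps/popdens/{region}"))
    return result

ro = build_result_object("SE-OK-quadrant", "evacuation-planner")
```

A rule engine like this runs while the forecast executes, so the auxiliary maps are ready the moment the user opens the model results.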
Key metrics used in experimental evaluation:
Query response time -- elapsed time between when the client issues a request and when it receives the response.
Scalability -- gradually increase the amount of work the server must do to satisfy a request: add metadata for 1, 100, 1000, or 10000 files; add 1, 100, 500, or 1000 attributes to a file, all at once or one at a time.
Experiment environment
Client and server run on separate dual 2.0 GHz Opterons, 16 GB RAM
Machines connected via Fibre Channel to a 3.5 TB SAN array (16 250 GB SATA drives)
Gigabit Ethernet connection between machines
Red Hat Enterprise Linux
Test architecture and breakdown of measured system components
Test client (myLEAD toolkit) → myLEAD server (OGSA-DAI → myLEAD stored procedures → MySQL database); machines: tyr02*, tyr03
* Acknowledgements to National Science Foundation Grant No. 0202048.
Performance overhead of adding attributes to a metadata description
Issue query with 166K result set. Examine where overheads lie.
Ongoing needs in use of GIS
Components: Noesis ontology, GEO query GUI, resource catalog, myLEAD user info space, query service
Data services: THREDDS, OPeNDAP, LDM, CDM
Metadata in the FGDC-based LEAD metadata schema
Data often in binary
Example: extract temperatures for a region from surface (METAR) data, generate a shape file for the Minnesota map server