Archiving derived and temporally changing geospatial data in LEAD
Beth Plale
Department of Computer Science
School of Informatics
Indiana University
LEAD (Linked Environments for Atmospheric Discovery): dynamic, adaptive forecasting of mesoscale severe storms
GGF technologies leveraged: service-oriented architecture (moving to WSRF), WS-Notification, service registry, Globus RLS, OGSA-DAI
Beth Plale, IU -- data subsystem architecture, myLEAD personal information space, "VO" catalog
Dennis Gannon, IU -- workflow (GBPEL), portal/science gateway, TeraGrid, XSUL, notification
Oklahoma Univ. -- mesoscale meteorology
Unidata -- IDD, LDM
NCSA -- brokering
UNC (Reed) -- monitoring
UAH -- data mining of atmospheric data
Millersville, Howard University -- 6-12 and undergraduate education
NSF ATM-0331480
LEAD Data Subsystem Architecture
Access interfaces: personal workspace browser, geospatial query GUI, ask-ontology, visualization client (IDV)
Access services: Noesis ontology (concepts and vocabulary); query service (query mediation); THREDDS catalogs (web-browsable metadata); name service (single global naming system); automated metadata generation; stream service (from LDM to user's application)
Resource services: resource catalog (VO data and compute resources); myLEAD user information space; steerable instruments (CASA); grid storage repository; Unidata data dissemination client (LDM); OPeNDAP data server
Petascale data collections are increasingly crucial to research and education in science and engineering.
Current influential technology factors:
- Powerful and affordable sensors, processors, instruments, automated equipment
- Reductions in storage costs make it cost-effective to maintain large data collections
- Existence of the Internet makes it easier to share data
As a result, researchers increasingly conduct research using data originally generated by others: genomics, climate modeling, demographic studies.
Magnitude and breadth of the proliferation of data generation in the US: the same technological advances that produced inexpensive digital cameras have enabled a new generation of high-resolution scientific instruments and sensors.
An increasing amount of valuable content is "born digital" and can only be managed, preserved, and used in digital form. Advances in biomedical research depend on building and preserving complex genomic databases. Research in biodiversity and ecosystems, global climate change, meteorology, and space science depends on the ability to combine vast quantities of digital information with complex models and analytical tools.
Problem Domain: storage, retrieval, access to petascale data collections in science and engineering
Digital data collections* are the foundation for analysis using automated analytical tools
Long-lived data undergoes constant re-analysis for improved algorithms or with alternate use in mind.
Analysis depends not just on sensed or computer-generated data but on the metadata that characterizes the environment and the sensing instrument.
*Data - text, numbers, images, video or movie clips, audio, software, algorithms, equations, models, simulations
*Digital data collections - the data itself, plus the infrastructure and organizations needed to preserve access to the data.
Petascale data sets require a new work style
- Analysis tools are growing more complex; many analysis algorithms are super-linear, often needing N² or N³ time to process N data points
- I/O bandwidth has not kept pace with storage capacity: capacity increased 100-fold while storage bandwidth increased 10-fold
- Too many files (> 1 million) for a local file system to manage; file name and directory hierarchy are not enough
- Can't download the dataset to a laptop to process, analyze, and visualize; instead, move the end-user's program to the data, and communicate only questions and answers
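The capacity/bandwidth gap above can be made concrete with a back-of-envelope calculation. The 100-fold and 10-fold ratios come from the slide; the baseline capacity and bandwidth numbers are purely illustrative:

```python
# Back-of-envelope sketch of the capacity/bandwidth gap. The growth
# ratios (100x capacity, 10x bandwidth) are from the slide; the
# baseline numbers below are illustrative assumptions.

old_capacity_tb = 1.0        # baseline storage capacity (TB), assumed
old_bw_mb_s = 100.0          # baseline sustained bandwidth (MB/s), assumed

new_capacity_tb = old_capacity_tb * 100   # capacity grew 100-fold
new_bw_mb_s = old_bw_mb_s * 10            # bandwidth grew only 10-fold

def full_scan_hours(capacity_tb, bw_mb_s):
    """Time to read the entire store once at sustained bandwidth."""
    return capacity_tb * 1e6 / bw_mb_s / 3600  # TB -> MB, seconds -> hours

t_old = full_scan_hours(old_capacity_tb, old_bw_mb_s)
t_new = full_scan_hours(new_capacity_tb, new_bw_mb_s)
# The ratio is 100/10 = 10: a full scan of the store now takes
# ten times as long, which is why the program must move to the data.
```

The ratio is independent of the assumed baselines; only the 100x/10x growth factors matter.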
Problem statement
The technologies, strategies, methodologies, and resources needed to manage digital information have not kept pace with innovations in the creation and capture of digital information.
Current approaches do not scale to peta-scale data collections.
Typical analysis for mesoscale meteorologists
Compare model results to observational data
Research Domain: Archiving derived data products and temporally changing data products.
Archiving - saving “born-digital” content for future use and reuse
Derived data products - data products that are result of further processing of original raw data
Temporally changing data products - data that changes continuously through regular additions streamed into the archive, through ad hoc actions taken by content creators, or in conjunction with workflow processes.
Approach: general data models, standardized metadata schemas, a standard and highly modular system-level architecture (grid computing), and well-accepted communication protocols
Our current research challenges are in:
- Repository architecture: define the technical architecture; build tools to acquire, use, and store data; predict repository use for provisioning physical infrastructure
- Representation of temporal and procedural relationships: provenance; automated metadata generation; snapshots of temporally changing data products
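As a rough illustration of the automated-metadata-generation challenge, the sketch below maps header fields of a hypothetical data product to catalog attributes. Field and attribute names here are invented for illustration, not the LEAD metadata schema:

```python
# Hypothetical sketch of automated metadata generation: derive catalog
# attributes from a product's header fields. Field and attribute names
# are illustrative, not the actual LEAD schema.

def generate_metadata(product_name, header):
    """Map raw header fields to catalog attributes."""
    attrs = {"productName": product_name}
    if "valid_time" in header:
        attrs["timeOfValidity"] = header["valid_time"]
    if all(k in header for k in ("lat_min", "lat_max", "lon_min", "lon_max")):
        # record the geospatial bounding box for region queries
        attrs["spatialCoverage"] = (header["lat_min"], header["lat_max"],
                                    header["lon_min"], header["lon_max"])
    # a product with recorded source products is a derived product
    attrs["derived"] = header.get("source_products") is not None
    return attrs

meta = generate_metadata("wrf_out_001.nc", {
    "valid_time": "2004-09-15T12:00Z",
    "lat_min": 33.0, "lat_max": 37.0, "lon_min": -100.0, "lon_max": -94.0,
    "source_products": ["metar_obs", "eta_grid"],
})
```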
User access to the personal workspace is through the LEAD portal
Early interface for sharing data
Creating structure in user’s archive that models their investigation steps
[Figure: forecast workflow interacting with myLEAD]
Workflow (12 hrs total): gather data products → run 12-hour forecast (6 hrs to complete) → analyze results → based on analysis, gather other products → analyze results → run 6-hour forecast (3 hrs to complete).
The myLEAD agent forwards product requests, product registrations, and notification messages to the myLEAD server; a decoder service and a notification service participate in the workflow.
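The interaction above can be sketched as a toy workflow that registers each derived product with a catalog stub as steps complete. The API is hypothetical; the real myLEAD agent communicates via service requests and notification messages:

```python
# Toy sketch (hypothetical API) of a workflow registering each derived
# product with a myLEAD-style catalog as steps complete.

class CatalogStub:
    """Stand-in for the myLEAD server: records product registrations."""
    def __init__(self):
        self.registered = []

    def register(self, experiment, step, product):
        self.registered.append((experiment, step, product))

def run_workflow(catalog, experiment):
    # step names and file names are illustrative
    steps = [
        ("gather-data", ["metar_obs.nc", "eta_grid.nc"]),
        ("forecast-12hr", ["wrf_out_001.nc"]),
        ("analyze", ["analysis_001.nc"]),
        ("forecast-6hr", ["wrf_out_002.nc"]),
    ]
    for step, products in steps:
        for p in products:
            # each product is archived as soon as its step emits it
            catalog.register(experiment, step, p)

catalog = CatalogStub()
run_workflow(catalog, "Experim-Dec04")
```

Registering per step, rather than at workflow completion, is what lets the catalog's structure mirror the investigation as it unfolds.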
[Figure: metadata catalog over physical data storage]
Metadata catalog: tables of users, collections, and files; Bob's workspace over time (Dec 04, Feb 05, Mar 05), containing experiments (Experim-Dec04, Experim-Feb05).
Physical data storage: input data sets (Hurricane Ivan, SE OK quadrant, Vortice study 98-00); workflow templates; WRF output files and published results (001.nc … 150.nc, ftp://storageserver.org/file1998o768).
Capturing process in the structure
Archiving derived and temporally changing data products
4 < reads < 100; 4 < writes < 100
Personal archive catalog; runs on TeraGrid HPC machines and on TeraGrid storage servers. (Deepti, Greg, Carolyn)
Challenge: criteria for determining number of versions necessary to preserve meaningful sense of an object’s evolution over time.
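One plausible retention policy for such snapshots, offered only to illustrate the trade-off (this is not LEAD's policy): keep every snapshot from a recent window, and thin older snapshots to one per day:

```python
# Illustrative snapshot-retention policy for a temporally changing data
# product: keep all snapshots from the most recent day, thin older
# snapshots to one (the newest) per calendar day. An assumption, not
# LEAD's actual criteria.

from datetime import datetime, timedelta

def thin_snapshots(timestamps, now, recent_window=timedelta(days=1)):
    """Return the timestamps to retain, oldest first."""
    keep, seen_days = [], set()
    for ts in sorted(timestamps, reverse=True):  # newest first
        if now - ts <= recent_window:
            keep.append(ts)                      # keep everything recent
        elif ts.date() not in seen_days:
            keep.append(ts)                      # newest snapshot per old day
            seen_days.add(ts.date())
    return sorted(keep)

# 11 snapshots, one every 6 hours over 2.5 days
snaps = [datetime(2005, 3, 8, 0) + timedelta(hours=6 * i) for i in range(11)]
kept = thin_snapshots(snaps, now=datetime(2005, 3, 10, 12))
# keeps the 5 snapshots within the last 24 hours, plus one per older day
```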
Archiving derived and temporally changing data products
Estimate size of LEAD’s personal archive repository (for provisioning)
Canonical workflow - single 12 hr forecast (10%)
Educational workflow - simple analysis (50%)
Ensemble workflow - multi-forecast run (5%)
Data access workload - “retrieve all data products for Katrina and store to my personal repository” (35%)
"Job" type              Data products (p) read or        Functional nodes (n)    Fraction of users running     Workload distribution
                        written to repository per node   per workflow            this kind of 'job' at a time  for Users = 500
Canonical workflow      4 ≤ p ≤ 100                      4 ≤ n ≤ 12              .10                           50
Educational workflow    p ≈ 4                            n ≈ 4                   .50                           250
Ensemble workflow       (4 ≤ p ≤ 100) × 100              (4 ≤ n ≤ 12) × 100      .05                           25
Data access workload    p = 30 (15 read, 15 write)       n = 1                   .35                           175
-- done in advance of any real users; estimated number of users: 500
Estimating file usage distribution: base on arrival rates of LEAD observational data sources
Estimated resource needs of archival repository for 500 active users
Total sustained read/write bandwidth = 157.9 Mbps
Storage needs = 21.2 TB
I/O rate = 1,667.6 files read/written per min
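The user-distribution column of the workload table can be recomputed directly from the job mix. The bandwidth, storage, and I/O figures above additionally depend on file sizes and arrival rates not fully listed here, so only this column is reproduced:

```python
# Recompute the "Workload distribution for Users = 500" column of the
# workload table: fraction of active users running each job type.

TOTAL_USERS = 500
job_mix = {                 # fractions from the workload table
    "canonical": 0.10,
    "educational": 0.50,
    "ensemble": 0.05,
    "data_access": 0.35,
}

users_per_job = {job: round(frac * TOTAL_USERS)
                 for job, frac in job_mix.items()}
# -> canonical: 50, educational: 250, ensemble: 25, data_access: 175
```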
Empirical validation of a hypothesis often involves gathering information into a mental model. How can the archiving system help?
Mental models -- ideas, thoughts, concepts, opinions, theories, frames, schema, viewpoints, perspectives, values, beliefs…
Result models -- diagrams, maps, illustrations, visual metaphors, pictures, graphs, matrices, schematics, icons, cartoons…
• When sufficient information is gathered, the scientist synthesizes it into knowledge that allows acceptance or rejection of the hypothesis.
• The archiving system can assemble information for synthesis into knowledge.
Forecast workflow example
Steps:
-- select the geospatial region over which the forecast is to be run
-- use it as a parameter to the model (ARPS, WRF)
-- model generates products
-- products are visualized
Tracking investigation progress
MyLEAD offloads the mundane work of gathering, storing, and tracking data products used during an experimental investigation.
These products provide keys to construction of the mental "results model."
[Diagram: myLEAD service with backing database]
Constructing a ‘result object’
Result object -- a collection of key materials assembled during workflow execution deemed important to decision making. Selected derived data objects are added to the result object; determining what is important and what is not is a research challenge.
Simple example. Suppose:
1. A geospatial region is selected as input to the forecast model.
2. Based on the user's role in evacuation decisions,
3. the system adds links to the result object to display road maps and population density maps based on the geospatial region.
When the forecast model completes and the user visually examines the model results, the LEAD data subsystem simultaneously pops up maps of population density and the transportation network over that area.
This shaves minutes off critical decision-making.
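The rule in this example might look like the following sketch; the role name, link paths, and the rule itself are illustrative, not LEAD's implementation:

```python
# Hedged sketch of the result-object rule above: based on the user's
# role, attach auxiliary map products for the chosen geospatial region.
# Role names, paths, and the rule are illustrative assumptions.

def build_result_object(region, user_role):
    """Assemble the links deemed important for this user's decision making."""
    result = {"region": region, "links": []}
    if user_role == "evacuation-planner":
        # users with evacuation responsibilities also get road and
        # population-density maps for the forecast region
        result["links"].append(("road-map", f"maps/roads/{region}"))
        result["links"].append(("population-density", f"maps/popdens/{region}"))
    return result

ro = build_result_object("SE-OK-quadrant", "evacuation-planner")
```

A rule engine like this runs while the forecast executes, so the auxiliary maps are ready the moment the user opens the model results.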
Key metrics used in experimental evaluation:
Query response time -- elapsed time between when the client issues a request and when it receives the response.
Scalability -- gradually increase the amount of work the server must do to satisfy a request: add metadata for 1, 100, 1000, or 10000 files; add 1, 100, 500, or 1000 attributes to a file, all at once or one at a time.
Experiment environment
Client and server run on separate dual 2.0 GHz Opterons, 16 GB RAM
Machines connected via Fibre Channel to a 3.5 TB SAN array (16 250 GB SATA drives)
Gigabit Ethernet connection between machines
Red Hat Enterprise Linux
Test architecture and breakdown of measured system components
Test client (myLEAD toolkit) → myLEAD server (OGSA-DAI → myLEAD stored procedures → MySQL database); machines: tyr02*, tyr03
* Acknowledgements to National Science Foundation Grant No. 0202048.
Performance overhead of adding attributes to a metadata description
Issue query with 166K result set. Examine where overheads lie.
Ongoing needs in use of GIS
Components: Noesis ontology, GEO query GUI, resource catalog, myLEAD user info space, query service
Data services: THREDDS, OPeNDAP, LDM, CDM
Metadata in the FGDC-based LEAD metadata schema
Data often in binary
Example: extract temperatures for a region from surface (METAR) data, generate a shape file for the Minnesota map server