Large-scale Data Management Challenges of Southern California Earthquake Center (SCEC)
Philip J. Maechling ([email protected]), Information Technology Architect
Southern California Earthquake Center
Research Data Access and Preservation Summit
Phoenix, Arizona, 9 April 2010
Interagency Working Group on Digital Data (2009)
Consider the Digital Data Life Cycle
Can we Validate this Life Cycle Model against Digital Data Life Cycle Observations?
Digital Data Life Cycle Origination – Jan 2009
Digital Data Life Cycle Completion – Jan 2010
Notable Earthquakes in 2010
The SCEC Partnership
National Partners
International Partners
Core Institutions
Participating Institutions
SCEC Member Institutions (November 1, 2009)
Core Institutions (16)
California Institute of Technology; Columbia University; Harvard University; Massachusetts Institute of Technology; San Diego State University; Stanford University; U.S. Geological Survey, Golden; U.S. Geological Survey, Menlo Park; U.S. Geological Survey, Pasadena; University of California, Los Angeles; University of California, Riverside; University of California, San Diego; University of California, Santa Barbara; University of California, Santa Cruz; University of Nevada, Reno; University of Southern California (lead)
Participating Institutions (53)
Appalachian State University; Arizona State University; Berkeley Geochron Center; Boston University; Brown University; Cal-Poly, Pomona; Cal-State, Long Beach; Cal-State, Fullerton; Cal-State, Northridge; Cal-State, San Bernardino; California Geological Survey; Carnegie Mellon University; Case Western Reserve University; CICESE (Mexico); Cornell University; Disaster Prevention Research Institute, Kyoto University (Japan); ETH (Switzerland); Georgia Tech; Institute of Earth Sciences of Academia Sinica (Taiwan); Earthquake Research Institute, University of Tokyo (Japan); Indiana University; Institute of Geological and Nuclear Sciences (New Zealand); Jet Propulsion Laboratory; Los Alamos National Laboratory; Lawrence Livermore National Laboratory; National Taiwan University (Taiwan); National Central University (Taiwan); Ohio State University; Oregon State University; Pennsylvania State University; Princeton University; Purdue University; Texas A&M University; University of Arizona; UC, Berkeley; UC, Davis; UC, Irvine; University of British Columbia (Canada); University of Cincinnati; University of Colorado; University of Massachusetts; University of Miami; University of Missouri-Columbia; University of Oklahoma; University of Oregon; University of Texas-El Paso; University of Utah; University of Western Ontario (Canada); University of Wisconsin; University of Wyoming; URS Corporation; Utah State University; Woods Hole Oceanographic Institution
SCEC Earthquake System Models & Focus Groups
[Diagram: fault models, block models, and deformation models feed a unified structural representation (with anelastic structures); earthquake rupture forecasts, earthquake rupture models, attenuation relationships, and ground motion simulations build on it to produce seismic hazard products and risk mitigation products. Focus groups: Ground Motion Prediction; Crustal Deformation Modeling; Fault & Rupture Mechanics; Earthquake Forecasting & Prediction; Seismic Hazard & Risk Analysis; Tectonic Evolution & Boundary Conditions (B.C.s); Lithospheric Architecture & Dynamics.]
Southern California Earthquake Center
• Involves more than 600 experts at over 60 institutions worldwide
• Focuses on earthquake system science using Southern California as a natural laboratory
• Translates basic research into practical products for earthquake risk reduction, contributing to NEHRP
SCEC Leadership Teams
Board of Directors
Staff
Planning Committee
Earthquakes are system-level phenomena…
They emerge from complex, long-term interactions within active fault systems that are opaque, and thus are difficult to observe.
They cascade as chaotic chain reactions through the natural and built environments, and thus are difficult to predict.
[Diagram: earthquake cascade timeline. Anticipation time runs from century, decade, year, month, week, and day up to fault rupture at the origin time; response time runs from minutes to decades afterward. Pre-rupture processes: tectonic loading, stress accumulation, slow slip transients, stress transfer, dynamic triggering, foreshocks, nucleation. Post-rupture effects: seismic shaking, surface faulting, seafloor deformation, tsunami, landslides, liquefaction, aftershocks, structural and nonstructural damage to the built environment, fires, human casualties, disease, and socioeconomic aftereffects.]
Computational codes, structural models, and simulation results versioned with associated tests.
Development of new computational, data, and physical models.
Automated retrospective testing of forecast models using community-defined validation problems.
Automated prospective performance evaluation of forecast models over time within collaborative forecast testing center.
CME Platform and Data Administration System
[Diagram: the CME cyberinfrastructure supports a broad range of research computing with computational and data resources. Individual and collaborative research projects contribute, annotate, discover, and access digital artifacts through programmable interfaces, guided by the CME Platform and Data Management TAG. External elements include seismic/tsunami models, seismic data centers, HPC resource providers, real-time earthquake monitoring, public and governmental forecasts, and engineering and interdisciplinary research.]
Future of solid earth computational science
Echo Cliffs PBR
Echo Cliffs PBR (precariously balanced rock) in the Santa Monica Mountains is >14 m high and has a 3-4 s free period. This rock withstood ground motions estimated at 0.2 g and 12 cm/s during the Northridge earthquake. Such fragile geologic features place important constraints on probabilistic seismic hazard analysis (PSHA).
Simulate Observed Earthquakes
Then, validate the simulation model by comparing simulation results against observational data recorded by seismic sensors (red: simulation results, black: observed data).
Simulate Potential Future Earthquakes
SCEC Roadmap to Petascale Earthquake Computing
(Roadmap timeline: 2004–2012)
TeraShake1.x
ShakeOut 1.x
TeraShake2.x
ShakeOut 2.x
Chino Hills 1.x
M8 1.x
M8 2.x
M8 3.1
96% parallel efficiency on 40K IBM TJ Watson BG/L cores.
BGW
First large wave propagation simulations of Mw7.7 earthquakes on the southern San Andreas, with a maximum frequency of 0.5 Hz, run using kinematic source descriptions based on the Denali earthquake. 240 SDSC DataStar cores used; 53 TB of output, the largest simulation output recorded at the time.
Simulations of Mw7.7 earthquakes in 2005-2006 using source descriptions generated by dynamic rupture simulations. The dynamic rupture simulations were based on Landers initial stress conditions and used 1,024 NCSA TeraGrid cores.
Simulations of Mw7.8 earthquakes with a maximum frequency of 1.0 Hz, run using kinematic source descriptions based on geological observations. 1,920 TACC Lonestar cores used.
Simulations of Mw7.8 earthquakes, up to 1.0 Hz, using source descriptions generated by staggered-grid split-node (SGSN) dynamic rupture simulations. The ShakeOut 2.x dynamic rupture simulations were constructed to produce final surface slip equivalent to the ShakeOut 1.x kinematic sources. 32K TACC Ranger cores used.
Comparison of simulated and recorded ground motions for the 2008 Mw5.4 Chino Hills earthquake. Two simulations were conducted using meshes extracted from the CMU eTree database for CVM4 and CVM-H; 64K NICS Kraken cores used.
Simulations of a Mw8.0 scenario on the SAF from the Salton Sea to Parkfield ('Wall-to-Wall'), up to 1.0 Hz. The source description was generated by combining several Mw7.8 dynamic source descriptions ('ShakeOut-D'). 96K NICS Kraken cores used.
40-m spacing and 435 billion mesh points; M8 2.x to run on 230K NCCS Jaguar cores, the world's most powerful machine.
SciDAC OASCR Award
TeraGrid Viz Award
Most-read article of the year
15 million SUs awarded, the largest NSF TeraGrid allocation
INCITE allocations
M8 3.2
New model under development to handle complex geometry, topography, and non-planar fault surfaces.
Big 10
Simulation of a M9.0 megaquake in the Pacific Northwest
ShakeOut verification with 3 models
BG/L
TACC Ranger
ALCF BG/P
NICS Kraken
Wave propagation simulation with improved source descriptions: dx = 25 m, Mw8.0, 2 Hz, 2,048 billion mesh points, 256x bigger than current runs.
Dynamic rupture simulation, dx = 5 m (50 x 25 x 25 km). Improve earthquake source descriptions by integrating more realistic friction laws into dynamic rupture simulations and by computing at scales spanning the inner scale of friction processes and the outer scale of large faults.
SCEC: An NSF + USGS Research Center
Panel Questions
• What technical solutions exist that meet your academic project requirements?
• What requirements are unique to the academic environment?
• Are there common approaches for managing large-scale collections?
Simulation Results Versus Data
• The context of this workshop is Research Data Management.
 – I would like to communicate characteristics of the data management needed to perform seismic hazard computational research.
• I will refer to our simulation results as "data".
 – Some groups distinguish observational data from simulation results.
 – This distinction becomes more difficult as observational data and simulation results are combined.
• For today's presentation, I will focus on management of SCEC simulation results, which may include both observational data and simulation results.
SCEC Storage Volume by Type
Estimated SCEC Data Archives (Total Current Archives ~ 1.4 PB)
SCEC Storage Elements (Files, Rows) by Type
Estimated SCEC Data Archives (Total Current Archives ~ 100M files, 600M rows)
Consider the Digital Data Life Cycle
Estimated SCEC Simulation Archives in Terabytes by Storage Location
Goal:
• 1 Hz body waves
• Up to 0.5 Hz surface waves
Sources & Receivers:
• 150 three-component stations [Nr]
• 200 earthquakes [Ns]
Simulation parameters:
• 200 m spacing, 1,872 M mesh points
• 2 min time series, 12,000 time steps
Costs:
• 2 TB per SWF
• 6 TB per RGT
• 2 hr per run
• 10.4 M CPU-hrs (650 runs, 3.6 months on 4,000 cores)
• 400-600 TB
Data Management Context for SCEC
• Academic research groups responding to NSF proposals: aggressive, large-scale, collaborative, with a need for transformative, innovative, original research (bigger, larger, faster)
• Data management tools and processes managed by heavily burdened academic staff
Data Management Context for SCEC
• Academic research is very cost-sensitive when adopting new technologies
• HPC capabilities largely based on integrating existing cyberinfrastructure (CI), not new CI development
• Largely based on the use of other people's computers and storage systems (resulting in widely distributed archives)
Panel Questions
• What technical solutions exist that meet your academic project requirements?
• What requirements are unique to the academic environment?
• Are there common approaches for managing large-scale collections?
SCEC Milestone Capability Runs

Milestone Runs | TS1 | TS2 | DS2 | SO1 | SO2 | CH50m | W2W-1 | CH15m* | M8 | W2W-3**
Machine | SDSC DataStar | SDSC DataStar | NCSA IA-64 | TACC LoneStar | TACC Ranger | NICS Kraken | NICS Kraken | NICS Kraken | NCCS Jaguar | NCSA Blue Waters
Outer scale (km) | 600 | 600 | 299 | 600 | 600 | 180 | 800 | 183 | 810 | 800
Inner scale (m) | 200 | 200 | 100 | 100 | 100 | 50 | 100 | 15 | 40 | 25
Max frequency (Hz) | 0.5 | 0.5 | 1.0 | 1.0 | 1.0 | 2.0 | 1.0 | 3.3 | 1.0 | 2.0
Min surface vel. (m/s) | 500 | 500 | 500 | 500 | 500 | 500 | 500 | 250 | 200 | 250
Mesh points | 1.8E+09 | 1.8E+09 | 9.6E+08 | 1.4E+10 | 1.4E+10 | 1.1E+10 | 3.1E+10 | 3.0E+11 | 4.4E+11 | 2.0E+12
Time steps | 22,768 | 22,768 | 13,637 | 45,456 | 50,000 | 80,000 | 60,346 | 100,000 | 120,000 | 320,000
Vel. model input (TB) | 0.05 | 0.05 | 0.03 | 0.42 | 0.42 | 0.31 | 0.89 | 6.87 | 12.68 | 59.60
Storage w/o ckpt (TB) | 53.0 | 10.0 | 9.5 | 0.5 | 0.5 | 1.9 | 0.3 | 66.4 | 39.9 | 400.0
Cores used | 240 | 1,920 | 1,024 | 1,920 | 32,000 | 64,000 | 96,000 | 96,000 | 223,080 | 320K**
Wall-clock time (hrs) | 66.8 | 6.7 | 35.2 | 32.0 | 6.9 | 2.3 | 2.5 | 24 | 21.2 | 45**
Sustained TeraFlop/s | 0.04 | 0.43 | 0.68 | 1.44 | 7.29 | 26.86 | 50.00 | 87.00 | 174.00 | 1,000**

* benchmarked, ** estimated
Data Transfer, Archive and Management
(Zhou et al., CSO’10)
• Input/output data transfer between SDSC disk/HPSS and Ranger disk at transfer rates up to 450 MB/s using Globus GridFTP
• 90k-120k files per simulation; 150 TB generated on Ranger, organized as a separate sub-collection in iRODS
• Direct data transfer using iRODS from Ranger to SDSC SAM-QFS at up to 177 MB/s using our data ingestion tool PIPUT
• Sub-collections published through the SCEC digital library (168 TB in size), integrated through the SCEC portal into seismic-oriented interaction environments
CyberShake Data Management Numbers
• CyberShake
 – 8.5 TB staged in (~700k files) to TACC's Ranger
 – 2.1 TB staged out (~36k files) to SCEC storage
 – 190 million jobs executed on the grid
 – 750,000 files stored in RLS
[Figure: CyberShake map]
CyberShake Production Run - 2009
• Run from 4/16/09 – 6/10/09
• 223 sites
 – Curve produced every 5.4 hrs
• 1,207 hrs (92% uptime)
 – 4,420 cores on average
 – 14,540 peak (23% of Ranger)
• 192 million tasks
 – 44 tasks/sec
 – 3.8 million Condor jobs
• 192 million files
 – 11 TB output, 165 TB temp
Challenge: Millions of tasks
• Automation is key
 – Workflows with clustering (include all executions, staging, notification)
 – Job submission
• Data management
 – Millions of data files
 – Pegasus provides staging
 – Automated checks: correct number of files; NaN and zero-value checks; MD5 checksums (see the sketch below)
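As an illustration of those checks, here is a minimal Python sketch, not SCEC's actual tooling: the `.grm` extension, the flat float32 trace layout, and the directory layout are assumptions made for the example.

```python
import hashlib
from array import array
from pathlib import Path

def md5sum(path, chunk=1 << 20):
    """Stream a file through MD5 without loading it all into memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def trace_ok(path):
    """Flag NaN or all-zero traces in a flat float32 file (native byte order)."""
    samples = array("f", Path(path).read_bytes())
    has_nan = any(x != x for x in samples)   # NaN is the only float != itself
    return not has_nan and any(x != 0.0 for x in samples)

def validate_run(run_dir, expected_count):
    """Count output files, sanity-check each trace, and record checksums."""
    files = sorted(Path(run_dir).glob("*.grm"))   # hypothetical extension
    if len(files) != expected_count:
        raise RuntimeError(f"expected {expected_count} files, got {len(files)}")
    bad = [f.name for f in files if not trace_ok(f)]
    if bad:
        raise RuntimeError(f"NaN/zero traces: {bad}")
    return {f.name: md5sum(f) for f in files}
```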
What is a DAG workflow?
Jobs with dependencies are organized in a directed acyclic graph (DAG); a large number of similar DAGs makes up a workflow, as in the sketch below.
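A toy sketch of one such DAG in Python; the job names are invented for illustration, and a production workflow hands many thousands of these to a planner such as Pegasus/DAGMan rather than running them in-process.

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# One small DAG: each job maps to the set of jobs it depends on.
dag = {
    "extract_sgt": set(),
    "synthesize_seismograms": {"extract_sgt"},
    "compute_peak_amplitudes": {"synthesize_seismograms"},
    "insert_hazard_curve": {"compute_peak_amplitudes"},
}

# Run jobs in dependency order; printing stands in for grid submission.
for job in TopologicalSorter(dag).static_order():
    print("submit", job)
```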
(Source: GlobusWORLD 2003, "The Globus View of Data Architecture")
GriPhyN Virtual Data System
• Virtual data language
 – Users define desired transformations
 – Logical names for data and transformations
• Virtual data catalog
 – Stores information about transformations, derivations, logical inputs/outputs
• Query tool
 – Retrieves necessary transformations given a description of them
 – Gives an abstract workflow
• Pegasus
 – Tool for executing abstract workflows on the grid
• Virtual Data Toolkit (VDT): part of the GriPhyN and iVDGL projects
 – Includes existing technology (Globus, Condor) and experimental software (Chimera, Pegasus)
[Diagram: virtual data applications use the Virtual Data Language (XML) and a VDL API/CLI to manipulate derivations and transformations; the Virtual Data Catalog (implementing the Chimera virtual data schema) produces Chimera task graphs (compute and data-movement tasks, with dependencies), which the GriPhyN VDT (Replica Catalog, DAGman, Globus Toolkit, etc.) executes on data grid resources for distributed execution and data management.]
Functional View of Grid Data Management
[Diagram: an Application consults a Metadata Service (location based on data attributes), a Replica Location Service (location of one or more physical replicas), and Information Services (state of grid resources, performance measurements and predictions). A Planner performs data location, replica selection, and selection of compute and storage nodes; an Executor initiates data transfers and computations via data movement and data access services over compute and storage resources, all under security and policy controls.]
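To make the planner's replica-selection step concrete, here is a hedged sketch: a replica-location-style catalog maps a logical file name to physical copies, and the planner picks the copy with the best estimated rate. The URLs and catalog contents are illustrative; the rates echo the GridFTP and iRODS figures quoted earlier.

```python
# Illustrative catalog: logical file name -> [(physical URL, bytes/s), ...]
replica_catalog = {
    "lfn://scec/shakeout/velocity_mesh": [
        ("gsiftp://ranger.tacc.utexas.edu/work/mesh.bin", 450e6),
        ("gsiftp://samqfs.sdsc.edu/archive/mesh.bin", 177e6),
    ],
}

def select_replica(lfn, size_bytes):
    """Return (physical URL, estimated seconds) for the fastest known replica."""
    url, rate = max(replica_catalog[lfn], key=lambda replica: replica[1])
    return url, size_bytes / rate

url, secs = select_replica("lfn://scec/shakeout/velocity_mesh", 0.42e12)
print(f"{url}: ~{secs / 3600:.2f} hours")
```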
Panel Questions
• What technical solutions exist that meet your academic project requirements?
• What requirements are unique to the academic environment?
• Are there common approaches for managing large-scale collections?
Treat Simulation Data as Depreciating Asset
Simulation results differ from observational data:
- They tend to be larger
- They can often be recomputed
- They often decrease in value with time
- They carry less well-defined metadata
Collaborate with Existing Data Center
Avoid re-inventing data management centers:
- (Re)train observational data centers to manage simulation data
- Change the culture so deleting data is acceptable
Simulation Data as Depreciating Asset
Manage simulation results as a depreciating asset (a minimal sketch follows):
- Unique persistent IDs for all data sets
- Track the cost to produce, and the cost to re-generate, every data set
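A minimal sketch of that bookkeeping. The cost rates and the keep-or-recompute policy are invented for illustration; the M8 size and core-hours echo the milestone table above.

```python
import uuid
from dataclasses import dataclass

@dataclass
class SimDataset:
    pid: str                   # unique persistent ID
    size_tb: float
    produce_core_hours: float  # cost to produce originally
    regen_core_hours: float    # cost to re-generate today

def keep_on_disk(ds, usd_per_tb_year, usd_per_core_hour, years=1.0):
    """Retain a data set only while storing it costs less than re-computing it."""
    storage_cost = ds.size_tb * usd_per_tb_year * years
    regen_cost = ds.regen_core_hours * usd_per_core_hour
    return storage_cost < regen_cost

# M8 size and original core-hours from the milestone table; other values invented.
m8 = SimDataset(pid=str(uuid.uuid4()), size_tb=39.9,
                produce_core_hours=223_080 * 21.2, regen_core_hours=2.0e6)
print(keep_on_disk(m8, usd_per_tb_year=150.0, usd_per_core_hour=0.05))
```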
Simulation Data as Depreciating Asset
Responsibilities of researchers who want a lot of storage:
- Default storage lifetime is always limited
- Longer-term storage is based on community use, community value, and readiness for use by the community
- The burden on researchers for long-term storage is more time spent adding metadata
Remove the Compute/Data Distinction
Compute models should always have associated verification and validation results, and data sets should always have codes demonstrating access and usage.
Apply automated acceptance tests to all codes and automated retrieval tests to all data sets (sketched below).
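A pytest-style sketch of that symmetry, with stand-in functions in place of real simulation and retrieval codes; the file name and tolerance are invented for illustration.

```python
import math

def run_point_source_case():
    """Stand-in for running the wave propagation code on a reference problem."""
    return 0.9997  # pretend peak amplitude from the simulation

def open_seismograms(path):
    """Stand-in for the data set's published access/retrieval code."""
    return [[0.0, 0.1, 0.2]]  # pretend one three-sample trace

def test_code_verification():
    # Every compute model carries a test against a known reference value.
    assert math.isclose(run_point_source_case(), 1.0, rel_tol=1e-3)

def test_dataset_retrieval():
    # Every data set carries a test that its access code still reads it.
    assert len(open_seismograms("shakeout/site_USC.grm")) > 0
```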
Data Storage Entropy Resistance
Data sets will grow to fill available storage:
- We recognize the need to make efficient storage practices routine
Data Storage Entropy Resistance
We are looking for data management tools that let project management administer simulation results project-wide by providing information such as:
- Total project and per-user storage in use
- Time since data was last accessed
- Understanding of backups and replicas
A sketch of such a report follows.
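A hedged sketch of the first two reports (per-owner totals and time since last access) over a POSIX directory tree; the archive path is hypothetical, and access times may be frozen on noatime-mounted filesystems.

```python
import time
from collections import defaultdict
from pathlib import Path

def storage_report(root, stale_days=365):
    """Per-owner usage and files untouched for stale_days under root."""
    usage = defaultdict(int)          # owner uid -> total bytes
    stale = []
    cutoff = time.time() - stale_days * 86_400
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        st = path.stat()
        usage[st.st_uid] += st.st_size
        if st.st_atime < cutoff:      # time since last access
            stale.append(path)
    return usage, stale

usage, stale = storage_report("/scec/archive")   # hypothetical archive root
for uid, nbytes in sorted(usage.items()):
    print(uid, f"{nbytes / 1e12:.2f} TB")
print(len(stale), "files not accessed in the past year")
```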
Metadata Strategies
Development of simulation metadata led to extended effort with minimal value to geoscientists:
- Ontology development as a basis for metadata has not (yet?) shown significant value in the field
- The difficulty stems from the need to anticipate all possible future uses
Controlled Vocabulary Tools
Controlled vocabulary management is based on community wiki systems, with subjects and terms used as tags in simulation data descriptions:
- Need tools for converting wiki labels and entries to relational database entries
- Need smooth integration between the relational database (storing the metadata) and the wiki system (see the sketch below)
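A small sketch of that wiki-to-database integration; the wiki dump format, terms, and SQLite schema are invented for illustration.

```python
import sqlite3

# Pretend wiki dump: community-curated (term, definition) pairs.
wiki_dump = [
    ("source-model", "Description of the earthquake source used in a run"),
    ("kinematic", "Prescribed slip-time history, not spontaneous rupture"),
]

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE vocabulary (term TEXT PRIMARY KEY, definition TEXT)")
db.executemany("INSERT INTO vocabulary VALUES (?, ?)", wiki_dump)

# Simulation metadata then references controlled terms by foreign key.
db.execute("""CREATE TABLE run_tags (run_id TEXT,
              term TEXT REFERENCES vocabulary(term))""")
db.execute("INSERT INTO run_tags VALUES ('M8-1.0', 'kinematic')")
print(db.execute("""SELECT run_id, definition FROM run_tags
                    JOIN vocabulary USING (term)""").fetchall())
```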
Metadata Strategies
Current simulation metadata is based on practical use cases:
- Metadata is saved to support reproduction of the data analysis described in publications
- Metadata is saved as needed to re-run a simulation
- Unanticipated future uses of simulation data are often not supported
One illustrative record shape is sketched below.
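One illustrative shape for such a record, keyed to the re-run and reproduce-the-publication use cases. The field names are not SCEC's actual schema; the values echo the M8 milestone entries, and the code name and input file are assumptions.

```python
import json

record = {
    "pid": "scec:sim:m8-1.0",                       # unique persistent ID
    "code": {"name": "AWP-ODC", "version": "1.0"},  # simulation code + version
    "inputs": {"velocity_model": "CVM4", "source": "m8_kinematic.srf"},
    "parameters": {"dx_m": 40, "max_freq_hz": 1.0, "time_steps": 120000},
    "execution": {"machine": "NCCS Jaguar", "cores": 223080,
                  "wall_clock_hrs": 21.2},          # proxy for regeneration cost
    "supports": "data analysis in the associated publication",
}
print(json.dumps(record, indent=2))
```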
End