21
ExArch: Climate analytics on distributed exascale data archives Martin Juckes , V. Balaji, B.N. Lawrence, M. Lautenschlager, S. Denvil, G. Aloisio, P. Kushner, D. Waliser The ExArch project seeks to develop a framework for the scientific interpretation of multi-model ensembles at the peta- and exa-scale. Specifically, a framework for evaluating the imminent Climate Model Intercomparison Project, Phase 5 (CMIP5) archive, which will be largest of its kind ever assembled in this domain and the Coordinated Regional Downscaling experiment (CORDEX), which will push even beyond CMIP5 in resolution, albeit on the regional scale.

ExArch: Climate analytics on distributed exascale data archives Martin Juckes, V. Balaji, B.N. Lawrence, M. Lautenschlager, S. Denvil, G. Aloisio, P. Kushner,

Embed Size (px)

Citation preview

Page 1: ExArch: Climate analytics on distributed exascale data archives Martin Juckes, V. Balaji, B.N. Lawrence, M. Lautenschlager, S. Denvil, G. Aloisio, P. Kushner,

ExArch: Climate analytics on distributed exascale data

archivesMartin Juckes, V. Balaji, B.N. Lawrence,

M. Lautenschlager, S. Denvil, G. Aloisio, P. Kushner, D. Waliser

The ExArch project seeks to develop a framework for the scientific interpretation of multi-model ensembles at the peta- and exa-scale. Specifically, a framework for evaluating the imminent Climate Model Intercomparison Project, Phase 5 (CMIP5) archive, which will be largest of its kind ever assembled in this domain and the Coordinated Regional Downscaling experiment (CORDEX), which will push even beyond CMIP5 in resolution, albeit on the regional scale.

Page 2: ExArch: Climate analytics on distributed exascale data archives Martin Juckes, V. Balaji, B.N. Lawrence, M. Lautenschlager, S. Denvil, G. Aloisio, P. Kushner,

ExArch: management numbersStart: March 1st, 2011Duration: 39 monthsBudget: 1.44 million EurosEffort: 246 staff months

Page 3: ExArch: Climate analytics on distributed exascale data archives Martin Juckes, V. Balaji, B.N. Lawrence, M. Lautenschlager, S. Denvil, G. Aloisio, P. Kushner,

Global Organization for Earth System Science Portals (GO-ESSP) [unfunded] .... a federation of frameworks that can work together using agreed-upon standards

METAFOR (EU FP7): Detailed meta-data for

climate models

IS-ENES (EU FP7): Infrastructure support, including distributed data

archive.

ESG (DOE, SciDAC): access to: data, information, models, analysis tools, and

computational resources

CURATOR (NSF, NASA. NOAA): capable infrastructure for Earth system research and operations

The ExArch back-story

Page 4: ExArch: Climate analytics on distributed exascale data archives Martin Juckes, V. Balaji, B.N. Lawrence, M. Lautenschlager, S. Denvil, G. Aloisio, P. Kushner,

Model development CMIP3 Model development CMIP5

Evaluation

Downscaling

Impacts models

IPCC 4th Assessment Report (2005-7)

IPCC 5th Assessment Report (2012-14)

Evaluation

Downscaling

Impacts models

The climate assessment process

Page 5: ExArch: Climate analytics on distributed exascale data archives Martin Juckes, V. Balaji, B.N. Lawrence, M. Lautenschlager, S. Denvil, G. Aloisio, P. Kushner,

CMIP5 and AR5: a brief organisational overview

Experiment design coordinated by Karl Taylor at PCMDI on behalf of WCRP

CMIP5 archive

A globally distributed archive, a federation of data centres and modelling centres will host the archive

Access to the global archive expected in April

Page 6: ExArch: Climate analytics on distributed exascale data archives Martin Juckes, V. Balaji, B.N. Lawrence, M. Lautenschlager, S. Denvil, G. Aloisio, P. Kushner,

A schematic roadmap

Page 7: ExArch: Climate analytics on distributed exascale data archives Martin Juckes, V. Balaji, B.N. Lawrence, M. Lautenschlager, S. Denvil, G. Aloisio, P. Kushner,

Improving precision

Image STS41B-41-2347 was provided by the Earth Sciences and

Image Analysis Laboratory at NASA's Johnson Space Center.

NASA Graphics

Page 8: ExArch: Climate analytics on distributed exascale data archives Martin Juckes, V. Balaji, B.N. Lawrence, M. Lautenschlager, S. Denvil, G. Aloisio, P. Kushner,

Fig 1 of Delaney and Barga (2010)

Dealing with complexity

Improved accuracy requires representing a vast range of processes:complexity in model design;complexity in experimental design.

Page 9: ExArch: Climate analytics on distributed exascale data archives Martin Juckes, V. Balaji, B.N. Lawrence, M. Lautenschlager, S. Denvil, G. Aloisio, P. Kushner,

Sampling uncertainty•Uncertainty in initial conditions– We do not know the state of the ocean to sufficient accuracy•Uncertainty from natural variability – Some components of the system are chaotic, we can never predict what state they will be in beyond O(10 days)•Uncertainty in model design and configuration – Many choices need to be made to create a computable system. The multi-model ensemble is a “multi-metric, distance optimising Monte-Carlo approach”, where each major design decision is a trade-off between adopting approaches used by others or exploring new territory.

Page 10: ExArch: Climate analytics on distributed exascale data archives Martin Juckes, V. Balaji, B.N. Lawrence, M. Lautenschlager, S. Denvil, G. Aloisio, P. Kushner,

Overpeck et al. (2011), Science.

Page 11: ExArch: Climate analytics on distributed exascale data archives Martin Juckes, V. Balaji, B.N. Lawrence, M. Lautenschlager, S. Denvil, G. Aloisio, P. Kushner,

CMIP5 CMIP6 CMIP7Year 2012 2017 2022Power factor 1 30 1000

200 357 647Resolution [km] 100 56 31Number of mesh points [millions] 3.2 18.1 108.4Ensemble size 200 357 647Number of variables 800 1068 1439Interval of 3-dimensional output (hours) 6 4 3Years simulated 90000 120170 161898Storage density 0.00002 0.00002 0.00002Archive size (Pb) (atmosphere) 5.31 143.42 3766.99

Npp

Npp

= Number of mesh points pole to pole

Ng = Total number of spatial mesh points = O(N

pp3)

Nv = Number of variables ~ √N

pp

Ne = Ensemble size ~ N

pp

Nt = Time steps per simulated year ~ N

pp

Ny = Years simulated per intercomparison ~ √N

pp.

Cost ~ Npp

6

O(40) decrease in storage density needed to bring this

estimate in line with Overpeck et al.

Page 12: ExArch: Climate analytics on distributed exascale data archives Martin Juckes, V. Balaji, B.N. Lawrence, M. Lautenschlager, S. Denvil, G. Aloisio, P. Kushner,

Some trends – ballpark figures

Change per year Change per decade

Data centre storage +60%

Data centre energy use +25%

Energy use/unit capacity -22% 10-fold decrease

2010 2020

Purchase cost/Tb 200 USD 3 USD

Operating power 10 W/Tb 1W/Tb

Electricity cost (UK) 90 GBP/MWh 120 GBP/MWh

Cost of 1Tb* 3 years 200 + 46 3 + 6

Size at constant funding 1Pb 27Pb

Analysis by Kryder and Soo Kim (2009) suggests hard drives will not be replaced by solid state or other new media before 2020.

Page 13: ExArch: Climate analytics on distributed exascale data archives Martin Juckes, V. Balaji, B.N. Lawrence, M. Lautenschlager, S. Denvil, G. Aloisio, P. Kushner,

Strategy InformaticsClimate Science

•2 Workshops•Interactions with GCOS, ESA and NASA•Governance structures•Accessibility

•Software management•Robust metadata•Query management•Near archive processing

•Quality assurance•Climate science diagnostics

Work Packages

Page 14: ExArch: Climate analytics on distributed exascale data archives Martin Juckes, V. Balaji, B.N. Lawrence, M. Lautenschlager, S. Denvil, G. Aloisio, P. Kushner,

ProvenanceDealing with thousands of experiments with complex models in multiple configurations – need comprehensive and machine readable documentation of model configuration.

Accurate metadataAs meta-data becomes more complex, we need to develop systems to ensure that it is accurate:Downstream: auto-generate meta-data from model configurationsUpstream: prescribe meta-data and auto-generate model configurations

Page 15: ExArch: Climate analytics on distributed exascale data archives Martin Juckes, V. Balaji, B.N. Lawrence, M. Lautenschlager, S. Denvil, G. Aloisio, P. Kushner,

Query syntax

The query syntax is a key element of the communication between system components.

Resource managementBringing calculations to an exa-scale data archive is going to create a significant computational demand.

Page 16: ExArch: Climate analytics on distributed exascale data archives Martin Juckes, V. Balaji, B.N. Lawrence, M. Lautenschlager, S. Denvil, G. Aloisio, P. Kushner,

Many to many data distribution

Can we create a bit-torrent for data with access control?

Security, transparency, inter-operability

Particularly inter-operability with observational climate archives. Need to support multiple access routes:•OpenDAP – for climate scientists•OGC•Browser interface

Page 17: ExArch: Climate analytics on distributed exascale data archives Martin Juckes, V. Balaji, B.N. Lawrence, M. Lautenschlager, S. Denvil, G. Aloisio, P. Kushner,

Quality control

Standard diagnostics

Structured queriesE.g. Ensemble mean of all models with dynamic* vegetation.

*: “dynamic” vegetation in the climate modelling community refers to temporally evolving vegetation, as opposed to prescribed fixed vegetation.

Many climate data users want to look at standard indices (e.g. strength of El Nino) rather than the raw data.

With millions of files, quality control is very important. With hundreds of different variables it is also very hard.

Page 18: ExArch: Climate analytics on distributed exascale data archives Martin Juckes, V. Balaji, B.N. Lawrence, M. Lautenschlager, S. Denvil, G. Aloisio, P. Kushner,

Themes: Taking the computation to the dataThe Climate Data Operators (CDO) library is maintained by DKRZ.Designed to act on files containing gridded spatial data, exploiting variable and units naming conventions.Over 400 operators (cf Gray's law start with 20).

Objectives:(1) allow preliminary analysis to select data (e.g. days with cyclones affecting US mainland) to be carried out at the archive;(2) allow data reduction operations (e.g. area mean) to be carried out at the archive.

Later: take the diagnostic computation into the run-time???

Page 19: ExArch: Climate analytics on distributed exascale data archives Martin Juckes, V. Balaji, B.N. Lawrence, M. Lautenschlager, S. Denvil, G. Aloisio, P. Kushner,

Degrees of closeness to the data

In the room

Limited CPU resource

IntercontinentalRegional, on

backboneOn site (inside

firewall)

May be necessary if

using multiple archives

General purpose

resources

Restricted access rights

Page 20: ExArch: Climate analytics on distributed exascale data archives Martin Juckes, V. Balaji, B.N. Lawrence, M. Lautenschlager, S. Denvil, G. Aloisio, P. Kushner,

Themes: Governance and communication (brain to brain)

Efficient exploitation of complex archives is dependent on a range of standards with various governance procedures. Will these standards respond to the demands of exa-scale systems?

ExArch will: (1) support the extension of the METAFOR Common Information Model to cover regional climate models; (2) explore methods of ensuring validity of model documentation; (3) engage with the NetCDF CF conventions governance process; (4) encourage early planning by sharing documents and interlinking milestones, discourage late planning by getting everyone round the table (doesn't scale well).

Page 21: ExArch: Climate analytics on distributed exascale data archives Martin Juckes, V. Balaji, B.N. Lawrence, M. Lautenschlager, S. Denvil, G. Aloisio, P. Kushner,

The end