Data Preservation at the Exa-Scale and Beyond

Challenges of the Next Decade(s)

Jamie.Shiers@cern.ch, APA Conference, Brussels, October 2014

The Story So Far…

• Together, we have reached the point where a generic, multi-disciplinary, scalable e-infrastructure for long-term data preservation (LTDP) is achievable – and will hopefully be funded

• Built on standards, certified via agreed procedures, using the “Cream of DP services”

• In parallel, Business Cases and Cost Models are increasingly understood, working closely with Projects, Communities and Funding Agencies

Topic For RDA-4 Joint DP W/S

• The high-level Use Cases we are being required to address by FAs are:

1. Open Access (specific samples / purposes);
2. Reproducibility (of data, results, publications);
3. Provision of Data Management Plan(s).

• AFAIK, these “requirements” are not specific to a given community, i.e. it is ALL disciplines funded by a given FA that must address these

Open Questions

• Long-term sustainability is still a technical issue
– Let’s assume that we understand the Business Cases & Cost Models well enough…
– And (we) even have agreed funding for key aspects

• But can the service providers guarantee a multi-decade service?
– Is this realistic?
– Is this even desirable?

• I will address these issues at the APA conference next month in Brussels – with a proposal for “a solution”

Background

• 20 years ago – in 1994 – the first Computing R&D projects for the LHC were proposed
– About 10 years before the expected startup date

• History shows that these projects didn’t start too early – even including the LHC startup delays

• We now foresee “next generation” data factories in the 2020s and beyond

• These will generate Exabytes (e.g. HL-LHC) to Zettabytes (e.g. FCC, SKA) of data and last decades

Technology(?)

• Of course, in 1-2 decades we can expect huge advances in technology

• At least some of these changes are likely to be disruptive – just look back!

• But you cannot plan based on the unknown

• Eventually, you will have to make decisions based either on what exists, or on what you can be confident will be delivered, on the needed timescale

Major changes in technology are likely during the active life of current / future projects

H2020 EINFRA-1-2014

Managing, preserving and computing with big research data

7) Proof of concept and prototypes of data infrastructure-enabling software (e.g. for databases and data mining) for extremely large or highly heterogeneous data sets scaling to zetabytes and trillion of objects.

Clean slate approaches to data management targeting 2020+ 'data factory' requirements of research communities and large scale facilities (e.g. ESFRI projects) are encouraged

Lunatic Fringe

• But this is clearly the lunatic fringe. What exactly does it have to do with me?

• Quite a lot: implications for efforts such as
– CTRUST / RDA Certification Interest Group
    • Can current certification procedures “scale” to such massive data volumes?
    • Can multi-site requirements be addressed?
– RDA Active Data Management Plans
– 4C: Costs of Exa / Zetta scale curation must clearly be well understood and justified
– RDA Reproducibility Interest Group (and many others)
– DPINFRA: next generation requirements
– [ Preservation VRE: some aspects ~independent of total data volume, some not ]
– APA CoE
– …

Significant economies of scale in “shared bit repositories”

Suppose these guys can build / share the most cost effective, scalable and reliable federated storage services, e.g. for peta- / exa- / zetta- scale bit preservation? Can we ignore them?

Next Generation Data Factories

• HL-LHC (https://indico.cern.ch/category/4863/)
– Europe’s top priority should be the exploitation of the full potential of the LHC, including the high-luminosity upgrade of the machine and detectors with a view to collecting ten times more data than in the initial design, by around 2030
– (European Strategy for Particle Physics)

• SKA
– The Square Kilometre Array (SKA) project is an international effort to build the world’s largest radio telescope, with a square kilometre (one million square metres) of collecting area

Typified by SCALE in several dimensions:
– Cost; longevity; data rates & volumes
– Last decades; cost O(EUR 10^9); EB / ZB data volumes

http://science.energy.gov/funding-opportunities/digital-data-management/

• “The focus of this statement is sharing and preservation of digital research data”

• All proposals submitted to the Office of Science (after 1 October 2014) for research funding must include a Data Management Plan (DMP) that addresses the following requirements:

1. DMPs should describe whether and how data generated in the course of the proposed research will be shared and preserved.

If the plan is not to share and/or preserve certain data, then the plan must explain the basis of the decision (for example, cost/benefit considerations, other parameters of feasibility, scientific appropriateness, or limitations discussed in #4).

At a minimum, DMPs must describe how data sharing and preservation will enable validation of results, or how results could be validated if data are not shared or preserved.

LHC experiments increasingly talking about:
1. Open Access for Outreach;
2. Reproducibility of Results.

These are becoming mandatory activities, fully supported at all levels of the Collaborations

Predrag Buncic, October 3, 2013, ECFA Workshop Aix-Les-Bains

Computing at the HL-LHC (~2025+)

Predrag Buncic, on behalf of the Trigger/DAQ/Offline/Computing Preparatory Group

ALICE: Pierre Vande Vyvre, Thorsten Kollegger, Predrag Buncic; ATLAS: David Rousseau, Benedetto Gorini, Nikos Konstantinidis; CMS: Wesley Smith, Christoph Schwick, Ian Fisk, Peter Elmer ; LHCb: Renaud Legac, Niko Neufeld

ATLAS & CMS @ HL-LHC

[Diagram: two trigger chains, each Level 1 → HLT (AKA “Filters”) → Storage. Peak output: 5-10 kHz (2MB/event), 10-20 GB/s; 10 kHz (4MB/event), 40 GB/s]

Humungous Data Rates: Not Relevant for LT DP
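
The peak output figures quoted on this slide follow from event rate times event size. A minimal sketch of that arithmetic, assuming decimal units (1 MB = 10^6 bytes, 1 GB = 10^9 bytes); the function name is illustrative and not taken from the original talk:

```python
def output_rate_gb_per_s(event_rate_khz: float, event_size_mb: float) -> float:
    """Peak output in GB/s for a given event rate (kHz) and event size (MB)."""
    bytes_per_second = (event_rate_khz * 1e3) * (event_size_mb * 1e6)
    return bytes_per_second / 1e9


# The two cases quoted on the slide: 5-10 kHz at 2 MB/event, 10 kHz at 4 MB/event.
low = output_rate_gb_per_s(5, 2)
high = output_rate_gb_per_s(10, 2)
other = output_rate_gb_per_s(10, 4)
print(f"2 MB/event at 5-10 kHz: {low:.0f}-{high:.0f} GB/s")
print(f"4 MB/event at 10 kHz:   {other:.0f} GB/s")
```

These are rates out of the trigger ("filter") chain; what matters for long-term preservation is the volume that ends up in storage, not the peak rate.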

Data: Outlook for HL-LHC

• Very rough estimate of new RAW data per year of running, using a simple extrapolation of current data volume scaled by the output rates.
• To be added: derived data (ESD, AOD), simulation, user data…

At least 0.5 EB / year (x 10 years of data taking)

[Chart: RAW data volume in PB (axis 0 to 450) for Run 1, Run 2, Run 3 and Run 4, by experiment: CMS, ATLAS, ALICE, LHCb. Annotation: “We are here!”]

Data storage issues

• Our data problems may still look small on the scale of storage needs of internet giants

• Business e-mail, video, music, smartphones, digital cameras generate more and more need for storage

• The cost of storage will probably continue to go down but…

• Commodity high capacity disks may start to look more like tapes, optimized for multimedia storage, sequential access

• Need to be combined with flash memory disks for fast random access

• The residual cost of disk servers will remain

• While we might be able to write all this data, how long will it take to read it back? Need for sophisticated parallel I/O and processing.

+ We have to store this amount of data every year and for many years to come (Long Term Data Preservation)
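
To give a feel for the read-back question, here is a rough sketch that estimates how long a single sequential pass over one year of data would take. The 0.5 EB volume is the yearly estimate quoted in this talk; the aggregate read rates are assumptions chosen for illustration (40 GB/s simply reuses the peak output figure quoted earlier), and decimal units are assumed throughout.

```python
SECONDS_PER_DAY = 86_400


def read_back_days(volume_eb: float, rate_gb_per_s: float) -> float:
    """Days needed to read `volume_eb` exabytes at a sustained aggregate rate."""
    volume_bytes = volume_eb * 1e18          # 1 EB = 10^18 bytes (decimal)
    rate_bytes_per_s = rate_gb_per_s * 1e9   # 1 GB = 10^9 bytes (decimal)
    return volume_bytes / rate_bytes_per_s / SECONDS_PER_DAY


# Hypothetical sustained read rates; only the 0.5 EB volume comes from the slides.
for rate in (10, 40, 400):
    print(f"{rate:>4} GB/s -> {read_back_days(0.5, rate):7.1f} days")
```

Even at a sustained rate equal to the quoted peak output, one pass over a single year of data takes months, which is why parallel I/O and processing are flagged as a need.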

WLCG Collaboration Today

• Distributed infrastructure of 150 computing centers in 40 countries
• 300+ k CPU cores (~ 2M HEP-SPEC-06)
• The biggest site with ~50k CPU cores, 12 T2 with 2-30k CPU cores
• Distributed data, services and operation infrastructure

WLCG Collaboration Tomorrow

• How will this evolve to HL-LHC needs?
• To what extent is it applicable to other comparable scale projects?
• Already evolving, most significantly during Long Shutdowns, but also during data taking!

Today’s state of the art in 0.1EB scale bit preservation (or “exabit”)

Bit-preservation WG one-slider

• Mandate summary (see w3.hepix.org/bit-preservation)
– Collecting and sharing knowledge on bit preservation across HEP (and beyond)
– Provide technical advice to
– Recommendations for sustainable archival storage in HEP

• Survey on large HEP archive sites carried out and presented at last HEPiX
– 19 sites; areas such as archive lifetime, reliability, access, verification, migration
– HEP archiving has become a reality by fact rather than by design
– Overall positive, but lack of SLAs, metrics, best practices, and long-term costing impact

Ongoing Work

Two work areas:

1. Preparing a set of best-practice recommendations for bit-level preservation within HEP
– ~10 recommendations
– Concentrate more on “what” rather than “how” to do
– Will be circulated to WG participants and surveyed sites summer time
– Feedback will be most appreciated

2. Defining a simple and customisable model to help establish the long-term cost of bit-level preservation
– Useful for site planning/outlook
– Input for DPHEP – significant fraction of overall Data Preservation cost!
– The rest of this presentation

Verification & reliability

• Systematic verification of archive data ongoing
– “Cold” archive: users only accessed ~20% of the data (2013)
– All “historic” data verified between 2010-2013
– All new and repacked data being verified as well

• Data reliability significantly improved over last 5 years
– From annual bit loss rates of O(10^-12) (2009) to O(10^-16) (2012)
– New drive generations + less strain (HSM mounts, TM “hitchback”) + verification
– Differences between vendors getting small

• Still, room for improvement
– Vendor quoted bit error rates: O(10^-19..-20)
– But, these only refer to media failures
– Errors (e.g. bit flips) appearing in complete chain

~35 PB verified in 2014

No losses
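
To put the quoted loss rates in perspective, the sketch below converts an annual bit loss rate into an expected number of lost bits per year. It assumes the rate is the fraction of stored bits lost per year and uses a 100 PB archive (the "0.1 EB scale" mentioned earlier in this talk) as an illustrative size; both are assumptions, and the result is order-of-magnitude only.

```python
BITS_PER_PB = 8e15  # 1 PB = 10^15 bytes = 8 * 10^15 bits (decimal units)


def expected_bits_lost_per_year(archive_pb: float, annual_bit_loss_rate: float) -> float:
    """Expected lost bits per year, reading the rate as 'fraction of bits lost per year'."""
    return archive_pb * BITS_PER_PB * annual_bit_loss_rate


ARCHIVE_PB = 100  # assumed archive size, not a figure from this slide
for year, rate in ((2009, 1e-12), (2012, 1e-16)):
    lost = expected_bits_lost_per_year(ARCHIVE_PB, rate)
    print(f"{year}: rate {rate:.0e} -> about {lost:.0f} bits lost per year")
```

Under these assumptions the improvement from O(10^-12) to O(10^-16) is the difference between hundreds of thousands of lost bits per year and a few tens.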

“LHC Cost Model” (simplified)

Start with 10 PB, then +50 PB/year, then +50% every 3 years (or +15% / year)

[Chart: archive volume over time; axis labels 1 EB and 10 EB]
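
The growth rule above can be sketched as a small projection. The slide does not say when the flat +50 PB/year phase ends, nor exactly what grows by 15%; the sketch assumes the yearly increment stays flat for `flat_years` (a hypothetical parameter) and then grows by 15% per year. It also checks that +50% every 3 years is close to +15% per year.

```python
def equivalent_annual_rate(step_growth: float, step_years: int) -> float:
    """Annual growth rate that compounds to `step_growth` over `step_years`."""
    return (1.0 + step_growth) ** (1.0 / step_years) - 1.0


def project_archive(n_years: int, start_pb: float = 10.0,
                    first_increment_pb: float = 50.0,
                    flat_years: int = 3, annual_growth: float = 0.15) -> list:
    """Archive size in PB at the end of each year (index 0 is the starting size).

    Assumption: the yearly increment is constant for `flat_years`, then
    grows by `annual_growth` each year. The original model may differ.
    """
    totals = [start_pb]
    increment = first_increment_pb
    for year in range(1, n_years + 1):
        if year > flat_years:
            increment *= 1.0 + annual_growth
        totals.append(totals[-1] + increment)
    return totals


print(f"+50% every 3 years is about {equivalent_annual_rate(0.5, 3):.1%} per year")
totals = project_archive(30)
first_eb_year = next(y for y, pb in enumerate(totals) if pb >= 1000)
print(f"Under these assumptions the archive passes 1 EB in year {first_eb_year}")
```

Changing `flat_years` shifts the crossing points, so the output should be read as a shape, not a forecast.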

Case B) increasing archive growth

Total cost: ~$59.9M (~$2M / year)

From Petabytes to Exabytes

• Can the current computing and data management models scale by orders of magnitude?

• We cannot simply “scale out” in terms of number of sites and need much greater resilience against data loss / corruption, including (semi-)automated recovery + support for adding / removing sites

• Today, this is often done by the experiments: how will this work after data taking stops? How will we cope when (not if) sites no longer “support” a given experiment?
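
One ingredient of (semi-)automated recovery across sites is deciding which replica of a file is damaged. The sketch below is one possible approach, not something described in the talk: it takes a checksum per site, treats the strict-majority value as correct, and lists the sites whose copy needs repair. Site names and the function name are hypothetical.

```python
from collections import Counter


def replicas_needing_repair(checksums_by_site: dict) -> tuple:
    """Return (consensus checksum, sorted list of sites whose replica disagrees).

    Raises ValueError when there is no strict majority, in which case
    a human (or a stronger source of truth) has to decide.
    """
    if not checksums_by_site:
        raise ValueError("no replicas reported")
    counts = Counter(checksums_by_site.values())
    consensus, votes = counts.most_common(1)[0]
    if votes * 2 <= len(checksums_by_site):
        raise ValueError("no majority checksum; manual intervention needed")
    bad_sites = sorted(site for site, value in checksums_by_site.items()
                       if value != consensus)
    return consensus, bad_sites


# Hypothetical report for one file held at three sites.
report = {"site_a": "9f2c", "site_b": "9f2c", "site_c": "0000"}
print(replicas_needing_repair(report))
```

Because the input is just a mapping of site to checksum, adding or removing a site changes the data, not the logic; the open question raised on this slide is who runs such checks once the experiment no longer does.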

History shows that we will need many years of R&D to reach a new scale.

Not all paths will be successful, but we cannot postpone starting, as the whole process, including the necessary service hardening, will take many years. (Decade + ?)

Future Circular Collider (FCC)

Science case
Convince me that this project is scientifically excellent

Project Plan
Convince me that you know what you are doing: scope, costs and schedule are under control

“Business case”
Convince me that this is a good use of public money

What did the Tevatron@Fermilab cost?

• Tevatron accelerator
– $120M (1983) = $277M (2012 $)

• Main Injector project
– $290M (1994) = $450M (2012 $)

• Detectors and upgrades
– Guess: 2 x $500M (collider detectors) + $300M (FT)

• Operations
– Say 20 years at $100M/year = $2 billion

• Total cost = $4 billion

PhD Student Training

• Value of a PhD student
– $2.2M (US Census Bureau, 2002) = $2.8M (2012 $)

• Number of students trained at the Tevatron
– 904 (CDF + DØ)
– 492 (Fixed Target)
– 18 (Smaller Collider experiments)
– 1414 total

• Financial Impact = $3.96 billion
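
The student total and the financial impact on this slide can be reproduced directly from the quoted figures. This is only a check of the slide's arithmetic; the variable names are illustrative.

```python
# Students trained at the Tevatron, as listed on the slide.
students = {
    "CDF + DØ": 904,
    "Fixed Target": 492,
    "Smaller Collider experiments": 18,
}
VALUE_PER_PHD_USD = 2.8e6  # quoted value of a PhD student in 2012 dollars

total_students = sum(students.values())
impact_usd = total_students * VALUE_PER_PHD_USD

print(f"Students trained: {total_students}")
print(f"Financial impact: ${impact_usd / 1e9:.2f} billion")
```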

Superconducting Magnets

• Current value of SC Magnet Industry
– $1.5 Billion p.a.

• Value of MRI industry (the major customer for SC magnets)
– $5 Billion p.a.

• This industry would probably have succeeded anyway – what we can realistically claim is that the large scale investment in this technology at the Tevatron significantly accelerated its development
– Guess – one to two years faster than otherwise?

• Financial Impact = $5-10 billion

Balance sheet

• 20 year investment in Tevatron ~ $4B

• Students $4B
• Magnets and MRI $5-10B
• Computing $40B
Returns (students, magnets and MRI, computing): ~ $50B total

Very rough calculation – but confirms our gut feeling that investment in fundamental science pays off

I think there is an opportunity for someone to repeat this exercise more rigorously
cf. STFC study of SRS Impact: http://www.stfc.ac.uk/2428.aspx
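
The balance sheet can be re-added from the figures quoted on the Tevatron slides. The sketch below sums the cost items as given (mixing 2012-dollar values with the rough guesses, exactly as the slides do) and adds up the three return items as low and high estimates. The grouping of returns is my reading of the slide layout, and the names are illustrative.

```python
# Cost items in millions of dollars, as quoted on the cost slide.
cost_musd = {
    "Tevatron accelerator (2012 $)": 277,
    "Main Injector project (2012 $)": 450,
    "Detectors and upgrades (guess)": 2 * 500 + 300,
    "Operations (20 years at $100M/year)": 20 * 100,
}
total_cost_musd = sum(cost_musd.values())

# Return items in billions of dollars as (low, high) estimates.
returns_busd = {
    "Students": (4, 4),
    "Magnets and MRI": (5, 10),
    "Computing": (40, 40),
}
low = sum(lo for lo, _ in returns_busd.values())
high = sum(hi for _, hi in returns_busd.values())

print(f"Total cost: ${total_cost_musd / 1000:.1f}B")
print(f"Total returns: ${low}-{high}B")
```

The sums land on roughly $4B of investment against roughly $50B of returns, matching the rounded figures on the slide.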

We have a good song to sing in terms of the scientific, economic and cultural benefits of these next generation data factories.

Data sharing, Reproducibility and Measurable Data Management Plans are going to be key.

Certification

• Next generation data factories will bring new requirements in terms of certification

• Multi-site certification can be expected to be core

• Today’s “best practices” will need to be extended – possibly rethought for this new scale

• Room for collaboration with peta- / exa-scale practitioners, e.g. those from HEPiX WG + RDA IG ???

Push key storage sites to pursue Certification in a coordinated fashion

Data Management Plans

• Often these are “static” – revised at best every few years (and hence typically out of date with reality) – e.g. WLCG Technical Design Report

• Can we switch to a “dashboard mode”, whereby the current reality can be viewed, with the appropriate level of detail, through a portal?

• This is something that could “come naturally”, combining existing displays from data scrubbing, migration, caching and replication with Reproducibility & Outreach views: Tabs for Experts, FAs & GP
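
As a rough illustration of the "dashboard mode" idea, the sketch below models each indicator with the audiences allowed to see it and builds one tab per audience. Everything here is hypothetical: the indicator names, the placeholder values and the reading of "FAs & GP" as funding agencies and general public are assumptions, not part of the talk.

```python
from dataclasses import dataclass

AUDIENCES = ("experts", "funding agencies", "general public")


@dataclass(frozen=True)
class Indicator:
    name: str         # what is being reported
    value: str        # placeholder; a real portal would pull this from a live display
    audiences: tuple  # which tabs show this indicator


def tab_for(indicators, audience: str) -> dict:
    """Select the indicators visible on the tab for one audience."""
    if audience not in AUDIENCES:
        raise ValueError(f"unknown audience: {audience}")
    return {i.name: i.value for i in indicators if audience in i.audiences}


# Placeholder entries only; no real measurements are implied.
indicators = [
    Indicator("scrubbing: data verified", "<from scrubbing display>",
              ("experts", "funding agencies")),
    Indicator("migration: media generation", "<from migration display>",
              ("experts",)),
    Indicator("open access: samples released", "<from outreach view>",
              AUDIENCES),
]

for audience in AUDIENCES:
    print(audience, "->", list(tab_for(indicators, audience)))
```

The point of the design is that the plan is a filtered view over live operational data, so it cannot drift out of date the way a static document does.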

We’re moving towards capturing the analysis environment so that Reproducibility is part of the Approval Process for Publication!

CERN aims for 100% Gold Open Access for all its original HEP results, experimental and theoretical, by end 2016.

Costs of Curation

• Given the scale, duration and expected costs of future generation data factories, a clear understanding of the costs and benefits of curation must be built in.

• The costs of “bit preservation” can clearly be reduced through economies of scale, but then not much further.
– Is there any other way than “state of the art”?
– Around $1M/year/EB in 2040+ !!!

The real issues relate to manpower intensive areas, such as knowledge capture and the ability to fully re-use the data in the long-term.

Reproducibility

• It is exciting to see such key issues being addressed from “grass root” initiatives, such as the recent RDA BoF in this area, with many experts involved!

– Leading hopefully to an Interest Group and concrete outcomes

– Maybe a “specific” call once mature?

• We have much to learn by sharing expertise and not repeatedly re-inventing wheels…

http://science.energy.gov/funding-opportunities/digital-data-management/

• “The focus of this statement is sharing and preservation of digital research data”

• All proposals submitted to the Office of Science (after 1 October 2014) for research funding must include a Data Management Plan (DMP) that addresses the following requirements:

1. DMPs should describe whether and how data generated in the course of the proposed research will be shared and preserved.

If the plan is not to share and/or preserve certain data, then the plan must explain the basis of the decision (for example, cost/benefit considerations, other parameters of feasibility, scientific appropriateness, or limitations discussed in #4).

At a minimum, DMPs must describe how data sharing and preservation will enable validation of results, or how results could be validated if data are not shared or preserved.

Surely we can address these generic (scientific) requirements together, using at least some common services:

SCIDIP-ES outputs, CernVM[FS], Zenodo / Invenio, …

A joint VRE (R&D) proposal?

2020 Vision for LT DP in HEP

• Long-term – e.g. FCC timescales: disruptive change

– By 2020, all archived data – e.g. that described in DPHEP Blueprint, including LHC data – easily findable, fully usable by designated communities with clear (Open) access policies and possibilities to annotate further

– Best practices, tools and services well run-in, fully documented and sustainable; built in common with other disciplines, based on standards

– DPHEP portal, through which data / tools accessed “HEP FAIRport”: Findable, Accessible, Interoperable, Re-usable

Agree with Funding Agencies clear targets & metrics

Summary

• Next generation data factories will bring with them many challenges for computing, networking and storage

• Data Preservation – and management in general – will be key to their success and must be an integral part of the projects: not an afterthought

• We need to start a range of R&D activities now: these can bring tangible benefits to existing projects in addition to preparing us for the future
