
Data Handling for LHC: Plans and Reality


Page 1: Data Handling for LHC: Plans and Reality

Data Handling for LHC: Plans and Reality

Tony Cass, Leader, Database Services Group

Information Technology Department

11th July 2012

Page 2: Data Handling for LHC: Plans and Reality

Outline

• HEP, CERN, LHC and LHC Experiments
• LHC Computing Challenge
• The Technique
  – In outline
  – In more detail
• Towards the Future
• Summary


Page 4: Data Handling for LHC: Plans and Reality


Page 5: Data Handling for LHC: Plans and Reality

We are looking for rare events!

number of events = Luminosity × Cross section
2010 Luminosity: 45 pb⁻¹

• Higgs (mH = 120 GeV): 17 pb → ~750 events
• Total: 70 billion pb → ~3 trillion events!** (**N.B. only a very small fraction saved!)

e.g. potentially ~1 Higgs in every 300 billion interactions!

~250x more events to date

(Slide courtesy of Emily Nurse, ATLAS)
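Spelling out the slide's arithmetic (a worked instance of the event-count formula, using the 2010 luminosity quoted above):

```latex
N = \mathcal{L}\,\sigma:\quad
N_{\mathrm{Higgs}} \approx 45\,\mathrm{pb}^{-1} \times 17\,\mathrm{pb} \approx 750~\text{events},\quad
N_{\mathrm{total}} \approx 45\,\mathrm{pb}^{-1} \times 7\times 10^{10}\,\mathrm{pb} \approx 3\times 10^{12}~\text{events}.
```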

Page 6: Data Handling for LHC: Plans and Reality


So the four LHC Experiments…

Page 7: Data Handling for LHC: Plans and Reality


… generate lots of data …

The accelerator generates 40 million particle collisions (events) every second at the centre of each of the four experiments’ detectors

Page 8: Data Handling for LHC: Plans and Reality

… generate lots of data …
reduced by online computers to a few hundred “good” events per second,
which are recorded on disk and magnetic tape at 100-1,000 MegaBytes/sec
~15 PetaBytes per year for all four experiments
• Current forecast ~23-25 PB / year, 100-120M files / year
  – ~20-25K 1 TB tapes / year
• Archive will need to store 0.1 EB in 2014, ~1 billion files in 2015

[Chart: CASTOR data written, 01/01/2010 to 29/6/2012, in PB, by experiment: ALICE, AMS, ATLAS, CMS, COMPASS, LHCB, NA48, NA61, NTOF, USER]

[Image: ATLAS Z→μμ event from 2012 data with 25 reconstructed vertices]
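A back-of-the-envelope check of these rates. The ~300 Hz recorded rate and ~1 MB/event size are assumptions chosen to sit in the quoted ranges, and 10⁷ live seconds is a typical LHC year; none of these are figures from the slide:

```python
collision_rate_hz = 40e6            # collisions/s at each detector
recorded_rate_hz = 300              # "good" events/s kept by online computers
event_size_mb = 1.0                 # assumed average event size

rejection = collision_rate_hz / recorded_rate_hz
rate_mb_s = recorded_rate_hz * event_size_mb
volume_pb = rate_mb_s * 1e7 / 1e9   # MB over a year's live time -> PB

print(f"online selection: 1 event in {rejection:,.0f} kept")
print(f"recording rate:  {rate_mb_s:.0f} MB/s per experiment")
print(f"annual volume:   ~{volume_pb:.0f} PB per experiment")
```

With these assumptions each experiment records a few PB per year, consistent with the ~15 PB/year quoted for all four experiments together.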

Page 9: Data Handling for LHC: Plans and Reality

Outline

• HEP, CERN, LHC and LHC Experiments
• LHC Computing Challenge
• The Technique
  – In outline
  – In more detail
• Towards the Future
• Summary

Page 10: Data Handling for LHC: Plans and Reality

What is the technique?
Break up a Massive Data Set …

Page 11: Data Handling for LHC: Plans and Reality

What is the technique?
… into lots of small pieces and distribute them around the world …

Page 12: Data Handling for LHC: Plans and Reality

What is the technique?
… analyse in parallel …

Page 13: Data Handling for LHC: Plans and Reality

What is the technique?
… gather the results …

Page 14: Data Handling for LHC: Plans and Reality

What is the technique?
… and discover the Higgs boson:

Nice result, but… is it novel?

Page 15: Data Handling for LHC: Plans and Reality

Is it Novel?
Maybe not novel as such, but the implementation is Terrascale computing that is widely appreciated!

Page 16: Data Handling for LHC: Plans and Reality

Outline

• HEP, CERN, LHC and LHC Experiments
• LHC Computing Challenge
• The Technique
  – In outline
  – In more detail
• Towards the Future
• Summary

Page 17: Data Handling for LHC: Plans and Reality

The Grid
• Timely Technology!
• The WLCG project deployed to meet LHC computing needs.
• The EDG and EGEE projects organised development in Europe. (OSG and others in the US.)

Page 18: Data Handling for LHC: Plans and Reality

Grid Middleware Basics
• Compute Element
  – Standard interface to local workload management systems (batch scheduler)
• Storage Element
  – Standard interface to local mass storage systems
• Resource Broker
  – Tool to analyse user job requests (input data sets, CPU time, data output requirements) and route these to sites according to data and CPU time availability (a toy matching sketch follows below).

Many implementations of the basic principles: Globus, VDT, EDG/EGEE, NorduGrid, OSG
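To make the Resource Broker's matching step concrete, here is a toy sketch. The site names, numbers and ranking rule are invented for the example; real brokers (the EDG/gLite workload management system, for instance) are far richer:

```python
# Match a job's requirements (input dataset, CPU time) against site state
# and pick a destination site.
from __future__ import annotations
from dataclasses import dataclass

@dataclass
class Site:
    name: str
    datasets: set[str]       # datasets held by the site's Storage Element
    free_slots: int          # free batch slots behind the Compute Element
    max_cpu_hours: float     # longest job the local queues accept

@dataclass
class Job:
    input_dataset: str
    cpu_hours: float

def broker(job: Job, sites: list[Site]) -> Site | None:
    """Route the job to a site holding its input data with enough CPU."""
    candidates = [s for s in sites
                  if job.input_dataset in s.datasets
                  and s.max_cpu_hours >= job.cpu_hours]
    # Crude "rank" expression: prefer the site with the most free slots.
    return max(candidates, key=lambda s: s.free_slots, default=None)

sites = [Site("CERN", {"raw-2012A"}, 120, 48.0),
         Site("FNAL", {"raw-2012A", "mc-2012"}, 400, 24.0)]
print(broker(Job("raw-2012A", 12.0), sites).name)   # -> FNAL
```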

Page 19: Data Handling for LHC: Plans and Reality

Job Scheduling in Practice
• Issue
  – Grid sites generally want to maintain a high average CPU utilisation; easiest to do this if there is a local queue of work to select from when another job ends.
  – Users are generally interested in turnround times as well as job throughput. Turnround is reduced if jobs are held centrally until a processing slot is known to be free at a target site.
• Solution: Pilot job frameworks.
  – Per-experiment code submits a job which chooses a work unit to run from a per-experiment queue when it is allocated an execution slot at a site (see the sketch after this slide).
• Pilot job frameworks separate out
  – site responsibility for allocating CPU resources from
  – experiment responsibility for allocating priority between different research sub-groups.

… But note: Pilot job frameworks talk directly to the CEs and we have moved away from a generic solution to one that has a specific framework per VO (although these can be shared in principle).
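A minimal sketch of the pilot pattern described above, with invented task names and a stubbed environment check (the real per-VO frameworks include PanDA, DIRAC, AliEn and GlideinWMS):

```python
# The site's batch system runs the pilot; the actual work unit is bound
# late, pulled from the experiment's central queue only once the pilot
# holds an execution slot.
import queue

central_queue = queue.Queue()            # the per-experiment task queue
for task in ("reco-run195", "mc-gen-Zmumu", "user-ana-0042"):
    central_queue.put(task)

def environment_ok() -> bool:
    return True                          # e.g. software release, scratch space

def pilot() -> None:
    """Runs on a worker node once the site scheduler allocates a slot."""
    if not environment_ok():
        return                           # hand the slot back; no payload wasted
    while True:
        try:
            work_unit = central_queue.get_nowait()   # late binding of work
        except queue.Empty:
            return                       # queue drained: exit, free the slot
        print(f"running {work_unit}")    # stand-in for the real payload

pilot()
```

This is exactly the separation the slide describes: the site only sees the pilot and allocates it CPU, while the ordering of the central queue stays under the experiment's control.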

Page 20: Data Handling for LHC: Plans and Reality

Data Issues
• Reception and long-term storage
• Delivery for processing and export
• Distribution
• Metadata distribution

[Data-flow diagram with rates: 1430 MB/s, 700 MB/s, 2600 MB/s, 700 MB/s, 420 MB/s, (3600 MB/s), (>4000 MB/s)]

Scheduled work only – and we need ability to support 2x for recovery!

Page 21: Data Handling for LHC: Plans and Reality

(Mass) Storage Systems
• After evaluation of commercial alternatives in the late 1990s, two tape-capable mass storage systems have been developed for HEP:
  – CASTOR: an integrated mass storage system
  – dCache: a disk pool manager that interfaces to multiple tape archives (Enstore @ FNAL, IBM’s TSM)
• dCache is also used as a basic disk storage manager at Tier2s, along with the simpler DPM.

Page 22: Data Handling for LHC: Plans and Reality

A Word About Tape
• Our data set may be massive, but… it is made up of many small files…

[Histogram: CERN Archive file size distribution in %, binned from <10K to >2G]
~195MB average, only increasing slowly after LHC startup!

…which is bad for tape speeds:

[Chart: Drive write performance vs file size (MB), CASTOR tape format (ANSI AUL), for IBM AUL and SUN AUL drives, write speed in KB/s]
Average write drive speed: < 40MB/s (cf. native drive speeds: 120-160MB/s). Small increases with new drive generations.

Page 23: Data Handling for LHC: Plans and Reality

Tape Drive Efficiency
So we have to change tape writing policy…

[Chart: Drive write performance (MB/s) vs file size (MB), buffered vs non-buffered tape marks: CASTOR present (3 syncs/file), CASTOR new (1 sync/file), CASTOR future (1 sync / 4GB)]

[Chart: Average drive performance (MB/s) for CERN Archive files under the three policies: 3 syncs/file, 1 sync/file, 1 sync / 4GB]
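The effect is easy to see in a toy model: each file costs its streaming time plus a fixed pause per tape-mark sync, so the average ~195MB file is dominated by sync overhead unless syncs are amortised over many files. A sketch under assumed numbers (the 140 MB/s native speed and 3 s sync cost are illustrative, not CASTOR measurements):

```python
def effective_speed(file_mb: float, syncs_per_file: float,
                    native_mb_s: float = 140.0,
                    sync_cost_s: float = 3.0) -> float:
    # streaming time + per-file sync pauses -> achieved MB/s
    return file_mb / (file_mb / native_mb_s + syncs_per_file * sync_cost_s)

avg_file_mb = 195    # average CERN Archive file size, from the previous slide
for label, syncs in [("3 syncs/file", 3.0),
                     ("1 sync/file", 1.0),
                     ("1 sync / 4GB", avg_file_mb / 4096)]:
    print(f"{label:>13}: {effective_speed(avg_file_mb, syncs):6.1f} MB/s")
```

With these assumed numbers the three policies come out at roughly 19, 44 and 127 MB/s, reproducing the shape of the measurements above.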

Page 24: Data Handling for LHC: Plans and Reality

Storage vs Recall Efficiency
• Efficient data acceptance:
  – Have lots of input streams, spread across a number of storage servers,
  – wait until the storage servers are ~full, and
  – write the data from each storage server to tape.
  – Result: data recorded at the same time is scattered over many tapes.
• How is the data read back?
  – Generally, files grouped by time of creation.
  – How to optimise for this? Group files on to a small number of tapes.
• Ooops…

Page 25: Data Handling for LHC: Plans and Reality

Keep users away from tape

Page 26: Data Handling for LHC: Plans and Reality


CASTOR & EOS

Page 27: Data Handling for LHC: Plans and Reality

Data Distribution
• The LHC experiments need to distribute millions of files between the different sites.
• The File Transfer System automates this
  – handling failures of the underlying distribution technology (gridftp)
  – ensuring effective use of the bandwidth with multiple streams, and
  – managing the bandwidth use
    • ensuring ATLAS, say, is guaranteed 50% of the available bandwidth between two sites if there is data to transfer (a toy share-allocation sketch follows below).
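A toy sketch of that share management: divide the transfer slots on a site-to-site channel between VOs by configured share, re-normalising over VOs that actually have work queued. Shares, slot counts and VO names are invented; this illustrates the idea, not the algorithm FTS actually implements:

```python
def allocate_slots(total_slots: int, shares: dict[str, float],
                   queued: dict[str, int]) -> dict[str, int]:
    # Only VOs with queued transfers compete for slots.
    active = {vo: s for vo, s in shares.items() if queued.get(vo, 0) > 0}
    norm = sum(active.values())
    return {vo: min(round(total_slots * share / norm), queued[vo])
            for vo, share in active.items()}

shares = {"atlas": 0.5, "cms": 0.3, "lhcb": 0.2}
print(allocate_slots(20, shares, {"atlas": 100, "cms": 4}))
# -> {'atlas': 12, 'cms': 4}: with LHCb idle its share is redistributed,
#    and CMS cannot use more slots than it has queued transfers.
```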

Page 28: Data Handling for LHC: Plans and Reality

Data Distribution
• FTS uses the Storage Resource Manager as an abstract interface to the different storage systems
  – A Good Idea™ but this is not (IMHO) a complete storage abstraction layer and anyway cannot hide fundamental differences in approaches to MSS design
    • Lots of interest in the Amazon S3 interface these days; this doesn’t try to do as much as SRM, but HEP should try to adopt de facto standards.
• Once you have distributed the data, a file catalogue is needed to record which files are available where.
  – LFC, the LCG File Catalogue, was designed for this role as a distributed catalogue to avoid a single point of failure, but other solutions are also used
    • And as many other services rely on CERN, the need for a distributed catalogue is no longer (seen as…) so important.

Page 29: Data Handling for LHC: Plans and Reality

Looking more widely — I
• Only a small subset of the data distributed is actually used
• Experiments don’t know a priori which dataset will be popular
  – CMS has 8 orders of magnitude in access between most and least popular
⇒ Dynamic data replication: create copies of popular datasets at multiple sites (sketched below).
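A minimal sketch of popularity-driven replication: count accesses and add a replica each time a dataset crosses another popularity threshold. The threshold, replica cap, dataset and site names are all invented for the example:

```python
from collections import Counter

accesses = Counter()
replicas = {"higgs-cand-2012": {"CERN"}}
sites = ["CERN", "FNAL", "RAL", "KIT"]
POPULAR = 1000        # accesses per existing replica before adding another
MAX_REPLICAS = 3

def record_access(dataset: str) -> None:
    accesses[dataset] += 1
    held = replicas.setdefault(dataset, set())
    if len(held) < MAX_REPLICAS and accesses[dataset] >= POPULAR * len(held):
        target = next(s for s in sites if s not in held)
        held.add(target)              # stand-in for triggering a real transfer
        print(f"replicating {dataset} to {target}")

for _ in range(2500):
    record_access("higgs-cand-2012")
print(replicas["higgs-cand-2012"])    # -> replicas at CERN, FNAL and RAL
```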

Page 30: Data Handling for LHC: Plans and Reality

Looking more widely — II

[Diagram: the MONARC (2000) hierarchical model: Desktops, University centres (n.10⁶ MIPS, m TByte robot), FNAL (4.10⁷ MIPS, 110 TByte robot) and CERN (n.10⁷ MIPS, m PByte robot), interconnected by 622 Mbit/s links]

• Network capacity is readily available…
• … and it is reliable: a fibre cut during tests in 2009 reduced capacity, but alternative links took over.
• So let’s simply copy data from another site if it is not available locally
  – rather than recalling from tape or failing the job (see the sketch below).
• Inter-connectedness is increasing with the design of LHCOne to deliver (multi-) 10Gb links between Tier2s.
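A sketch of that read fallback, with invented helper stubs standing in for local reads, WAN copies and tape recall:

```python
def read_local_disk(lfn: str) -> bytes:
    raise FileNotFoundError            # stub: pretend there is no local copy

def copy_from_remote(lfn: str, site: str) -> bytes:
    return b"event data"               # stub: WAN copy, e.g. over gridftp

def recall_from_tape(lfn: str) -> bytes:
    return b"event data"               # stub: slow, scheduled tape recall

def read_file(lfn: str, replica_sites: list[str]) -> bytes:
    try:
        return read_local_disk(lfn)    # cheapest: local replica
    except FileNotFoundError:
        pass
    for site in replica_sites:         # next: each remote disk replica in turn
        try:
            return copy_from_remote(lfn, site)
        except ConnectionError:
            continue
    return recall_from_tape(lfn)       # last resort rather than failing the job

print(len(read_file("/atlas/raw/run195/evts.root", ["FNAL", "RAL"])))
```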

Page 31: Data Handling for LHC: Plans and Reality

Metadata Distribution
• Conditions data is needed to make sense of the raw data from the experiments
  – Data on items such as temperatures, detector voltages and gas compositions is needed to turn the ~100M pixel image of the event into a meaningful description in terms of particles, tracks and momenta.
• This data is in an RDBMS, Oracle at CERN, and presents interesting distribution challenges
  – One cannot tightly couple databases across the loosely coupled WLCG sites, for example…
  – Oracle Streams technology improved to deliver the necessary performance, and HTTP caching systems developed to address the need for cross-DBMS distribution.

Average Streams throughput (LCR/s):

                            row size = 100B   row size = 500B   row size = 1000B
Oracle 10g                             4600              2800               1700
Oracle 11gR2                          37000             30000              25000
Oracle 11gR2 (optimized)              40000             40000              34000
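The HTTP caching approach can be sketched as follows. This is a simplified sketch in the spirit of such systems: read-only conditions queries travel as HTTP requests, so identical queries can be answered by a cache near the site instead of the central Oracle service. The URL and query encoding are invented for the example:

```python
import urllib.parse

class ConditionsProxy:
    def __init__(self, origin_url: str):
        self.origin_url = origin_url
        self.cache: dict[str, bytes] = {}
        self.origin_hits = 0

    def query(self, sql: str) -> bytes:
        key = urllib.parse.quote(sql)           # encoded query is the cache key
        if key not in self.cache:
            self.cache[key] = self._fetch(key)  # only a miss reaches the DB
        return self.cache[key]

    def _fetch(self, key: str) -> bytes:
        self.origin_hits += 1                   # stand-in for an HTTP GET of
        return b"conditions payload"            # f"{self.origin_url}?q={key}"

proxy = ConditionsProxy("http://conditions.example.org/data")
proxy.query("SELECT v FROM hv WHERE run = 195847")
proxy.query("SELECT v FROM hv WHERE run = 195847")
print(proxy.origin_hits)                        # -> 1: second query was cached
```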

Page 32: Data Handling for LHC: Plans and Reality

Job Execution Environment
• Jobs submitted to sites depend on large, rapidly changing libraries of experiment-specific code
  – Major problems ensue if updated code is not distributed to every server across the grid (remember, there are x0,000 servers…)
  – Shared filesystems can become a bottleneck if used as a distribution mechanism within a site.
• Approaches
  – Pilot job framework can check to see if the execution host has the correct environment…
  – A global caching file system: CernVM-FS (sketched below).

[Chart, 2011: ATLAS today: 22/1.8M files; 921/115GB]
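A minimal sketch of the global caching filesystem idea behind CernVM-FS: a catalogue maps paths to content hashes, and content is fetched once (over HTTP in reality) and then served from a local cache, so tens of thousands of worker nodes don't each hammer a shared server. The catalogue layout and paths are simplified assumptions, not the real CernVM-FS formats:

```python
import hashlib

def publish(files: dict[str, bytes]):
    """Release-manager side: store content by hash, map each path to a hash."""
    store, catalogue = {}, {}
    for path, data in files.items():
        digest = hashlib.sha1(data).hexdigest()
        store[digest] = data
        catalogue[path] = digest
    return store, catalogue

STORE, CATALOGUE = publish(
    {"/cvmfs/atlas.example/sw/17.2/lib/libCore.so": b"\x7fELF..."})
LOCAL_CACHE: dict[str, bytes] = {}

def open_file(path: str) -> bytes:
    digest = CATALOGUE[path]                  # metadata: path -> content hash
    if digest not in LOCAL_CACHE:             # fetch once per content object...
        LOCAL_CACHE[digest] = STORE[digest]   # ...stand-in for an HTTP GET
    return LOCAL_CACHE[digest]                # ...then every job hits the cache

print(open_file("/cvmfs/atlas.example/sw/17.2/lib/libCore.so")[:4])
```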

Page 33: Data Handling for LHC: Plans and Reality

Outline

• HEP, CERN, LHC and LHC Experiments
• LHC Computing Challenge
• The Technique
  – In outline
  – In more detail
• Towards the Future
• Summary

Page 34: Data Handling for LHC: Plans and Reality

Towards the Future
• Learning from our mistakes
  – We have just completed a review of WLCG operations and services based on 2+ years of operations, with the aim to simplify and harmonise during the forthcoming long shutdown.
  – Key areas to improve are data management & access and exploiting many/multi-core architectures, especially with use of virtualisation.
• Clouds
• Identity Management


Page 37: Data Handling for LHC: Plans and Reality

Integrating With The Cloud?

[Diagram: a Central Task Queue serving Sites A, B and C and, via cloud bursting, a commercial cloud; a User/VO service makes instance requests, running instances pull payloads, and an image maintainer keeps a Shared Image Repository (VMIC)]

(Slide courtesy of Ulrich Schwickerath)


Page 40: Data Handling for LHC: Plans and Reality

Grid Middleware Basics
• Compute Element
  – Standard interface to local workload management systems (batch scheduler)
• Storage Element
  – Standard interface to local mass storage systems
• Resource Broker
  – Tool to analyse user job requests (input data sets, CPU time, data output requirements) and route these to sites according to data and CPU time availability.

Many implementations of the basic principles: Globus, VDT, EDG/EGEE, NorduGrid, OSG

None of this works without…

Page 41: Data Handling for LHC: Plans and Reality


Trust!

Page 42: Data Handling for LHC: Plans and Reality


One step beyond?

Page 43: Data Handling for LHC: Plans and Reality

Outline

• HEP, CERN, LHC and LHC Experiments
• LHC Computing Challenge
• The Technique
  – In outline
  – In more detail
• Towards the Future
• Summary

Page 44: Data Handling for LHC: Plans and Reality

Summary
• WLCG has delivered the capability to manage and distribute the large volumes of data generated by the LHC experiments
  – and the excellent WLCG performance has enabled physicists to deliver results rapidly.
• HEP datasets may not be the most complex or (any longer) massive, but in addressing the LHC computing challenges, the community has delivered
  – the world’s largest computing Grid,
  – practical solutions to requirements for large-scale data storage, distribution and access, and
  – a global trust federation enabling world-wide collaboration.

Page 45: Data Handling for LHC: Plans and Reality


Thank You!

And thanks to Vlado Bahyl, German Cancio, Ian Bird, Jakob Blomer, Eva Dafonte Perez, Fabiola Gianotti, Frédéric Hemmer, Jan Iven, Alberto Pace and Romain Wartel of CERN, Elisa Lanciotti of PIC and K. De, T. Maeno, and S. Panitkin of ATLAS for various unattributed graphics and slides.