
The LHC Computing Challenge


Page 1: The LHC Computing Challenge

The LHC Computing Challenge

Tim Bell, Fabric Infrastructure & Operations Group

Information Technology Department

CERN

2nd April 2009

Page 2: The LHC Computing Challenge

The Four LHC Experiments…

ATLAS
- General purpose
- Origin of mass
- Supersymmetry
- 2,000 scientists from 34 countries

CMS
- General purpose
- Origin of mass
- Supersymmetry
- 1,800 scientists from over 150 institutes

ALICE
- Heavy-ion collisions, to create quark-gluon plasmas
- 50,000 particles in each collision

LHCb
- To study the differences between matter and antimatter
- Will detect over 100 million b and b-bar mesons each year

Page 3: The LHC Computing Challenge

… generate lots of data …

The accelerator generates 40 million particle collisions (events) every second at the centre of each of the four experiments’ detectors

Page 4: The LHC Computing Challenge

… generate lots of data …

… reduced by online computers to a few hundred “good” events per second, which are recorded on disk and magnetic tape at 100-1,000 MegaBytes/sec: ~15 PetaBytes per year for all four experiments.
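As a rough consistency check of these rates (a sketch only; the per-event size and annual running time below are assumed round numbers, not figures from the talk):

```python
# Rough consistency check of the quoted rates (illustrative assumptions only).
EVENT_SIZE_MB = 1.5        # assumed average raw event size
GOOD_EVENTS_PER_SEC = 300  # "a few hundred" selected events per second
SECONDS_PER_YEAR = 1e7     # assumed effective running time per year
EXPERIMENTS = 4

rate_mb_s = GOOD_EVENTS_PER_SEC * EVENT_SIZE_MB               # per experiment
yearly_pb = rate_mb_s * SECONDS_PER_YEAR * EXPERIMENTS / 1e9  # MB -> PB

print(f"~{rate_mb_s:.0f} MB/s per experiment")    # falls in the quoted 100-1,000 MB/s band
print(f"~{yearly_pb:.0f} PB/year for all four")   # same order as the quoted ~15 PB/year
```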

Page 5: The LHC Computing Challenge

[Diagram: Data Handling and Computation for Physics Analysis. It shows the detector feeding the event filter (selection & reconstruction), raw data, event reprocessing, event summary data and event simulation, leading on to batch and interactive physics analysis of analysis objects (extracted by physics topic), with processed data and the CERN portion marked.]

Page 6: The LHC Computing Challenge

Summary of Computing Resource Requirements
All experiments, 2008 (from LCG TDR, June 2005)

                       CERN   All Tier-1s   All Tier-2s   Total
CPU (MSPECint2000s)      25            56            61     142
Disk (PetaBytes)          7            31            19      57
Tape (PetaBytes)         18            35             -      53

Shares of the totals: CPU – CERN 18%, Tier-1s 39%, Tier-2s 43%; Disk – CERN 12%, Tier-1s 55%, Tier-2s 33%; Tape – CERN 34%, Tier-1s 66%.

… leading to a high box count

CPU: ~2,500 PCs; Disk and Tape: another ~1,500 boxes
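The percentage shares quoted on the slide follow from the table, up to rounding (this sketch gives 54% rather than the quoted 55% for Tier-1 disk):

```python
# Derive the resource shares quoted on the slide from the table above.
requirements = {
    "CPU (MSI2K)": {"CERN": 25, "Tier-1s": 56, "Tier-2s": 61},
    "Disk (PB)":   {"CERN": 7,  "Tier-1s": 31, "Tier-2s": 19},
    "Tape (PB)":   {"CERN": 18, "Tier-1s": 35},   # no Tier-2 tape
}

for resource, by_tier in requirements.items():
    total = sum(by_tier.values())
    shares = ", ".join(f"{tier} {100 * value / total:.0f}%" for tier, value in by_tier.items())
    print(f"{resource}: {shares} (total {total})")
```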

Page 7: The LHC Computing Challenge

Computing Service Hierarchy

Tier-0 – the accelerator centre
- Data acquisition & initial processing
- Long-term data curation
- Distribution of data to Tier-1 centres

Canada – TRIUMF (Vancouver)
France – IN2P3 (Lyon)
Germany – Forschungszentrum Karlsruhe
Italy – CNAF (Bologna)
Netherlands – NIKHEF/SARA (Amsterdam)
Nordic countries – distributed Tier-1
Spain – PIC (Barcelona)
Taiwan – Academia Sinica (Taipei)
UK – CLRC (Oxford)
US – FermiLab (Illinois), Brookhaven (NY)

Tier-1 – “online” to the data acquisition process; high availability
- Managed mass storage
- Data-heavy analysis
- National, regional support

Tier-2 – ~100 centres in ~40 countries
- Simulation
- End-user analysis – batch and interactive
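Purely as an illustration, the tier model above can be summarised in a small data structure; the role descriptions are taken from the slide, the encoding itself is hypothetical:

```python
# Illustrative encoding of the WLCG tier model described on this slide.
TIERS = {
    "Tier-0": {
        "where": "CERN, the accelerator centre",
        "roles": ["data acquisition & initial processing",
                  "long-term data curation",
                  "distribution of data to Tier-1 centres"],
    },
    "Tier-1": {
        "where": "the national centres listed above",
        "roles": ["managed mass storage", "data-heavy analysis",
                  "national, regional support"],
    },
    "Tier-2": {
        "where": "~100 centres in ~40 countries",
        "roles": ["simulation", "end-user analysis (batch and interactive)"],
    },
}

for tier, info in TIERS.items():
    print(f"{tier} ({info['where']}): " + "; ".join(info["roles"]))
```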

Page 8: The LHC Computing Challenge

The Grid

• Timely Technology!
• Deploy to meet LHC computing needs.
• Challenges for the Worldwide LHC Computing Grid Project due to
  – worldwide nature
    • competing middleware…
  – newness of technology
    • competing middleware…
  – scale
  – …

Page 9: The LHC Computing Challenge

Interoperability in action

Page 10: The LHC Computing Challenge

Reliability

[Chart: Site Reliability, Tier-2 Sites]

83 Tier-2 sites being monitored

Page 11: The LHC Computing Challenge

• 1990s – Unix wars – 6 different Unix flavours

• Linux allowed all users to align behind a single OS which was low cost and dynamic

• Scientific Linux is based on Red Hat with extensions for key usability and performance features
  – AFS global file system
  – XFS high-performance file system

• But how to deploy without proprietary tools?

Why Linux?

See EDG/WP4 report on current technology (http://cern.ch/hep-proj-grid-fabric/Tools/DataGrid-04-TED-0101-3_0.pdf) or “Framework for Managing Grid-enabled Large Scale Computing Fabrics” (http://cern.ch/quattor/documentation/poznanski-phd.pdf) for reviews of various packages.

Page 12: The LHC Computing Challenge

• Commercial Management Suites
  – (Full) Linux support rare (5+ years ago…)
  – Much work needed to deal with specialist HEP applications; insufficient reduction in staff costs to justify license fees.
• Scalability
  – 5,000+ machines to be reconfigured
  – 1,000+ new machines per year
  – Configuration change rate of 100s per day

Deployment

See EDG/WP4 report on current technology (http://cern.ch/hep-proj-grid-fabric/Tools/DataGrid-04-TED-0101-3_0.pdf) or “Framework for Managing Grid-enabled Large Scale Computing Fabrics” (http://cern.ch/quattor/documentation/poznanski-phd.pdf) for reviews of various packages.

Page 13: The LHC Computing Challenge

Dataflows and rates

Remember this figure

[Dataflow diagram with average rates of 1430 MB/s, 1120 MB/s, 700 MB/s, 700 MB/s and 420 MB/s on the various links; 1600 MB/s and 2000 MB/s shown in parentheses]

Averages! Need to be able to support 2x for recovery!

Scheduled work only!
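Since the figures are averages and the slide asks for 2x headroom for recovery, a minimal sketch of what that provisioning implies:

```python
# The quoted rates are averages; the slide calls for 2x headroom for catch-up after outages.
average_rates_mb_s = [1430, 1120, 700, 700, 420]   # rates from the figure
RECOVERY_FACTOR = 2

for rate in average_rates_mb_s:
    print(f"average {rate} MB/s -> provision for ~{rate * RECOVERY_FACTOR} MB/s")
```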

Page 14: The LHC Computing Challenge

• 15 PB/year. Peak rate to tape >2 GB/s
  – 3 full SL8500 robots/year
• Requirement in first 5 years to reread all past data between runs
  – 60 PB in 4 months: 6 GB/s
• Can run drives at sustained 80 MB/s
  – 75 drives flat out merely for controlled access
• Data volume has interesting impact on choice of technology
  – Media use is advantageous: high-end technology (3592, T10K) favoured over LTO.

Volumes & Rates
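The drive count follows from the reread requirement; a small sketch of the arithmetic (taking four months as roughly 10^7 seconds):

```python
# Reread requirement: 60 PB in ~4 months, with drives sustaining 80 MB/s.
REREAD_VOLUME_PB = 60
FOUR_MONTHS_S = 4 * 30 * 24 * 3600        # ~1.04e7 seconds
DRIVE_RATE_MB_S = 80

required_mb_s = REREAD_VOLUME_PB * 1e9 / FOUR_MONTHS_S   # PB -> MB
drives_needed = required_mb_s / DRIVE_RATE_MB_S

print(f"required rate ~{required_mb_s / 1e3:.1f} GB/s")   # ~5.8 GB/s, quoted as 6 GB/s
print(f"drives needed ~{drives_needed:.0f}")              # ~72, quoted as 75
```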

Page 15: The LHC Computing Challenge

Castor Architecture

[Architecture diagram, detailed view. Central services: NameServer, VDQM, VMGR. Disk cache subsystem: Stager and its database, request handler (RH), RR, Scheduler, services (DBSvc, JobSvc, QrySvc, ErrorSvc), StagerJob, MigHunter, GC, RTCPClientD, and disk servers running movers. Tape archive subsystem: tape servers running the tape daemon and RTCPD. Clients submit requests via the request handler.]
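As a very rough illustration of how these components cooperate, here is a simplified, hypothetical model of a read request served from the disk cache or recalled from tape; the file paths and helper names are invented for the example and this is not CASTOR code:

```python
# Hypothetical, heavily simplified model of a read request flowing through the
# components named above (illustration only; the real CASTOR differs in many details).

disk_cache = {"/castor/run123.raw"}                        # files currently on disk servers
name_server = {"/castor/run789.raw": ("T10K-0042", 17)}    # path -> (tape volume, segment)

def recall_from_tape(path):
    """Tape archive subsystem: a drive is queued (VDQM role), RTCPD copies the segment back."""
    tape, segment = name_server[path]
    print(f"recall {path}: drive queued for {tape}, segment {segment} copied to disk cache")
    disk_cache.add(path)

def read_file(path):
    """Stager view: serve from the disk cache, recalling from tape on a cache miss."""
    if path not in disk_cache:
        recall_from_tape(path)
    print(f"scheduler picks a disk server; a mover streams {path} to the client")

read_file("/castor/run123.raw")   # cache hit
read_file("/castor/run789.raw")   # cache miss -> tape recall
```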

Page 16: The LHC Computing Challenge

Castor Performance

Page 17: The LHC Computing Challenge

• LEP, CERN’s last accelerator, started in 1989 and shut down 10 years later.
  – First data recorded to IBM 3480s; at least 4 different technologies used over the period.
  – All data ever taken, right back to 1989, was reprocessed and reanalysed in 2001/2.
• LHC starts in 2007 and will run until at least 2020.
  – What technologies will be in use in 2022 for the final LHC reprocessing and reanalysis?
• Data repacking required every 2-3 years.
  – Time consuming
  – Data integrity must be maintained

Long lifetime
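To see why repacking is time consuming, a rough estimate; the archive size and drive count here are illustrative assumptions, only the 80 MB/s drive rate comes from the talk:

```python
# Rough estimate of how long one repacking pass takes (illustrative assumptions only).
ARCHIVE_PB = 30             # assumed archive size a few years into LHC running
DRIVE_RATE_MB_S = 80        # sustained drive rate quoted on the previous slide
READ_DRIVES = 20            # assumed drives dedicated to reading the old media
                            # (a similar number is needed for writing the new copies)

seconds = ARCHIVE_PB * 1e9 / (READ_DRIVES * DRIVE_RATE_MB_S)
print(f"~{seconds / 86400:.0f} days of continuous streaming per repack pass")  # ~217 days
```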

Page 18: The LHC Computing Challenge

Disk capacity & I/O rates

                    1996            2000            2006
Drive capacity      4 GB            50 GB           500 GB
Drive I/O rate      10 MB/s         20 MB/s         60 MB/s
I/O for 1 TB        250 x 10 MB/s   20 x 20 MB/s    2 x 60 MB/s
                    = 2,500 MB/s    = 400 MB/s      = 120 MB/s

CERN now purchases two different storage server models: capacity oriented and throughput oriented.

• fragmentation increases management complexity
• (purchase overhead also increased…)
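The falling I/O available per terabyte follows directly from the drive generations in the table; a minimal sketch:

```python
# Aggregate I/O available per TB of capacity, per drive generation (from the table above).
generations = {1996: (4, 10), 2000: (50, 20), 2006: (500, 60)}  # year: (GB/drive, MB/s per drive)

for year, (capacity_gb, rate_mb_s) in generations.items():
    drives_per_tb = 1000 / capacity_gb
    print(f"{year}: {drives_per_tb:.0f} drives/TB -> {drives_per_tb * rate_mb_s:.0f} MB/s per TB")
```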

Page 19: The LHC Computing Challenge

– Daily backup volumes of around 18 TB to 10 Linux TSM servers

… and backup – TSM on Linux
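For scale, a sketch of the per-server load this implies, assuming the volume is spread evenly over the day and the ten servers (an assumption, not a figure from the slide):

```python
# Per-server backup load implied by ~18 TB/day across 10 TSM servers (illustrative).
DAILY_TB = 18
SERVERS = 10
SECONDS_PER_DAY = 86400

per_server_mb_s = DAILY_TB * 1e6 / SERVERS / SECONDS_PER_DAY
print(f"~{per_server_mb_s:.0f} MB/s sustained per TSM server")   # ~21 MB/s
```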

Page 20: The LHC Computing Challenge

Capacity Requirements

[Chart: Predicted Growth in Offline Computing Requirements, 2007-2020. Series: CPU, Disk, Tape. Left axis: MSI2K or Disk PB (0-1200); right axis: Tape PB (0-300).]

Page 21: The LHC Computing Challenge

Power Outlook

[Chart: Predicted Growth in Electrical Power Demand, 2007-2020, in MW (0-25). Series: CPU, Disk, Other Services.]

Page 22: The LHC Computing Challenge

• Immense Challenges & Complexity
  – Data rates, developing software, lack of standards, worldwide collaboration, …
• Considerable Progress in last ~5-6 years
  – WLCG service exists
  – Petabytes of data transferred
• But more data is coming in November…
  – Will the system cope with chaotic analysis?
  – Will we understand the system enough to identify problems and fix the underlying causes?
  – Can we meet requirements given the power available?

Summary
