
David Quarrie: The ATLAS Experiment

The ATLAS Experiment
David Quarrie

LBNL

David Quarrie: The ATLAS Experiment2

Overview

• The ATLAS Detector and Physics Goals
• The ATLAS Collaboration and Management
• The Trigger and Data Acquisition System
• The Computing and Software
• Software Deployment and Production
• Reality Checks and Stress Testing
• Summary

David Quarrie: The ATLAS Experiment

LHC
• √s = 14 TeV (7 times higher than Tevatron/Fermilab) → search for new massive particles up to m ~ 5 TeV

• L_design = 10^34 cm^-2 s^-1 (>10^2 higher than Tevatron/Fermilab)

→ search for rare processes with small σ (N = Lσ)

(Diagram: the 27 km LHC ring, previously used for the e+e- LEP machine in 1989-2000, with its four experiments: ATLAS and CMS (pp, general purpose), ALICE (heavy ions), LHCb (pp, B-physics). Start: Summer 2007)

David Quarrie: The ATLAS Experiment

The ATLAS physics goals

Search for the Standard Model Higgs boson over ~ 115 < mH < 1000 GeV

Search for physics beyond the SM (Supersymmetry, q/l compositeness, leptoquarks, W’/Z’, heavy q/l, Extra-dimensions, ….) up to the TeV-range

Precise measurements:
-- W mass
-- top mass, couplings and decay properties
-- Higgs mass, spin, couplings (if the Higgs is found)
-- B-physics (complementing LHCb): CP violation, rare decays, B0 oscillations
-- QCD jet cross-section and αs
-- etc.

Study of the phase transition at high density from hadronic matter to a plasma of deconfined quarks and gluons (complementing ALICE). The transition plasma → hadronic matter happened in the universe ~ 10^-5 s after the Big Bang.

Etc. etc.

David Quarrie: The ATLAS Experiment

Cross Sections and Production Rates

Rates for L = 10^34 cm^-2 s^-1 (LHC):

• Inelastic proton-proton reactions: 10^9 / s
• bb pairs: 5 × 10^6 / s
• tt pairs: 8 / s
• W → e ν: 150 / s
• Z → e e: 15 / s
• Higgs (150 GeV): 0.2 / s
• Gluino, squarks (1 TeV): 0.03 / s

(The only problem: you have to detect them !)
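The rates above follow directly from N = Lσ. A minimal sketch (Python) using the design luminosity quoted earlier; the cross-section value is an illustrative assumption chosen to reproduce the ~150 Hz for W → e ν on this slide, not an official ATLAS number.

```python
# Minimal sketch: event rate N = L * sigma at LHC design luminosity.
# The cross section below is illustrative only (chosen to match the
# ~150 Hz for W -> e nu quoted on the slide), not an official number.

L_design = 1.0e34          # cm^-2 s^-1, LHC design luminosity
barn_to_cm2 = 1.0e-24      # 1 barn = 1e-24 cm^2

sigma_w_enu_nb = 15.0      # assumed sigma x BR for W -> e nu, in nanobarn
sigma_cm2 = sigma_w_enu_nb * 1.0e-9 * barn_to_cm2

rate_hz = L_design * sigma_cm2
print(f"W -> e nu rate at design luminosity: {rate_hz:.0f} Hz")  # ~150 Hz
```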

David Quarrie: The ATLAS Experiment

The Underground Cavern at Pit-1 forthe ATLAS Detector

Length = 55 m, Width = 32 m, Height = 35 m

David Quarrie: The ATLAS Experiment

ATLAS
Length: ~ 46 m, Radius: ~ 12 m, Weight: ~ 7000 tons
~ 10^8 electronic channels
~ 3000 km of cables

• Tracking (|η|<2.5, B=2T) : -- Si pixels and strips -- Transition Radiation Detector (e/π separation)

• Calorimetry (|η|<5) : -- EM : Pb-LAr -- HAD: Fe/scintillator (central), Cu/W-LAr (fwd)

• Muon Spectrometer (|η|<2.7) : air-core toroids with muon chambers

ATLAS superimposed on the 5 floors of building 40

David Quarrie: The ATLAS Experiment

H → ZZ(*) → 4 l (l = e, µ)

(Feynman diagram: gluon-gluon fusion through a top loop, g g → H → Z Z(*) → 4 leptons, with m_Z reconstructed from the lepton pairs)

“Gold-plated” channel for Higgs discovery at LHC

Simulation of a H → µµ ee event in ATLAS

Signal expected in ATLASafter 1 year of LHC operation

Physics example

David Quarrie: The ATLAS Experiment

Inner Detector (ID)

The Inner Detector (ID) is organized into four sub-systems:

Pixels (0.8 × 10^8 channels)

Silicon Tracker (SCT) (6 × 10^6 channels)

Transition Radiation Tracker (TRT) (4 × 10^5 channels)

Common ID items

David Quarrie: The ATLAS Experiment

Pixels

All FE chips have been delivered (all tested, showing a yield of 82%)

The sensor production is finished for 2 layers, and on time for 3 layers

The module production rate (with bump-bonding at two companies) has improved, and is on track for 3 layers in time

First completed disk (two layers of 24 modules each, with 2,200,000 channels of electronics)

ATLAS plans to have the Pixels operational for LHC start-up

The series production of final staves (barrel) and sectors (end-cap disks) has passed the 10% mark; this activity is now on the critical path of the Pixel project

David Quarrie: The ATLAS Experiment

Inner Detector Progress Summary

Pixels: Steady ‘on-schedule’ progress on all aspects of the sub-system for 3 layers

SCT: Module mounting (‘macro-assembly’) on the 4 barrel cylinders is ongoing (the first two cylinders are finished and tested, and one is at CERN)

Module mounting is progressing on the forward disks (the first 8 disks are completed)

We have to recover from a problem with LMTs (low-mass tapes for the services)

TRT: Barrel module mounting into support structure is completed

End-cap wheel production is now also smooth, and the stacking at CERN into the end-cap structures is progressing

TRT barrel support with all modules

First complete SCT barrel cylinder

David Quarrie: The ATLAS Experiment

LAr and Tile Calorimeters

Tile barrel

Tile extended barrel

LAr forward calorimeter (FCAL)

LAr hadronic end-cap (HEC)

LAr EM end-cap (EMEC)

LAr EM barrel

David Quarrie: The ATLAS Experiment

LAr EM Barrel Calorimeter and Solenoid Commissioning at the Surface

The barrel EM calorimeter is installed in the cryostat, and after insertion of the solenoid, the cold vessel was closed and welded

A successful complete cold test (with LAr) was made during summer 2004 in hall 180

At the end of October the cryostat was transported to the pit and lowered into the cavern

LAr barrel EM calorimeter after insertion into the cryostat

Solenoid just before insertion into the cryostat

David Quarrie: The ATLAS Experiment

David Quarrie: The ATLAS Experiment

Barrel Toroid coil transport and installation

David Quarrie: The ATLAS Experiment

The preparations for installation of the fifth BT coil in the cavern are well-advanced

Production of the warm structure components is nearing completion, matching the required schedule

David Quarrie: The ATLAS Experiment

ATLAS Collaboration

34 Countries
151 Institutions
1770 Scientific Authors

Albany, Alberta, NIKHEF Amsterdam, Ankara, LAPP Annecy, Argonne NL, Arizona, UT Arlington, Athens, NTU Athens, Baku, IFAE Barcelona, Belgrade, Bergen, Berkeley LBL and UC, Bern, Birmingham, Bonn, Boston, Brandeis, Bratislava/SAS Kosice,

Brookhaven NL, Bucharest, Cambridge, Carleton, Casablanca/Rabat, CERN, Chinese Cluster, Chicago, Clermont-Ferrand, Columbia, NBI Copenhagen, Cosenza, INP Cracow, FPNT Cracow, Dortmund, JINR Dubna, Duke, Frascati, Freiburg, Geneva,

Genoa, Glasgow, LPSC Grenoble, Technion Haifa, Hampton, Harvard, Heidelberg, Hiroshima, Hiroshima IT, Indiana, Innsbruck, Iowa SU, Irvine UC, Istanbul Bogazici, KEK, Kobe, Kyoto, Kyoto UE, Lancaster, Lecce, Lisbon LIP, Liverpool, Ljubljana,

QMW London, RHBNC London, UC London, Lund, UA Madrid, Mainz, Manchester, Mannheim, CPPM Marseille, Massachusetts, MIT, Melbourne, Michigan, Michigan SU, Milano, Minsk NAS, Minsk NCPHEP, Montreal, FIAN Moscow, ITEP Moscow, MEPhI Moscow, MSU Moscow, Munich LMU, MPI Munich, Nagasaki IAS, Naples, Naruto UE, New Mexico, Nijmegen,

BINP Novosibirsk, Ohio SU, Okayama, Oklahoma, LAL Orsay, Oslo, Oxford, Paris VI and VII, Pavia, Pennsylvania, Pisa, Pittsburgh, CAS Prague, CU Prague, TU Prague, IHEP Protvino, Ritsumeikan, UFRJ Rio de Janeiro, Rochester, Rome I, Rome II, Rome III,

Rutherford Appleton Laboratory, DAPNIA Saclay, Santa Cruz UC, Sheffield, Shinshu, Siegen, Simon Fraser Burnaby, Southern Methodist Dallas, NPI Petersburg, Stockholm, KTH Stockholm, Stony Brook, Sydney, AS Taipei, Tbilisi, Tel Aviv,

Thessaloniki, Tokyo ICEPP, Tokyo MU, Tokyo UAT, Toronto, TRIUMF, Tsukuba, Tufts, Udine, Uppsala, Urbana UI, Valencia, UBC Vancouver, Victoria, Washington, Weizmann Rehovot, Wisconsin, Wuppertal, Yale, Yerevan

David Quarrie: The ATLAS Experiment

ATLAS Appointments (March 2005)

ATLAS Plenary Meeting

Collaboration Board (Chair: S. Bethke; Deputy: C. Oram), with a CB Chair Advisory Group

Resources Review Board

Spokesperson: P. Jenni (Deputies: F. Gianotti and S. Stapnes)
Technical Co-ordinator: M. Nessi
Resources Co-ordinator: M. Nordberg

Executive Board:
• Inner Detector (L. Rossi, K. Einsweiler, M. Tyndel, F. Dittus)
• LAr Calorimeter (H. Oberlack, D. Fournier, J. Parsons)
• Tile Calorimeter (B. Stanek)
• Muon Instrumentation (G. Mikenberg, F. Taylor, S. Palestini)
• Magnet System (H. ten Kate)
• Trigger/DAQ (C. Bee, N. Ellis, L. Mapelli)
• Electronics Co-ordination (P. Farthouat)
• Computing Co-ordination (D. Barberis, D. Quarrie)
• Physics Co-ordination (G. Polesello)
• Additional Members (H. Gordon, A. Zaitsev)

David Quarrie: The ATLAS Experiment19

ATLAS Trigger

(Figure: event rate in Hz versus decision time, from QED processes at ~10^8 Hz down to W/Z, top, Z* and Higgs production, with decision times ranging from 25 ns through ms and seconds to hours; the chain runs from the on-line Level-1 and Level-2 triggers to off-line reconstruction & analyses at the Tier-0/1/2 centers)

• Level-1 Trigger: 40 MHz input; hardware (ASIC, FPGA); massively parallel architecture; pipelines; ~2 µs
• Level-2 Trigger: ~75 kHz input; software PC farm; local reconstruction; ~10 ms
• Level-3 Trigger: ~1 kHz input; software PC farm; full reconstruction; ~1 sec
• Output to mass storage and offline analyses

David Quarrie: The ATLAS Experiment20

ATLAS Trigger Hierarchy

• ATLAS trigger comprises 3 levels
– LVL1
  • Custom electronics: ASICs, FPGAs
  • Max. time 2.5 µs
  • Uses Calorimeter and Muon detector data
  • Reduces the interaction rate to 75 kHz
– LVL2
  • Software trigger based on a Linux PC farm (~500 dual CPUs)
  • Mean processing time ~10 ms
  • Uses selected data from all detectors (Regions of Interest indicated by LVL1)
  • Reduces the LVL1 rate to ~1 kHz
– Event Filter
  • Software trigger based on a Linux PC farm (~1600 dual CPUs)
  • Mean processing time ~1 s
  • Full event & calibration data available
  • Reduces the LVL2 rate to ~200 Hz
  • Note: a large fraction of the HLT processor cost is deferred, so initial running will be with reduced computing capacity

(A back-of-the-envelope check of these rate reductions follows below.)
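A quick sketch of the rejection factors between the trigger levels and of the resulting bandwidth to mass storage. The ~1.6 MB event size is an assumption, chosen to be consistent with the 320 MB/sec raw rate and ~200 Hz output quoted elsewhere in the talk.

```python
# Sketch: rate reduction through the three ATLAS trigger levels and the
# resulting bandwidth written to mass storage.
# The event size (~1.6 MB) is an assumption consistent with the 320 MB/s
# raw-data rate and ~200 Hz Event Filter output quoted on other slides.

levels = [
    ("Collision rate", 40.0e6),   # Hz
    ("LVL1 accept",    75.0e3),
    ("LVL2 accept",     1.0e3),   # ~1-2 kHz depending on the slide
    ("Event Filter",    200.0),
]
event_size_mb = 1.6               # assumed raw event size in MB

for (name_in, rate_in), (name_out, rate_out) in zip(levels, levels[1:]):
    print(f"{name_in:>15} -> {name_out:<13}: rejection factor ~{rate_in / rate_out:,.0f}")

storage_mb_per_s = levels[-1][1] * event_size_mb
print(f"Bandwidth to mass storage: ~{storage_mb_per_s:.0f} MB/s")
```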

David Quarrie: The ATLAS Experiment21

ATLAS Trigger & DAQ Architecture

(Diagram: dataflow from the detector front-ends through the three trigger levels to mass storage)

• Detector front-end pipelines (Calo, MuTrCh and other detectors): 40 MHz, ~1 PB/s
• LVL1 (specialized h/w: ASICs, FPGAs; 2.5 µs): accept rate 75 kHz; data flow through Read-Out Drivers (RODs) and Read-Out Links into Read-Out Buffers (ROBs) / Read-Out Sub-systems (ROS) at ~120 GB/s
• LVL2 (~10 ms): RoI Builder, L2 Supervisor, L2 network and L2 Processing Units; requests only RoI data (1-2% of the event); accept rate ~2 kHz
• Event building: Dataflow Manager (DFM), Event Builder network, Sub-Farm Input (SFI); ~2+4 GB/s
• Event Filter (~sec): Event Filter network, Event Filter Processors (EFP), Sub-Farm Output (SFO); accept rate ~200 Hz; ~300 MB/s to mass storage

David Quarrie: The ATLAS Experiment22

ATLAS Three Level Trigger Architecture

• LVL1 (2.5 µs): decision made with coarse-granularity calorimeter data and muon trigger chamber data. Buffering on the detector.

• LVL2 (~10 ms): uses Region of Interest data (ca. 2%) with full granularity and combines information from all detectors; performs fast rejection. Buffering in the ROBs.

• Event Filter (~ sec): refines the selection and can perform event reconstruction at full granularity using the latest alignment and calibration data. Buffering in the EB & EF.

David Quarrie: The ATLAS Experiment23

RoI Mechanism

LVL2 uses Regions of Interest as identified by Level-1

• Local data reconstruction, analysis, and sub-detector matching of RoI data

LVL1 triggers on high pT objects

• Calorimeter cells and muon chambers to find e/γ/τ-jet-µ candidates above thresholds

The total amount of RoI data is minimal

• ~2% of the Level-1 throughput but it has to be accessed at 75 kHz

(Event display: H → 2e + 2µ, with the Regions of Interest, e.g. the two electrons, highlighted)

David Quarrie: The ATLAS Experiment24

ATLAS Computing Characteristics

• Large, complex detector
– ~10^8 channels

• Long lifetime
– Project started in 1992, first data in 2007, last data 2027?

• 320 MB/sec raw data rate (x2 for processed and simulated data)
– ~3 PB/year raw data (see the sketch below)

• Large, geographically dispersed collaboration
– 1770 people, 151 institutions, 34 countries
– Many are, and most will become, software developers
– Currently ~150 FTE in offline software (~400 people)

• Scale and complexity reflected in software
– ~1000 packages, ~7000 C++ classes, ~2M lines of code
– ~70% code is algorithmic (written by physicists)
– ~30% infrastructure, framework (written by software engineers)
– Provide robustness but plan for evolution
– Requires enabling technologies
– Requires management & coherency
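The ~3 PB/year figure follows from the 320 MB/sec raw rate and the machine's live time. A minimal sketch; the ~10^7 seconds of data-taking per year is a standard rule-of-thumb assumption rather than a number taken from this talk.

```python
# Sketch: annual raw-data volume from the 320 MB/s raw rate.
# The 1e7 seconds of data-taking per year is a rule-of-thumb assumption.

raw_rate_mb_per_s = 320.0
seconds_per_year = 1.0e7          # assumed effective data-taking time per year

raw_pb_per_year = raw_rate_mb_per_s * seconds_per_year / 1.0e9  # MB -> PB
print(f"Raw data: ~{raw_pb_per_year:.1f} PB/year")              # ~3 PB/year, as on the slide
print(f"Including processed and simulated data (x2): ~{2 * raw_pb_per_year:.1f} PB/year")
```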

David Quarrie: The ATLAS Experiment25

Computing & Software Management

David Quarrie: The ATLAS Experiment26

Software Methodology

• Object-Oriented, using C++ as the programming language
– Some wrapped FORTRAN and Java
– Python as the interactive & configuration language

• Heavy use of components behind abstract interfaces (see the sketch below)
– Support multiple implementations
– Robustness & evolution
– Decoupling of dependencies

• Lightweight development process
– Emphasis on automation and feedback rather than a very formal process

• A previous attempt at developing a software system failed because of an overly rigorous software process that was decoupled from the physicist developers
– Make it easy for developers to do the “right thing”
– Some requirements/design reviews
– Just completing 10 sub-system reviews
  • 2 weeks each, 4-5 reviewers
  • Focus on the client viewpoint and experience from DC2 (see later)
  • Feedback into the planning process
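A minimal, hypothetical Python sketch of the "components behind abstract interfaces" idea: client code depends only on the interface, so implementations can be swapped without touching the clients. The interface and class names are invented for illustration and are not actual ATLAS classes.

```python
# Hypothetical sketch of "components behind abstract interfaces":
# clients see only the interface, so implementations can be swapped
# (e.g. for testing, or when a technology is replaced).
from abc import ABC, abstractmethod


class IMagneticFieldSvc(ABC):
    """Abstract interface; names invented for illustration."""

    @abstractmethod
    def field_at(self, x, y, z):
        """Return the field (Bx, By, Bz) in Tesla at a point in metres."""


class UniformFieldSvc(IMagneticFieldSvc):
    """Trivial implementation: constant 2 T solenoid field along z."""

    def field_at(self, x, y, z):
        return (0.0, 0.0, 2.0)


class MapFieldSvc(IMagneticFieldSvc):
    """Another implementation, e.g. backed by a measured field map."""

    def __init__(self, field_map):
        self._map = field_map

    def field_at(self, x, y, z):
        return self._map.get((round(x), round(y), round(z)), (0.0, 0.0, 0.0))


def bend_radius(pt_gev, field):
    """Client code depends only on the IMagneticFieldSvc interface."""
    bz = field.field_at(0.0, 0.0, 0.0)[2]
    return pt_gev / (0.3 * bz)   # radius in metres for pT in GeV, B in Tesla


print(bend_radius(1.0, UniformFieldSvc()))   # ~1.67 m for 1 GeV in a 2 T field
```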

David Quarrie: The ATLAS Experiment27

Simulated Data Processing

• Used to design the detectors and trigger and to estimate how well reconstruction is being performed
– Comparison with “truth”

• Generators
– Creation of particles following a theoretical prediction of physics

• Simulation
– Tracking of particles through the detector material and magnetic field
– Scattering and decays of particles

• Pile-up [optional]
– Addition of multiple interactions per beam crossing and cavern backgrounds (e.g. beam-gas, beam-halo interactions)

• Digitization
– Folding in the detector response to create electronic channel contents

• Final data format identical to that actually produced by the data acquisition electronics (see later)
– With the optional addition of “truth”

David Quarrie: The ATLAS Experiment28

Simulated Data Dataflow

David Quarrie: The ATLAS Experiment29

Primary Data Processing

• Raw data through Physics Analysis
– Detector reconstruction
  • Correction of non-linear detector & electronics response
  • Correction (and determination) of intra-detector mis-alignments
  • Local pattern recognition within a sub-detector (e.g. track segment finding)
– Combined reconstruction
  • Combining results across detectors
  • Tentative particle identification & energy flow (e.g. jets)
  • Correction (and determination) of inter-detector mis-alignments
– Physics Analysis
  • Final particle identification
  • Physics hypotheses matching

• Online Trigger
– Performance-optimized reconstruction

• Online Monitoring & Calibration
– Simplified detector performance monitoring
– Determination of detector response & mis-alignments

David Quarrie: The ATLAS Experiment30

Control Framework

• Captures common behaviour for HEP processing
– Processing stages: generation, simulation, digitization, reconstruction, analysis
– Online (trigger & monitoring) & offline

• The control framework steers a series of modules to perform transformations
– Component based
– Dynamically reconfigurable

• Although the framework captures common behaviour, it’s important to make it as flexible and extensible as possible

• Blackboard model (see the sketch below)
– Algorithms register and retrieve data on a shared blackboard
– Component decoupling

• Athena Framework is a common project with LHCb
– Both shared and ATLAS-specific components
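A minimal sketch of the blackboard model described above: algorithms never call each other directly, they only register and retrieve data on a shared store, which is what decouples the components. All class names and keys here are invented for illustration; this is not the actual Athena API.

```python
# Minimal sketch of the blackboard model: algorithms communicate only
# through a shared store, never by calling each other directly.
# Names are invented for illustration; this is not the real Athena API.

class Blackboard:
    def __init__(self):
        self._data = {}

    def record(self, key, obj):
        self._data[key] = obj

    def retrieve(self, key):
        return self._data[key]


class HitMaker:
    """Upstream algorithm: produces raw hits and registers them."""
    def execute(self, store):
        store.record("RawHits", [(1.0, 2.0), (1.5, 2.4), (9.0, 0.1)])


class TrackFinder:
    """Downstream algorithm: retrieves hits, registers reconstructed tracks."""
    def execute(self, store):
        hits = store.retrieve("RawHits")
        tracks = [h for h in hits if h[1] > 1.0]   # toy 'pattern recognition'
        store.record("Tracks", tracks)


# The framework steers a configurable sequence of modules per event.
store = Blackboard()
for alg in [HitMaker(), TrackFinder()]:
    alg.execute(store)

print(store.retrieve("Tracks"))   # [(1.0, 2.0), (1.5, 2.4)]
```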

David Quarrie: The ATLAS Experiment31

Athena Object Diagram

(Diagram: the Application Manager steers a set of Algorithms; Algorithms exchange data through the Transient Event Store, Transient Detector Store and Transient Histogram Store, each backed by a Persistency Service with Converters reading/writing Data Files; supporting components include the Event Data Service, Detector Data Service, Message Service, JobOptions Service, Particle Properties Service, Histogram Service and other Services)

David Quarrie: The ATLAS Experiment32

Athena Components

• Algorithms (see the sketch below)
– Provide basic per-event processing
– Share a common interface (state machine)

• Tools
– More specialized but more flexible than Algorithms

• Data Stores (blackboards)
– Data registered by one Algorithm/Tool can be retrieved by another
– Multiple stores handle different lifetimes (per event, per job, etc.)

• Services
– E.g. scripting, random numbers, histogramming

• Converters
– Transform data from one representation to another (e.g. transient/persistent)

• Properties
– Adjustable parameters of components
– Can be modified at run-time to configure a job
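A toy illustration of the Algorithm idea above: a common initialize/execute/finalize interface (the "state machine") plus a Property that can be overridden at configuration time. A hypothetical sketch only, not the real Gaudi/Athena classes.

```python
# Toy sketch of the common Algorithm interface (state machine) with a
# configurable Property. Hypothetical code, not the real Gaudi/Athena classes.
from abc import ABC, abstractmethod


class Algorithm(ABC):
    def __init__(self, name, **properties):
        self.name = name
        self.properties = properties     # adjustable parameters of the component

    def initialize(self):                # called once, before the event loop
        pass

    @abstractmethod
    def execute(self, event):            # called once per event
        ...

    def finalize(self):                  # called once, after the event loop
        pass


class PtFilter(Algorithm):
    def execute(self, event):
        cut = self.properties.get("PtCut", 5.0)      # Property with a default
        event["selected"] = [pt for pt in event["pts"] if pt > cut]


# The framework runs every configured Algorithm through the same state machine.
algs = [PtFilter("HighPtFilter", PtCut=20.0)]        # Property override
events = [{"pts": [3.0, 12.0, 45.0]}, {"pts": [25.0, 8.0]}]

for alg in algs:
    alg.initialize()
for event in events:
    for alg in algs:
        alg.execute(event)
for alg in algs:
    alg.finalize()

print(events[0]["selected"], events[1]["selected"])  # [45.0] [25.0]
```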

David Quarrie: The ATLAS Experiment33

Data Access Model

• StoreGate provides the blackboard (see the sketch below)
– Algorithms register data and downstream Algorithms retrieve it
– Multiple instances for different lifetimes
– Manages transient/persistent conversion

• Handles user-defined types
– Most objects (STL assignable) can be registered & retrieved
– Keyed on (store, type, key) for multiple object instances
– Optionally locks objects once registered to prevent modification
– Provides iterators for wildcard retrieval

• Manages object ownership

• Flexible container management
– Value containers (container owns objects)
– View containers (support polymorphism)

• Inter-object links to support persistency
– Support deferred access
  • A referenced object isn’t read from disk until the link is traversed
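A sketch of two of the StoreGate behaviours called out above: objects keyed on (type, key) so several instances of the same type can coexist, optional locking once recorded, and wildcard retrieval by type. Purely illustrative; the real StoreGate is a C++ component with a much richer interface.

```python
# Illustrative sketch of StoreGate-like behaviours described above:
# (type, key) lookup, optional locking, and wildcard retrieval by type.
# Not the real StoreGate interface, which is C++.

class ToyStore:
    def __init__(self):
        self._objects = {}      # (type, key) -> object
        self._locked = set()

    def record(self, obj, key, lock=False):
        self._objects[(type(obj), key)] = obj
        if lock:
            self._locked.add((type(obj), key))

    def retrieve(self, cls, key):
        return self._objects[(cls, key)]

    def is_locked(self, cls, key):
        return (cls, key) in self._locked

    def retrieve_all(self, cls):
        """Wildcard retrieval: iterate over every instance of a given type."""
        return [o for (t, k), o in self._objects.items() if t is cls]


class TrackCollection(list):
    pass


store = ToyStore()
store.record(TrackCollection([1, 2, 3]), key="InDetTracks", lock=True)
store.record(TrackCollection([4]), key="MuonTracks")

print(store.retrieve(TrackCollection, "InDetTracks"))   # [1, 2, 3]
print(store.is_locked(TrackCollection, "InDetTracks"))  # True
print(len(store.retrieve_all(TrackCollection)))         # 2
```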

David Quarrie: The ATLAS Experiment34

Scripting

• Python is used as both the configuration and the interactive scripting language (see the sketch below)

• Python bindings to C++ components are provided using an introspection dictionary with both C++ and Python APIs
– Database populated by parsing C++ header files using gccxml
– API based on that proposed for the C++ language standard
– Database and API also used for persistifying data objects

• Athena jobs are configured by specifying the set of Algorithms & Services that are needed, as well as their Property overrides
– A history service records the configuration and can be used for “playback”

• Initially Python was used for simple “data cards”; it is now being used as a true OO language in order to simplify the user interface
– Python objects map onto a sequence of C++ components, not just one-to-one
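A hedged sketch of what a Python job configuration of the kind described above looks like: instantiate the Algorithms and Services you want, override their Properties, and hand the sequence to the application manager. The class and property names here are hypothetical stand-ins, not the actual ATLAS jobOptions API.

```python
# Hypothetical sketch of a Python job configuration ("jobOptions"):
# pick the Algorithms/Services, override their Properties, run.
# Class and property names are invented stand-ins, not the real ATLAS API.

class Component:
    def __init__(self, name, **properties):
        self.name = name
        for key, value in properties.items():
            setattr(self, key, value)    # Property overrides become attributes


class ApplicationMgr:
    def __init__(self, algorithms, services, events):
        self.algorithms, self.services, self.events = algorithms, services, events

    def run(self):
        print(f"Running {self.events} events with "
              f"{[a.name for a in self.algorithms]} and "
              f"{[s.name for s in self.services]}")


# Declarative configuration; a history service of the kind described above
# could record exactly this so the job can be "played back" later.
generator = Component("PythiaGenerator", Process="HiggsToZZ", RandomSeed=12345)
simulation = Component("G4Simulation", DetectorDescription="ATLAS-DC2")
msgSvc = Component("MessageSvc", OutputLevel="INFO")

theApp = ApplicationMgr(algorithms=[generator, simulation],
                        services=[msgSvc],
                        events=10)
theApp.run()
```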

David Quarrie: The ATLAS Experiment35

Code Repository

• CVS (Concurrent Versions System)
– Subversion to be evaluated in the future

• Packages grouped hierarchically for management
– Container packages correspond both to the CVS directory structure and to logical groupings

• Secure network access
• Extensive use of authorization for commit & tag access
• Package tags to create snapshots
• A set of tagged packages can be built into a release

• Dependencies between packages are managed by the Code Management Tool (CMT)
– Ensures packages are built in the correct sequence (see the sketch below)
– Also specifies the components to be built (libraries, applications, etc.)
– Also specifies dependencies on external packages

• ~40 external packages
– Gaudi Framework (Athena kernel), LCG Apps Area, event generators, Java support, online common, misc.
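The "packages built in correct sequence" requirement is essentially a topological sort of the dependency graph. A minimal sketch; the package names and dependencies below are made up, and this is only the ordering idea, not what CMT itself does internally.

```python
# Sketch: building packages in dependency order is a topological sort.
# Package names and dependencies are made up for illustration.
# Requires Python 3.9+ for graphlib.
from graphlib import TopologicalSorter

# package -> set of packages it depends on ("use" statements, in CMT terms)
dependencies = {
    "AtlasPolicy": set(),
    "GaudiKernel": {"AtlasPolicy"},
    "StoreGate": {"GaudiKernel"},
    "EventInfo": {"GaudiKernel"},
    "Reconstruction": {"StoreGate", "EventInfo"},
}

build_order = list(TopologicalSorter(dependencies).static_order())
print(build_order)
# e.g. ['AtlasPolicy', 'GaudiKernel', 'StoreGate', 'EventInfo', 'Reconstruction']
```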

David Quarrie: The ATLAS Experiment36

Nightly Releases

• Complete software built every night on several platforms
– Primarily Enterprise Linux 3 (RH7.3 just terminated)
– More platforms (including AMD-64 and Mac OS X) underway

• Partial regression/unit tests
• Problem reports emailed automatically to developers
• 7 copies rotated so each lasts one week
– Allows more time for developers to fix problems

• Takes ~20 hours for a full release (performed once per week)
– Incremental builds are used on the other days

• Prototyping parallel builds
– Package-level parallelism using the CMT build tool
  • Takes advantage of multi-CPU computers
– File-level parallelism using the distcc compiler
  • Uses a master with several slaves
– Initial testing shows a x3 speed-up using file-level parallelism only

• Decomposition into multiple projects underway
– Reduced build time per project
– Better control over dependencies
– More complicated build management

David Quarrie: The ATLAS Experiment37

Release Hierarchy

• Developer Releases
– Every 3-4 weeks
– Subject to more management prior to the build
– Full regression tests
– Normally no attempt to fix problems after the build is completed

• Production Releases
– 2-3 times per year, synchronized with major milestones
– Strict management (tag-approval) control
– Full regression tests
– Iteration until immediate problems are fixed

• Bug-fix Releases
– In case of problems discovered after extended use
– Sometimes multiple bug-fix releases are necessary

David Quarrie: The ATLAS Experiment38

Release Management: Tag Collector

• Web-based API for specifying package versions within a release
• A release consists of a consistent set of packages & versions
• The Tag Collector manages access rights
• Auto-generates dependencies for container packages
– Packages specifying a group of child packages, e.g. Reconstruction

• Manages the release sequence for the decomposition into projects
• Supports parallel development
– Some development in the primary branch
– Other development in the bug-fix branch

David Quarrie: The ATLAS Experiment39

Tag Collector

David Quarrie: The ATLAS Experiment40

NICOS

• System to manage primarily the nightly builds
– Also builds summary web pages for other release builds

• Performs CVS checkout, release builds and submission of automated tests, and parses logfiles for errors
– Sends emails to developers if errors are detected

• Web-based browser to allow problems to be examined
• Generates a web page per release

David Quarrie: The ATLAS Experiment41

David Quarrie: The ATLAS Experiment42

Reality Checks

• Major milestones to test software and computing operations
– Stress tests: 1-2 per year

• Data Challenges
– Production and processing of large simulated data samples
– Every 12-18 months

• Physics Workshops
– Every 18-24 months
– Major emphasis is exposure to, and feedback from, the physics community

• Test beams
– Early use of the offline software with real data, using a “vertical slice” of the detectors and TDAQ hardware

David Quarrie: The ATLAS Experiment43

Data Challenges

• ATLAS has had 3 Data Challenges so far
• Most recent (DC2) in the 2nd half of 2004
– First large-scale use of the new C++ software
  • Full Athena-based processing chain
  • Geant4 simulation engine
  • New persistency mechanisms for event and time-varying data
– Validate the computing model
– Perform a 10% test of Tier-0 (descoped)
  • Pseudo real-time first-pass processing of raw data
  • Original scale 10^7 events; descoped to 10^6 events because of delays
– World-wide production
  • Using 3 Grid flavours (Grid3, LCG, NorduGrid)

David Quarrie: The ATLAS Experiment44

Physics Workshops

• Every 18-24 months
• The Rome Workshop was held earlier this month
• 450 physicists attended
– ~25% of ATLAS
– A reminder that the software is not the end product

• World-wide production
– Used some of the lessons learned from DC2
– Used the expected ATLAS turn-on detector configuration
– 8 × 10^6 events processed

• Important feedback on software usability and technical performance as well as physics performance

• Some problems, but overall in pretty good shape
– Software performance was a side comment in the physics talks, not a major limiting factor

Towards the complete experiment: ATLAS combined test beam in 2004

Full “vertical slice” of ATLAS tested on CERN H8 beam line May-November 2004

(Geant4 simulation of the test-beam set-up)

For the first time, all ATLAS sub-detectors were integrated and run together with a common DAQ, “final” electronics, slow-control, etc. A lot of global operation experience was gained during the ~6 month run. Common ATLAS software was used to analyze the data.

David Quarrie: The ATLAS Experiment

Test Beam

• Vertical detector slice (every detector subsystem represented)
• Use of prototype TDAQ hardware & software
• Use of Offline Software in the Trigger and for Monitoring
– Also online calibrations

• Test of the ability of the software to deal with non-standard geometries
– Geometry versioning management
– Non vertex-pointing tracking
  • Important later for commissioning with cosmics

• Test of reconstruction in a non-standard magnetic field
• Exercise conditions database prototypes
• Exercise mis-alignment determination and correction
• Exercise data management software
• Exercise development & release infrastructure
– Rapid turn-around but also robust


David Quarrie: The ATLAS Experiment

The Computing Model

(Diagram: dataflow from the Event Builder through the Event Filter and Tier-0 at CERN out to the Tier-1 and Tier-2 centres)

• Detector → Event Builder at ~PB/sec; Event Builder → Event Filter (~7.5 MSI2k) at ~10 GB/sec
• Tier-0 (~5 MSI2k, Castor MSS): receives raw data at 320 MB/sec; ~5 PB/year; no simulation; ships ~75 MB/s per Tier-1 for ATLAS over ≥622 Mb/s links
• 10 Tier-1 regional centres (e.g. the UK Regional Centre at RAL, US, French and Dutch Regional Centres): ~2 MSI2k and ~2 PB/year each, with MSS; they reprocess data, house simulation and host group analysis
• Tier-2 centres (~200 kSI2k and ~200 TB/year each; e.g. a Northern Tier of Sheffield, Manchester, Liverpool and Lancaster, ~0.25 TIPS): each has ~20 physicists working on one or more channels, holds the full AOD, TAG & relevant Physics Group summary data, and does the bulk of the simulation
• Some data for calibration and monitoring go to the institutes; calibrations flow back
• Desktops/workstations connect to a physics data cache over 100-1000 MB/s links; PC (2004) = ~1 kSpecInt2k

David Quarrie: The ATLAS Experiment48

Service monitoring

Grid3 – participating sites

Sep 04: 30 sites, multi-VO, shared resources, ~3000 CPUs (shared)

David Quarrie: The ATLAS Experiment49

NorduGrid & Co. participating sites (site, ~# CPUs, ~% dedicated):

1 atlas.hpc.unimelb.edu.au 28 30%

2 genghis.hpc.unimelb.edu.au 90 20%

3 charm.hpc.unimelb.edu.au 20 100%

4 lheppc10.unibe.ch 12 100%

5 lxsrv9.lrz-muenchen.de 234 5%

6 atlas.fzk.de 884 5%

7 morpheus.dcgc.dk 18 100%

8 lscf.nbi.dk 32 50%

9 benedict.aau.dk 46 90%

10 fe10.dcsc.sdu.dk 644 1%

11 grid.uio.no 40 100%

12 fire.ii.uib.no 58 50%

13 grid.fi.uib.no 4 100%

14 hypatia.uio.no 100 60%

15 sigrid.lunarc.lu.se 100 30%

16 sg-access.pdc.kth.se 100 30%

17 hagrid.it.uu.se 100 30%

18 bluesmoke.nsc.liu.se 100 30%

19 ingrid.hpc2n.umu.se 100 30%

20 farm.hep.lu.se 60 60%

21 hive.unicc.chalmers.se 100 30%

22 brenta.ijs.si 50 100%

Totals:
• 7 countries
• 22 sites
• ~3000 CPUs (dedicated ~600)
• 7 Storage Services (in RLS), a few more storage facilities, ~12 TB
• ~1 FTE (1-3 persons) in charge of production
– 2-3 executor instances

David Quarrie: The ATLAS Experiment50

Production on 3 Grids

David Quarrie: The ATLAS Experiment

(Map legend: countries providing resources; countries anticipating joining)

In LCG-2: 139 sites, 32 countries, ~14,000 CPUs, ~5 PB storage

Includes non-EGEE sites:
• 9 countries
• 18 sites

LCG Computing Resources: May 2005

Number of sites is already at the scale expected for LHC

- demonstrates the full complexity of operations

David Quarrie: The ATLAS Experiment52

ATLAS Production system

(Diagram: the ATLAS production system. A central production database (prodDB) and the Don Quijote data management system (dms), together with the AMI database and the RLS replica catalogues, serve the supervisor instances ("super", Windmill), which communicate via jabber/soap with per-system executors — Lexor for LCG, Dulcinea for NorduGrid, Capone for Grid3, plus an LSF executor — that submit the jobs to the LCG, NorduGrid, Grid3 and LSF back-ends)

David Quarrie: The ATLAS Experiment53

Jobs on Grid3

(Pie chart of the share of jobs per site as of 30 November 2004: 19 sites, ~93,000 jobs. Sites: ANL_HEP, BNL_ATLAS, BU_ATLAS_Tier2, CalTech_PG, FNAL_CMS, IU_ATLAS_Tier2, OU_OSCER, PDSF, PSU_Grid3, Rice_Grid3, SMU_Physics_Cluster, UBuffalo_CCR, UCSanDiego_PG, UC_ATLAS_Tier2, UFlorida_PG, UM_ATLAS, UNM_HPC, UTA_dpcc, UWMadison)

David Quarrie: The ATLAS Experiment54

Job Success Rate on GRID3

Month       Finished   Failed   Success Rate
July            8799     6676   57%
August         17083     9448   64%
September      17283     7717   69%
October        26600     5186   84%
November       21869     5038   81%
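For reference, the success-rate column is just finished / (finished + failed); a quick check of two of the rows above.

```python
# Quick check that the table's success rate is finished / (finished + failed).
rows = {"July": (8799, 6676), "October": (26600, 5186)}
for month, (finished, failed) in rows.items():
    print(f"{month}: {finished / (finished + failed):.0%}")
# July: 57%, October: 84% -- matching the table
```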

David Quarrie: The ATLAS Experiment55

Jobs Total

(Pie chart of the share of jobs per site across all three Grids as of 30 November 2004: 69 sites, ~276,000 jobs. LCG sites: at.uibk, ca.alberta, ca.montreal, ca.toronto, ca.triumf, ch.cern, cz.cesnet, cz.golias, de.fzk, es.ifae, es.ific, es.uam, fr.in2p3, it.cnaf, it.lnf, it.lnl, it.mi, it.na, it.roma, it.to, jp.icepp, nl.nikhef, pl.zeus, tw.sinica, uk.cam, uk.lancs, uk.man, uk.pp.ic, uk.rl, uk.shef, uk.ucl. NorduGrid sites: au.melbourne, ch.unibe, de.fzk, de.lrz-muenchen, dk.aau, dk.dcgc, dk.nbi, dk.sdu, no.uib, no.grid.uio, no.hypatia.uio, se.hoc2n.umu, se.it.uu, se.lu, se.lunarc, se.nsc, se.pdc, se.unicc.chalmers, si.ijs. Grid3 sites: ANL_HEP, BNL_ATLAS, BU_ATLAS_Tier2, CalTech_PG, FNAL_CMS, IU_ATLAS_Tier2, OU_OSCER, PDSF, PSU_Grid3, Rice_Grid3, SMU_Physics_Cluster, UBuffalo_CCR, UCSanDiego_PG, UC_ATLAS_Tier2, UFlorida_PG, UM_ATLAS, UNM_HPC, UTA_dpcc, UWMadison)

David Quarrie: The ATLAS Experiment56

Production Rate on 3 Grids

David Quarrie: The ATLAS Experiment57

Production Efficiency

(Plot: per-task production efficiency, 0-100%, for task IDs 38181-38498)

The efficiency depends on many factors. GRID3 was used for most of the testing for the Rome production. NorduGrid had a personnel change between DC2 and Rome.

David Quarrie: The ATLAS Experiment58

Feedback from Grid Deployment

• Simulation software very stable
– E.g. no failures in 35k jobs over 3.5M events

• Major failure modes were access to input data files or failure to register output files

• Retry mechanisms put into place helped significantly (a generic sketch follows below)
– Some “good news, bad news” stories

• Error recovery is (obviously) harder than error detection
– Production management components had to be redesigned in some places to provide adequate error recovery

• Still a very manpower-intensive activity
• The scale in number of sites/nodes has already been reached for ATLAS turn-on
• A 2nd generation of production tools is being worked on
• The next generation of Grid middleware is also being deployed
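The retry mechanisms mentioned above are essentially a wrapper with bounded retries (and ideally backoff) around the flaky operations, i.e. staging input files and registering outputs. A generic, hypothetical sketch of the idea only, not the actual ATLAS production-system code.

```python
# Generic sketch of a bounded-retry wrapper of the kind described above,
# around flaky operations such as staging inputs or registering outputs.
# Hypothetical illustration, not the actual ATLAS production-system code.
import random
import time


def with_retries(operation, attempts=3, initial_delay=1.0):
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except OSError as err:             # e.g. transient storage/catalogue error
            if attempt == attempts:
                raise                      # error recovery failed: report upstream
            print(f"attempt {attempt} failed ({err}); retrying in {delay:.0f}s")
            time.sleep(delay)
            delay *= 2                     # exponential backoff


def register_output_file():
    """Stand-in for a flaky catalogue registration (fails ~50% of the time)."""
    if random.random() < 0.5:
        raise OSError("replica catalogue timeout")
    return "registered"


print(with_retries(register_output_file))
```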

David Quarrie: The ATLAS Experiment59

Computing System Commissioning

• Starts early in 2006 and runs through to experiment turn-on in mid 2007
• Detailed planning has just started
• 8 major sub-system tests
– Full software chain
– Tier-0 scaling
  • Pseudo-real-time processing of data from the Event Filter
  • Goal is <5 day latency
– Calibration & Alignment
– Trigger Integration & Monitoring
– Distributed Data Management
– Distributed Physics Analysis
– Distributed Production
– TDAQ/Offline full chain

• Completion of these corresponds to ATLAS turn-on

David Quarrie: The ATLAS Experiment60

CSC Acceptance Tests

• Detailed set of acceptance criteria for each test
• Incorporated into automated tests
• Establish functionality, technical performance and physics performance thresholds

• E.g. acceptance criteria for the Full Software Chain:
– Validation of the output of each stage by the ability to read it at the next stage
– Non-recoverable error rates
– Event processing times (see the sketch below)
  • Nominal 100 kSI2k·sec/event for simulation; currently x2-8 slower, but development is ongoing to meet the goal
  • Nominal 15 kSI2k·sec/event for reconstruction; currently x2 slower, and again a strategy is in place to meet the goal
– Memory usage: <1 GB
– Job startup time
– Etc.
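The kSI2k·sec/event budgets above translate into wall-clock time once the speed of the machine running the job is fixed. A sketch using the "PC (2004) = ~1 kSpecInt2k" rule of thumb from the computing-model slide; the slowdown factors are those quoted above.

```python
# Sketch: converting the CSC CPU budgets (kSI2k * sec per event) into
# wall-clock time per event on a machine of a given speed.
# Uses the "PC (2004) = ~1 kSpecInt2k" rule of thumb from the computing model.

budgets = {"simulation": 100.0, "reconstruction": 15.0}            # kSI2k * s / event
current_slowdown = {"simulation": (2, 8), "reconstruction": (2, 2)}  # from the slide
cpu_power_ksi2k = 1.0                                              # one 2004-era PC

for stage, budget in budgets.items():
    nominal = budget / cpu_power_ksi2k
    lo, hi = current_slowdown[stage]
    print(f"{stage}: nominal {nominal:.0f} s/event; "
          f"currently ~{lo * nominal:.0f}-{hi * nominal:.0f} s/event")
```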

David Quarrie: The ATLAS Experiment

Main Concerns

• Ability to deal with a moving, inefficient detector
• Applying the lessons learned from the Rome Workshop to physics analysis
• Performance
– x2 required on reconstruction
– x4-8 on simulation

• Improving ease of use
– Distributed user support

• Establishing Tier-0 production
• Grid production robustness
• Coping with parallel detector commissioning activities
• Establishing operations teams
– Shift crews plus long-term management staff

• Migrating from mode where emphasis is on rapid software development to one where emphasis is on robustness and validation


David Quarrie: The ATLAS Experiment

Overall summary installation schedule, version 7.0 (new baseline approved in the February 2005 ATLAS EB)

David Quarrie: The ATLAS Experiment63

NERSC HENPC Group

• Mixture of staff scientists, computer software engineers and post-docs (11 in total)
– Mainly with degrees in physics, but with subsequent training and experience in computer science and software engineering

• Provide computing systems for large HEP and Nuclear Science experiments
• Leadership and architectural roles as well as core development
• Generate an institutional knowledge base
• Leverage the coupling between NERSC and Physical Sciences at LBNL

• 5 current projects
– ATLAS, BaBar, IceCube, Majorana, SNAP

• 5 members currently working on ATLAS (with a new post-doc hire soon)
– Paolo Calafiura, (Chris Day), Charles Leggett, Wim Lavrijsen, Massimo Marino, David Quarrie, (Craig Tull)

David Quarrie: The ATLAS Experiment64

HENPC Group ATLAS Responsibilities

• Software Project Management
• Chief Architect
• Core Services Management within the Software Project
• Athena Framework
• Data Access Model (StoreGate & EDM kernel)
• Scripting & Interactivity
• Aspects of introspection
• Histogram, N-tuple, History & IOV Services
• Pile-up & Event Mixing frameworks
• Aspects of the release build infrastructure
• Usability Task Force
• Tutorials & consultancy
• Performance & diagnostic tools
• Etc.

David Quarrie: The ATLAS Experiment65

Summary

• The ATLAS experiment is highly complex
– Multiple dimensions of scale
  • Large number of detector channels, high data rate
  • Size and geographical dispersion of the collaboration
  • Large developer base and large user base

• Long timescale
– Plan for evolution

• Many problems are sociological rather than technical
– Emphasis on enabling technologies and automated tests

• Good synergy between physicists, computer scientists and software engineers is essential
– LBNL & NERSC are a good example of this

• Extensive stress tests are underway and planned prior to startup
– Feedback from the most recent ones shows that we’re on track

• Aside: ATLAS will be my 8th experiment turn-on
– Each one has been worse than the previous, despite the additional experience gained

David Quarrie: The ATLAS Experiment

Additional Material


David Quarrie: The ATLAS Experiment67

This is what we do