The DAQ/HLT system of the ATLAS experiment
André Anjos
University of Wisconsin/Madison, 05 November 2008
On behalf of the ATLAS TDAQ collaboration
Design characteristics
- Triggering is done in 3 levels
- High-level triggers (L2 and EF) implemented in software; L2 is RoI-based
- HLT/Offline software share components
- Event size is about 1.6 MB
- Level-1 operates at 40 MHz
- Level-2 operates at 100 kHz
- Event Filter operates at ~3 kHz
- Storage records at ~200 Hz
Detailed Overview
[Diagram: Trigger/DAQ dataflow. Level-1 (2.5 µs latency) reduces the 40 MHz input to a 100 kHz accept rate; the Read-Out Systems (RODs/ROBs, ROS) absorb 160 GB/s. RoI-based Level-2 (RoIB/L2SV, L2Ps; O(10) ms) fetches RoI data (~2% of the event) via RoI requests and accepts ~3 kHz. The Event Builder (DFM, SFIs over the DCN; ~3+5 GB/s) feeds the Event Filter (EFPs on the EFN; O(1) s), which accepts ~0.2 kHz and pushes ~200 Hz (~300 MB/s) to the SFOs.]
- Heavily multi-threaded DAQ applications
- Event processing is parallelized through multi-processing (MP)
- Commodity computing and networking
Testing the Dataflow
Make sure the current system can handle:
- High rates
- Oscillations
- Unforeseen problems (crashes, timeouts)
Testing conditions:
- HLT loaded with a 10^31 menu
- Mixed sample of simulated data (background + signal)
- 4 L2 supervisors
- 2880 L2PUs (70% of the final L2 farm size)
- 94 SFIs
- 310 EFDs + 2480 PTs (~20% of the final system)
Level-2
- Able to sustain 60 kHz through the system
- Able to handle unforeseen events
- Timings for event processing at specification
- 1 second timeout, so very few lost messages
- Almost 60 kHz into L2: 80% of the design rate
Event Building
[Plots: aggregated Event Builder bandwidth (MB/s) and Event Builder rate (Hz).]
LVL2-driven EB: 4.2 kHz (3.5 kHz); small event size of 800 kB (1.6 MB); throughput ~3.5 GB/s (5 GB/s)
Limited by Event Filter capacity: only using ~20% of the final farm
Event Building Performance
[Plot: building-only performance, extrapolated, vs. predicted EB+EF performance degradation; installed and available EF bandwidth indicated, with markers for the design point, the 10^31 menu testing, and Cosmics '08.]
Event Filter
[Diagram: 10^31 menu. SFIs deliver events at ~3 kHz to one EFD per machine; PTs attach to the EFD; accepted events go to the SFOs at ~200 Hz.]
- Multi-process approach: 1 EFD per machine, multiple PTs
- EFD/PT: data communication through shared heaps (see the sketch below)
- "Quasi-offline" reconstruction, seeded by L2
- SFIs work as "event servers": deliver data at 3 kHz
- Data is pushed to the SFOs at ~200 Hz
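A minimal sketch of data exchange over a shared heap, in the spirit of the EFD/PT design, assuming POSIX shared memory (kShmName and EventHeader are illustrative, not the actual EFD/PT interface):

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstdint>
#include <cstring>

struct EventHeader {    // hypothetical fixed-size event descriptor
    uint32_t event_id;
    uint32_t size;      // payload size in bytes
};

int main() {
    const char* kShmName = "/efd_pt_heap";  // illustrative name
    const size_t kHeapSize = 1 << 20;       // 1 MiB demo heap

    // "EFD" side: create the shared heap and publish one event into it.
    int fd = shm_open(kShmName, O_CREAT | O_RDWR, 0600);
    if (fd < 0 || ftruncate(fd, kHeapSize) != 0) return 1;
    void* heap = mmap(nullptr, kHeapSize, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
    if (heap == MAP_FAILED) return 1;

    auto* hdr = static_cast<EventHeader*>(heap);
    hdr->event_id = 42;
    hdr->size = 5;
    std::memcpy(hdr + 1, "hello", hdr->size);  // event payload

    // A PT process mmap()ing the same name sees this event without any copy;
    // real code would add reference counting and a notification channel.
    munmap(heap, kHeapSize);
    close(fd);
    shm_unlink(kShmName);
    return 0;
}
```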
Event Storage
ATLAS throughput since August: ~0.8 PB
- Final stage of TDAQ: data is written into files and asynchronously transferred to mass storage
- Streaming capabilities (express, physics, calibration, ...)
- Farm of 5 nodes with a total storage area of 50 TB, RAID-5
- Provides a sustained I/O rate of 550 MB/s; peak rate > 700 MB/s (target is 300 MB/s)
- Absorbs fluctuations and spikes; hot-spare capabilities
- Fast recovery in case of mass storage failure
- A component regularly used at its design specifications
Transferring about 1 TB/hour during cosmics data taking
Cosmics data taking
- 216 million events, with an average size of 2.1 MB = 453 TB
- 400,000 files
- HLT+DAQ problems tagged in about 2.5% of the total events (timeouts or crashes)
- 21 inclusive streams
ATLAS HLT & Multi-core
End of the "frequency-scaling era": more parallelism is needed to achieve the expected throughput.
- Event parallelism is inherent to typical high-energy-physics selection and reconstruction programs
- ATLAS has a large code base, mostly written and designed in the "pre-multi-core era"
Baseline design for HLT:
- Event Filter: multi-processing since the beginning; at 1-2 s processing time, needs 6000 cores to achieve 3 kHz
- Level-2: multi-threading (but may fall back to multi-processing); at 40 ms processing time, needs 4000 cores to achieve 100 kHz
We have explored both multi-threading and multiple processes; the core counts above follow from the arithmetic sketched below.
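The quoted farm sizes are simply the input rate multiplied by the per-event processing time, using the numbers on this slide:

```latex
N_{\mathrm{cores}} \approx R_{\mathrm{in}} \times t_{\mathrm{proc}}:
\qquad \mathrm{EF:}\ 3\,\mathrm{kHz} \times 2\,\mathrm{s} = 6000,
\qquad \mathrm{L2:}\ 100\,\mathrm{kHz} \times 40\,\mathrm{ms} = 4000
```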
Level-2 & Multi-processing
When we started, we understood that MP means:
- Longer context switches
- Special data transfer mechanisms
- More applications to control
- More clients to configure
- More resources to monitor
- A more chaotic dataflow
- (Though most tools were already in place)
Apparently, lots of problems in many places!
Level-2 & Multi-threading
- Multi-threading allows sharing the application space
- The evident way to solve most of the problems mentioned before
- Only problem: making the HLT code thread-safe (offline components are shared)
But: MT is not only about safety, it is also about efficiency! (See the sketch below.)
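One way to see the safety-vs-efficiency distinction (an illustration only, not ATLAS code; all names are invented): a coarse global lock makes shared state thread-safe, yet it serializes the workers, so extra threads buy no throughput.

```cpp
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical shared lookup table, like a calibration held by an offline component.
std::vector<double> calibration(1024, 1.0);
std::mutex calib_mutex;  // coarse lock: safe, but a serialization point

double process_event(int seed) {
    // Every worker serializes here although concurrent reads would be fine:
    // thread-SAFE, but not thread-EFFICIENT.
    std::lock_guard<std::mutex> lock(calib_mutex);
    return calibration[seed % calibration.size()] * seed;
}

int main() {
    std::vector<std::thread> workers;
    for (int t = 0; t < 3; ++t)                   // 3 worker threads, as in the L2PU
        workers.emplace_back([t] {
            for (int ev = 0; ev < 100000; ++ev)
                process_event(t * 100000 + ev);   // events queue up on the lock
        });
    for (auto& w : workers) w.join();
    return 0;
}
```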
Performance bottlenecks
- Level-2 is thread-safe, but not efficient
- At first, "nobody's" direct fault!
- What is the real issue?
[Timeline plot: L2PU with 3 worker threads, showing initialization, event processing, and periods where the worker threads are blocked.]
Since we are importing code, it is difficult to keep track of what is "correctly" coded and what is not. From release to release, something else broke or became less efficient.
And still…
[Plot: trigger rate (Hz) as a function of the number of applications per node, from 0 to 18 processes, measured on one machine with 8 cores in total.]
- ATLAS has a large code base, mostly written and designed in the "pre-multi-core era": which other packages hide "goodies"?
- Synchronization problems are not fun to debug
- How do we model software development so that our hundreds of developers can understand it and code efficiently?
- Current trends in OS development show improved context-switching times and more tools for inter-process synchronization
- What if 1 thread crashes?
- MP performance is almost identical to MT, and the EF baseline is MP! (A sketch follows this list.)
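A minimal sketch of the multi-process alternative (illustrative only; process_events stands in for the real selection code): fork one worker per core, each processing events in its own address space, so a crash in one worker cannot take down the others.

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>
#include <cstdlib>

// Hypothetical per-event work; stands in for HLT selection code.
static void process_events(int worker_id) {
    for (int ev = 0; ev < 1000; ++ev) {
        // ... fetch event data, run the selection, report the decision ...
    }
    std::printf("worker %d done\n", worker_id);
}

int main() {
    const int kWorkers = 8;             // e.g. one per core on an 8-core node
    for (int w = 0; w < kWorkers; ++w) {
        pid_t pid = fork();
        if (pid == 0) {                 // child: independent address space
            process_events(w);          // a crash here kills only this worker
            _exit(EXIT_SUCCESS);
        }
    }
    // Parent: reap the workers; a non-zero status flags a crashed one.
    int status = 0;
    while (wait(&status) > 0) { /* could respawn failed workers here */ }
    return 0;
}
```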
Summary & Outlook on MP for HLT
- Multi-threading, despite being more powerful, lacks support tools and specialized developers
- The base offline infrastructure was created in the "pre-multi-core era": MT efficiency is difficult in our case...
- The TDAQ infrastructure is proven to work when using MP for HLT
- Event processing scales well using MP
- Techniques being investigated for sharing immutable (constant) data:
  - a common shared memory block
  - OS fork() + copy-on-write (sketched below)
- Importance understood, R&D being set up at CERN:
  http://indico.cern.ch/conferenceDisplay.py?confId=28823
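A minimal sketch of the fork() + copy-on-write option (illustrative, not ATLAS code): the parent builds the large constant data once, and children forked afterwards share those physical pages until something writes to them.

```cpp
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>
#include <vector>

int main() {
    // Parent initializes a large immutable block once
    // (think geometry or calibration constants).
    std::vector<double> constants(50000000, 1.0);  // ~400 MB

    for (int w = 0; w < 4; ++w) {
        if (fork() == 0) {
            // Child: reads share the parent's physical pages (copy-on-write),
            // so 4 workers do NOT cost 4 x 400 MB of RAM.
            double sum = 0;
            for (double c : constants) sum += c;
            std::printf("worker %d checksum %.0f\n", w, sum);
            _exit(0);
        }
    }
    while (wait(nullptr) > 0) {}  // parent reaps the workers
    return 0;
}
```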
Conclusions
It works:
- RoI mechanism
- 3-level trigger
- Highly distributed system
- HLT is currently MP-based
With simulated data:
- L2 (at 70% of its final size) can sustain 60 kHz over many hours
- EB can sustain the design rate comfortably
- Processing times for HLT are within the design range
With cosmics data:
- Took nearly 1 petabyte of detector data
- Stable operation for months
We are ready to take new physics in 2009!
First event from beam
First day of LHC operations: detection of the beam dump at a collimator near ATLAS.
Backup
Level-2
- L2-ROS communication is currently over UDP (see the sketch after the diagram)
- 2% * 1.6 MB * 100 kHz = 3.2 GB/s (small!)
- The ROS is designed to stand a maximum 30 kHz hit rate
- Each ROS is connected to a fixed detector location: "hot ROS" effect
[Diagram: Level-1 Trigger → RoI Builder → L2SVs → L2PUs, connected through the DC network to the ROSes of the Readout System (PIX, SCT, TRT, LAR, TIL, MUON) with a fixed detector mapping; rate labels of 100 kHz, 10 kHz and 100 Hz.]
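A minimal sketch of an RoI data request over UDP (illustrative only; RoIRequest and the endpoint are invented, and the real L2-ROS message format differs):

```cpp
#include <arpa/inet.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdint>

// Hypothetical RoI request: which event and which fragment to read.
struct RoIRequest {
    uint32_t l1_id;    // Level-1 event identifier
    uint32_t rob_id;   // which read-out buffer fragment is wanted
};

int main() {
    int sock = socket(AF_INET, SOCK_DGRAM, 0);  // UDP: no connection state
    if (sock < 0) return 1;

    sockaddr_in ros{};                          // illustrative ROS endpoint
    ros.sin_family = AF_INET;
    ros.sin_port = htons(9000);
    inet_pton(AF_INET, "10.0.0.1", &ros.sin_addr);

    RoIRequest req{12345, 0x420000};
    sendto(sock, &req, sizeof(req), 0,
           reinterpret_cast<sockaddr*>(&ros), sizeof(ros));

    // Wait for the fragment; a 1 s timeout (as quoted for L2) would be set
    // with setsockopt(SO_RCVTIMEO) so a lost datagram only costs that event.
    char buf[65536];
    recvfrom(sock, buf, sizeof(buf), 0, nullptr, nullptr);
    close(sock);
    return 0;
}
```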
Controls & Configuration
Coordinates all the applications during data-taking:
- The first beam run included ~7000 applications on 1500 nodes
- The DAQ configuration DB accounts for ~100,000 objects
- As soon as the HLT farm is complete we expect O(20k) applications distributed over 3000 nodes
The control software operates over the control network infrastructure:
- Based on a CORBA communication library
- Decoupled from the dataflow
Some facilities provided:
- Distributed application handling and configuration
- Resource granting
- An expert system for automatic recovery mechanisms
The same HW/SW infrastructure is also exploited by the monitoring services.
Multithreading: Compiler Comparison (vector, list, string)
- gcc 2.95 is not usable: its string is not thread-safe
- Need technology tracking:
  - compilers
  - debuggers
  - performance assessment tools
Infrastructure
[Image: first beam]
- Complex (and scalable) infrastructure needed to handle the system
- File servers, boot servers, monitoring servers
- Security
- User management (e.g. roles)
- 50 infrastructure nodes already installed
- Will be > 100 in the final system
- ~1300 users allowed into the ATLAS on-line computing system
Data Acquisition Strategy
Based on three trigger levels:
- LVL1: hardware trigger
- LVL2: farm of 500 1U PCs; reconstruction within a Region of Interest (RoI) defined by LVL1
- EF: farm of 1900 1U PCs; complete event reconstruction
LVL2 and EF together form the High Level Trigger (HLT).
HLT Hardware
850 PCs installed:
- 8 cores: 2 x Intel Harpertown 2.5 GHz
- 16 GB RAM
- Single motherboard
- Cold-swappable power supply
- Network booted
- 2 on-board GbE interfaces: 1 for control and IPMI, 1 for data
- Doubly connected to the data-collection and back-end networks via VLAN (XPU): can act both as L2 and EF processors
HLT and Offline Software
[Package diagram: the HLT selection software (HLTSSW) comprises steering, monitoring service, metadata service, ROB data collector, data manager, HLT algorithms, processing task and event data model packages; it is hosted by the L2PU application and the Event Filter, and imports the offline event data model, reconstruction algorithms, and the StoreGate/Athena/Gaudi core.]
- HLT Data Flow Software
- HLT Selection Software Framework: ATHENA/GAUDI
- Reuse of offline components, common to Level-2 and EF
- Offline algorithms used in EF
Multi-threading Performance
Standard Template Library (STL) and multi-threading:
- L2PU: independent event processing in each worker thread
- The default STL memory allocation scheme for containers (a common memory pool) is inefficient for the L2PU processing model → frequent locking
- The L2PU processing model favors independent memory pools for each thread:
  - use the pthread allocator / DF_ALLOCATOR in containers (see the sketch below)
  - solution for strings: avoid them
- Needs changes in offline software and its external software:
  - insert DF_ALLOCATOR in containers
  - utility libraries need to be compiled with DF_ALLOCATOR
  - design large containers to allocate memory once and reset the data during event processing
- Evaluation of the problem with gcc 3.4 and icc 8:
  - results with simple test programs (also used to understand the original findings) indicate considerable improvement (also for strings) in the libraries shipped with the new compilers
  - inserting the special allocator in offline code may be unnecessary when the new compilers are used
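A minimal sketch of the per-thread-pool idea behind DF_ALLOCATOR (the real ATLAS allocator differs; PerThreadAllocator, the arena size and reset_arena are invented here): each thread bump-allocates from its own thread_local arena, so container growth never contends on a shared heap lock, and memory is reclaimed wholesale between events.

```cpp
#include <cstddef>
#include <new>
#include <vector>

namespace sketch {

constexpr std::size_t kArenaBytes = 1 << 20;                // 1 MiB demo arena
alignas(16) thread_local unsigned char arena[kArenaBytes];  // one arena per thread
thread_local std::size_t arena_top = 0;                     // bump pointer

template <typename T>
struct PerThreadAllocator {
    using value_type = T;
    PerThreadAllocator() = default;
    template <typename U> PerThreadAllocator(const PerThreadAllocator<U>&) {}

    T* allocate(std::size_t n) {
        std::size_t bytes = (n * sizeof(T) + 15) & ~std::size_t(15);  // keep 16-byte alignment
        if (arena_top + bytes > kArenaBytes) throw std::bad_alloc();
        T* p = reinterpret_cast<T*>(arena + arena_top);
        arena_top += bytes;
        return p;
    }
    // Bump allocator: memory is reclaimed wholesale by reset_arena(),
    // typically between events, not per deallocation.
    void deallocate(T*, std::size_t) {}
};

inline void reset_arena() { arena_top = 0; }  // call between events

template <typename T, typename U>
bool operator==(const PerThreadAllocator<T>&, const PerThreadAllocator<U>&) { return true; }
template <typename T, typename U>
bool operator!=(const PerThreadAllocator<T>&, const PerThreadAllocator<U>&) { return false; }

}  // namespace sketch

int main() {
    // Each worker thread builds its per-event containers from its own pool.
    std::vector<double, sketch::PerThreadAllocator<double>> hits;
    hits.reserve(256);       // allocate once ...
    hits.push_back(3.14);
    hits.clear();
    sketch::reset_arena();   // ... and reset wholesale between events
    return 0;
}
```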