The DAQ/HLT system of the ATLAS experiment
André Anjos
University of Wisconsin/Madison, 05 November 2008
On behalf of the ATLAS TDAQ collaboration
Design characteristics
- Triggering is done in 3 levels
- High-level triggers (L2 and EF) implemented in software; L2 is RoI-based
- HLT/Offline software share components
- Event size is about 1.6 MB
- Level-1 operates at 40 MHz
- Level-2 operates at 100 kHz
- Event Filter operates at ~3 kHz
- Storage records at ~200 Hz
Detailed Overview
[Diagram: Trigger/DAQ dataflow. Level-1 (2.5 µs latency) reduces the 40 MHz input to a 100 kHz accept rate; the Read-Out Systems (RODs/ROBs, ROS) absorb 160 GB/s. RoI-based Level-2 (RoIB/L2SV, L2Ps; O(10) ms) fetches RoI data (~2% of the event) via RoI requests and accepts ~3 kHz. The Event Builder (DFM, SFIs over the DCN; ~3+5 GB/s) feeds the Event Filter (EFPs on the EFN; O(1) s), which accepts ~0.2 kHz and pushes ~200 Hz (~300 MB/s) to the SFOs.]
- Heavily multi-threaded DAQ applications
- Event processing is parallelized through multi-processing (MP)
- Commodity computing and networking
Testing the Dataflow
Make sure the current system can handle:
- High rates
- Oscillations
- Unforeseen problems (crashes, timeouts)
Testing conditions:
- HLT loaded with a 10^31 menu
- Mixed sample of simulated data (background + signal)
- 4 L2 supervisors
- 2880 L2PUs (70% of the final L2 farm size)
- 94 SFIs
- 310 EFDs + 2480 PTs (~20% of the final system)
Level-2
- Able to sustain 60 kHz through the system
- Able to handle unforeseen events
- Timings for event processing at specification
- 1 second timeout, so very few lost messages
- Almost 60 kHz into L2: 80% of the design rate
Event Building
[Plots: aggregated Event Builder bandwidth (MB/s) and Event Builder rate (Hz).]
LVL2-driven EB: 4.2 kHz (3.5 kHz); small event size of 800 kB (1.6 MB); throughput ~3.5 GB/s (5 GB/s)
Limited by Event Filter capacity: only using ~20% of the final farm
Event Building Performance
[Plot: building-only performance, extrapolated, vs. predicted EB+EF performance degradation; installed and available EF bandwidth indicated, with markers for the design point, the 10^31 menu testing, and Cosmics '08.]
Event Filter
[Diagram: 10^31 menu. SFIs deliver events at ~3 kHz to one EFD per machine; PTs attach to the EFD; accepted events go to the SFOs at ~200 Hz.]
- Multi-process approach: 1 EFD per machine, multiple PTs
- EFD/PT: data communication through shared heaps (see the sketch below)
- "Quasi-offline" reconstruction, seeded by L2
- SFIs work as "event servers": deliver data at 3 kHz
- Data is pushed to the SFOs at ~200 Hz
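A minimal sketch of data exchange over a shared heap, in the spirit of the EFD/PT design, assuming POSIX shared memory (kShmName and EventHeader are illustrative, not the actual EFD/PT interface):

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstdint>
#include <cstring>

struct EventHeader {    // hypothetical fixed-size event descriptor
    uint32_t event_id;
    uint32_t size;      // payload size in bytes
};

int main() {
    const char* kShmName = "/efd_pt_heap";  // illustrative name
    const size_t kHeapSize = 1 << 20;       // 1 MiB demo heap

    // "EFD" side: create the shared heap and publish one event into it.
    int fd = shm_open(kShmName, O_CREAT | O_RDWR, 0600);
    if (fd < 0 || ftruncate(fd, kHeapSize) != 0) return 1;
    void* heap = mmap(nullptr, kHeapSize, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
    if (heap == MAP_FAILED) return 1;

    auto* hdr = static_cast<EventHeader*>(heap);
    hdr->event_id = 42;
    hdr->size = 5;
    std::memcpy(hdr + 1, "hello", hdr->size);  // event payload

    // A PT process mmap()ing the same name sees this event without any copy;
    // real code would add reference counting and a notification channel.
    munmap(heap, kHeapSize);
    close(fd);
    shm_unlink(kShmName);
    return 0;
}
```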
Event Storage
ATLAS throughput since August: ~0.8 PB
- Final stage of TDAQ: data is written into files and asynchronously transferred to mass storage
- Streaming capabilities (express, physics, calibration, ...)
- Farm of 5 nodes with a total storage area of 50 TB, RAID-5
- Provides a sustained I/O rate of 550 MB/s; peak rate > 700 MB/s (target is 300 MB/s)
- Absorbs fluctuations and spikes; hot-spare capabilities
- Fast recovery in case of mass storage failure
- A component regularly used at its design specifications
Transferring about 1 TB/hour during cosmics data taking
Cosmics data taking
- 216 million events, with an average size of 2.1 MB = 453 TB
- 400,000 files
- HLT+DAQ problems tagged in about 2.5% of the total events (timeouts or crashes)
- 21 inclusive streams
ATLAS HLT & Multi-core
End of the "frequency-scaling era": more parallelism is needed to achieve the expected throughput.
- Event parallelism is inherent to typical high-energy-physics selection and reconstruction programs
- ATLAS has a large code base, mostly written and designed in the "pre-multi-core era"
Baseline design for HLT:
- Event Filter: multi-processing since the beginning; at 1-2 s processing time, needs 6000 cores to achieve 3 kHz
- Level-2: multi-threading (but may fall back to multi-processing); at 40 ms processing time, needs 4000 cores to achieve 100 kHz
We have explored both multi-threading and multiple processes; the core counts above follow from the arithmetic sketched below.
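The quoted farm sizes are simply the input rate multiplied by the per-event processing time, using the numbers on this slide:

```latex
N_{\mathrm{cores}} \approx R_{\mathrm{in}} \times t_{\mathrm{proc}}:
\qquad \mathrm{EF:}\ 3\,\mathrm{kHz} \times 2\,\mathrm{s} = 6000,
\qquad \mathrm{L2:}\ 100\,\mathrm{kHz} \times 40\,\mathrm{ms} = 4000
```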
Level-2 & Multi-processing
When we started, we understood that MP means:
- Longer context switches
- Special data transfer mechanisms
- More applications to control
- More clients to configure
- More resources to monitor
- A more chaotic dataflow
- (Though most tools were already in place)
Apparently, lots of problems in many places!
Level-2 & Multi-threading
- Multi-threading allows sharing the application space
- The evident way to solve most of the problems mentioned before
- Only problem: making the HLT code thread-safe (offline components are shared)
But: MT is not only about safety, it is also about efficiency! (See the sketch below.)
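One way to see the safety-vs-efficiency distinction (an illustration only, not ATLAS code; all names are invented): a coarse global lock makes shared state thread-safe, yet it serializes the workers, so extra threads buy no throughput.

```cpp
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical shared lookup table, like a calibration held by an offline component.
std::vector<double> calibration(1024, 1.0);
std::mutex calib_mutex;  // coarse lock: safe, but a serialization point

double process_event(int seed) {
    // Every worker serializes here although concurrent reads would be fine:
    // thread-SAFE, but not thread-EFFICIENT.
    std::lock_guard<std::mutex> lock(calib_mutex);
    return calibration[seed % calibration.size()] * seed;
}

int main() {
    std::vector<std::thread> workers;
    for (int t = 0; t < 3; ++t)                   // 3 worker threads, as in the L2PU
        workers.emplace_back([t] {
            for (int ev = 0; ev < 100000; ++ev)
                process_event(t * 100000 + ev);   // events queue up on the lock
        });
    for (auto& w : workers) w.join();
    return 0;
}
```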
Performance bottlenecks
- Level-2 is thread-safe, but not efficient
- At first, "nobody's" direct fault!
- What is the real issue?
[Timeline plot: L2PU with 3 worker threads, showing initialization, event processing, and periods where the worker threads are blocked.]
Since we are importing code, it is difficult to keep track of what is "correctly" coded and what is not. From release to release, something else broke or became less efficient.
And still…
[Plot: trigger rate (Hz) as a function of the number of applications per node, from 0 to 18 processes, measured on one machine with 8 cores in total.]
- ATLAS has a large code base, mostly written and designed in the "pre-multi-core era": which other packages hide "goodies"?
- Synchronization problems are not fun to debug
- How do we model software development so that our hundreds of developers can understand it and code efficiently?
- Current trends in OS development show improved context-switching times and more tools for inter-process synchronization
- What if 1 thread crashes?
- MP performance is almost identical to MT, and the EF baseline is MP! (A sketch follows this list.)
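A minimal sketch of the multi-process alternative (illustrative only; process_events stands in for the real selection code): fork one worker per core, each processing events in its own address space, so a crash in one worker cannot take down the others.

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>
#include <cstdlib>

// Hypothetical per-event work; stands in for HLT selection code.
static void process_events(int worker_id) {
    for (int ev = 0; ev < 1000; ++ev) {
        // ... fetch event data, run the selection, report the decision ...
    }
    std::printf("worker %d done\n", worker_id);
}

int main() {
    const int kWorkers = 8;             // e.g. one per core on an 8-core node
    for (int w = 0; w < kWorkers; ++w) {
        pid_t pid = fork();
        if (pid == 0) {                 // child: independent address space
            process_events(w);          // a crash here kills only this worker
            _exit(EXIT_SUCCESS);
        }
    }
    // Parent: reap the workers; a non-zero status flags a crashed one.
    int status = 0;
    while (wait(&status) > 0) { /* could respawn failed workers here */ }
    return 0;
}
```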
Summary & Outlook on MP for HLT
- Multi-threading, despite being more powerful, lacks support tools and specialized developers
- The base offline infrastructure was created in the "pre-multi-core era": MT efficiency is difficult in our case...
- The TDAQ infrastructure is proven to work when using MP for HLT
- Event processing scales well using MP
- Techniques being investigated for sharing immutable (constant) data:
  - a common shared memory block
  - OS fork() + copy-on-write (sketched below)
- Importance understood, R&D being set up at CERN:
  http://indico.cern.ch/conferenceDisplay.py?confId=28823
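A minimal sketch of the fork() + copy-on-write option (illustrative, not ATLAS code): the parent builds the large constant data once, and children forked afterwards share those physical pages until something writes to them.

```cpp
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>
#include <vector>

int main() {
    // Parent initializes a large immutable block once
    // (think geometry or calibration constants).
    std::vector<double> constants(50000000, 1.0);  // ~400 MB

    for (int w = 0; w < 4; ++w) {
        if (fork() == 0) {
            // Child: reads share the parent's physical pages (copy-on-write),
            // so 4 workers do NOT cost 4 x 400 MB of RAM.
            double sum = 0;
            for (double c : constants) sum += c;
            std::printf("worker %d checksum %.0f\n", w, sum);
            _exit(0);
        }
    }
    while (wait(nullptr) > 0) {}  // parent reaps the workers
    return 0;
}
```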
Conclusions
It works:
- RoI mechanism
- 3-level trigger
- Highly distributed system
- HLT is currently MP-based
With simulated data:
- L2 (at 70% of its final size) can sustain 60 kHz over many hours
- EB can sustain the design rate comfortably
- Processing times for HLT are within the design range
With cosmics data:
- Took nearly 1 petabyte of detector data
- Stable operation for months
We are ready to take new physics in 2009!
First event from beam
First day of LHC operations: detection of the beam dump at a collimator near ATLAS.
Backup
Level-2
- L2-ROS communication is currently over UDP (see the sketch after the diagram)
- 2% * 1.6 MB * 100 kHz = 3.2 GB/s (small!)
- The ROS is designed to stand a maximum 30 kHz hit rate
- Each ROS is connected to a fixed detector location: "hot ROS" effect
[Diagram: Level-1 Trigger → RoI Builder → L2SVs → L2PUs, connected through the DC network to the ROSes of the Readout System (PIX, SCT, TRT, LAR, TIL, MUON) with a fixed detector mapping; rate labels of 100 kHz, 10 kHz and 100 Hz.]
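A minimal sketch of an RoI data request over UDP (illustrative only; RoIRequest and the endpoint are invented, and the real L2-ROS message format differs):

```cpp
#include <arpa/inet.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdint>

// Hypothetical RoI request: which event and which fragment to read.
struct RoIRequest {
    uint32_t l1_id;    // Level-1 event identifier
    uint32_t rob_id;   // which read-out buffer fragment is wanted
};

int main() {
    int sock = socket(AF_INET, SOCK_DGRAM, 0);  // UDP: no connection state
    if (sock < 0) return 1;

    sockaddr_in ros{};                          // illustrative ROS endpoint
    ros.sin_family = AF_INET;
    ros.sin_port = htons(9000);
    inet_pton(AF_INET, "10.0.0.1", &ros.sin_addr);

    RoIRequest req{12345, 0x420000};
    sendto(sock, &req, sizeof(req), 0,
           reinterpret_cast<sockaddr*>(&ros), sizeof(ros));

    // Wait for the fragment; a 1 s timeout (as quoted for L2) would be set
    // with setsockopt(SO_RCVTIMEO) so a lost datagram only costs that event.
    char buf[65536];
    recvfrom(sock, buf, sizeof(buf), 0, nullptr, nullptr);
    close(sock);
    return 0;
}
```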
Controls & Configuration
Coordinates all the applications during data-taking:
- The first beam run included ~7000 applications on 1500 nodes
- The DAQ configuration DB accounts for ~100,000 objects
- As soon as the HLT farm is complete we expect O(20k) applications distributed over 3000 nodes
The control software operates over the control network infrastructure:
- Based on a CORBA communication library
- Decoupled from the dataflow
Some facilities provided:
- Distributed application handling and configuration
- Resource granting
- An expert system for automatic recovery mechanisms
The same HW/SW infrastructure is also exploited by the monitoring services.
Multithreading: Compiler Comparison (vector, list, string)
- gcc 2.95 is not usable: its string is not thread-safe
- Need technology tracking:
  - compilers
  - debuggers
  - performance assessment tools
Infrastructure
[Image: first beam]
- Complex (and scalable) infrastructure needed to handle the system
- File servers, boot servers, monitoring servers
- Security
- User management (e.g. roles)
- 50 infrastructure nodes already installed
- Will be > 100 in the final system
- ~1300 users allowed into the ATLAS on-line computing system
Data Acquisition Strategy
Based on three trigger levels:
- LVL1: hardware trigger
- LVL2: farm of 500 1U PCs; reconstruction within a Region of Interest (RoI) defined by LVL1
- EF: farm of 1900 1U PCs; complete event reconstruction
LVL2 and EF together form the High Level Trigger (HLT).
HLT Hardware
850 PCs installed:
- 8 cores: 2 x Intel Harpertown 2.5 GHz
- 16 GB RAM
- Single motherboard
- Cold-swappable power supply
- Network booted
- 2 on-board GbE interfaces: 1 for control and IPMI, 1 for data
- Doubly connected to the data-collection and back-end networks via VLAN (XPU): can act both as L2 and EF processors
HLT and Offline Software
[Package diagram: the HLT selection software (HLTSSW) comprises steering, monitoring service, metadata service, ROB data collector, data manager, HLT algorithms, processing task and event data model packages; it is hosted by the L2PU application and the Event Filter, and imports the offline event data model, reconstruction algorithms, and the StoreGate/Athena/Gaudi core.]
- HLT Data Flow Software
- HLT Selection Software Framework: ATHENA/GAUDI
- Reuse of offline components, common to Level-2 and EF
- Offline algorithms used in EF
Multi-threading Performance
Standard Template Library (STL) and multi-threading:
- L2PU: independent event processing in each worker thread
- The default STL memory allocation scheme for containers (a common memory pool) is inefficient for the L2PU processing model → frequent locking
- The L2PU processing model favors independent memory pools for each thread:
  - use the pthread allocator / DF_ALLOCATOR in containers (see the sketch below)
  - solution for strings: avoid them
- Needs changes in offline software and its external software:
  - insert DF_ALLOCATOR in containers
  - utility libraries need to be compiled with DF_ALLOCATOR
  - design large containers to allocate memory once and reset the data during event processing
- Evaluation of the problem with gcc 3.4 and icc 8:
  - results with simple test programs (also used to understand the original findings) indicate considerable improvement (also for strings) in the libraries shipped with the new compilers
  - inserting the special allocator in offline code may be unnecessary when the new compilers are used
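A minimal sketch of the per-thread-pool idea behind DF_ALLOCATOR (the real ATLAS allocator differs; PerThreadAllocator, the arena size and reset_arena are invented here): each thread bump-allocates from its own thread_local arena, so container growth never contends on a shared heap lock, and memory is reclaimed wholesale between events.

```cpp
#include <cstddef>
#include <new>
#include <vector>

namespace sketch {

constexpr std::size_t kArenaBytes = 1 << 20;                // 1 MiB demo arena
alignas(16) thread_local unsigned char arena[kArenaBytes];  // one arena per thread
thread_local std::size_t arena_top = 0;                     // bump pointer

template <typename T>
struct PerThreadAllocator {
    using value_type = T;
    PerThreadAllocator() = default;
    template <typename U> PerThreadAllocator(const PerThreadAllocator<U>&) {}

    T* allocate(std::size_t n) {
        std::size_t bytes = (n * sizeof(T) + 15) & ~std::size_t(15);  // keep 16-byte alignment
        if (arena_top + bytes > kArenaBytes) throw std::bad_alloc();
        T* p = reinterpret_cast<T*>(arena + arena_top);
        arena_top += bytes;
        return p;
    }
    // Bump allocator: memory is reclaimed wholesale by reset_arena(),
    // typically between events, not per deallocation.
    void deallocate(T*, std::size_t) {}
};

inline void reset_arena() { arena_top = 0; }  // call between events

template <typename T, typename U>
bool operator==(const PerThreadAllocator<T>&, const PerThreadAllocator<U>&) { return true; }
template <typename T, typename U>
bool operator!=(const PerThreadAllocator<T>&, const PerThreadAllocator<U>&) { return false; }

}  // namespace sketch

int main() {
    // Each worker thread builds its per-event containers from its own pool.
    std::vector<double, sketch::PerThreadAllocator<double>> hits;
    hits.reserve(256);       // allocate once ...
    hits.push_back(3.14);
    hits.clear();
    sketch::reset_arena();   // ... and reset wholesale between events
    return 0;
}
```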