DAQ System Realization
DAQ Data Flow Review Sep. 11-12th, 2001
Niko Neufeld, CERN, EP
Overview
• FEM/RU complex – How many?
• Readout network – How big? Which components?
• Level 2/3 farm & Sub-farm Controllers – How fast? How many?
• Conclusions
Acronyms
• Readout Unit – RU
• Readout Network – RN
• Level 1 – L1
• Front-end Multiplexer – FEM
• Gigabit Ethernet – GigE
• More non-sense Acronyms - MNSA
• Sub-farm Controller – SFC
• Network Processor – NP
• SpecInt95 – SI95 (benchmark obtained from a standard test-suite of applications, normalised to the performance of a SPARCstation 10/40 with a 40 MHz SuperSPARC; this machine takes 48 h to run the suite)
DAQ Architecture
[Figure: LHCb DAQ architecture block diagram. Recoverable labels:]
• LHCb Detector: VELO, TRACK, ECAL, HCAL, MUON, RICH; 40 MHz, data rate 40 TB/s
• Level 0 Trigger: fixed latency 4.0 µs, output 1 MHz, data rate 1 TB/s
• Level 1 Trigger: variable latency <2 ms, output 40-100 kHz
• Front-End Electronics (Level-0, Level-1), Timing & Fast Control, Throttle
• Front End Links, Front-End Multiplexers (FEM), Read-out units (RU): 6-15 GB/s
• Read-out Network (RN): 6-15 GB/s
• Sub-Farm Controllers (SFC) and CPUs: Trigger Level 2 & 3, Event Filter; variable latency, L2 ~10 ms, L3 ~200 ms
• Storage: 50 MB/s
• Control & Monitoring (LAN)
Basic Parameters & Features: Recap
• L1 trigger rate 40 kHz upgradeable to 100 kHz
• Total raw data rate from all L1 boards approximately 4 GB/s
• Asynchronous
• Push-through
• Strict separation between control and data paths
• Overflow avoidance via central throttling
FEM/RU Complex
The FEM/RU complex
• Transports data from L1 links to the Readout Network
• Performs some multiplexing
• Assigns destinations (SFC)
• Is segmented according to the partitioning scheme of LHCb
All numbers shown in this section are for a system which puts 40 MB/s onto the output of an RU
“Generic” Sub-detector: From L1 Links to the FEM/RU
[Figure: 1…108 L1 board output links feed 0…25 FEMs, each FEM drawn with ports 0…7; total 373 links from L1 front-end]
“Generic” Sub-detector: From FEM/L1 to the Readout Unit
[Figure: 1…60 L1/FEM output links feed 1…30 RUs, each RU drawn with inputs 1…7 and one GigE link to the RN; total ~95 links to RN]
Sub-Detector FEM/RU System
• A “generic sub-detector” from the Dataflow-system’s point of view is one of: VELO, IT, OT, RICH1, RICH2, SPD/PS, ECAL, HCAL, MUON, L0-Trigger, L1-Trigger, Readout Supervisor
• A “generic sub-detector” has
– 1…108 output links from Level 1
– 0…25 Front-end Multiplexers with a multiplexing factor between 2 and 4
– 1…30 Readout Units, with a multiplexing factor between 1 and 7
Building the FEM/RU System
• Baseline for the building block is a Network Processor based module with 4 or 8 Gigabit Ethernet ports (each module consists of 1 or 2 mezzanine cards with 1 NP and 4 GigE ports and 1 carrier board)
• The multiplexing factor is thus between 1 and 7
• System design proceeds by
1. fixing the output bandwidth from a RU
2. optimising the number of 4-port carrier cards
3. taking into account partitioning (2 partitions must not share a RU)
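The board counting behind this procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure actually used: a module with multiplexing factor m is assumed to need m + 1 GigE ports (m inputs, 1 output), each mezzanine card provides 4 ports, and one carrier board is assumed to hold up to 2 mezzanine cards. The helper name `size_stage` is hypothetical.

```python
from math import ceil

PORTS_PER_MEZZANINE = 4      # from the slide: 1 NP and 4 GigE ports per mezzanine
MEZZANINES_PER_CARRIER = 2   # assumption: a carrier board holds up to 2 mezzanines


def size_stage(n_inputs: int, mux_factor: int) -> dict:
    """Count modules, mezzanines and carriers for one multiplexing stage.

    Hypothetical helper: assumes mux_factor inputs plus 1 output per module.
    """
    if not 1 <= mux_factor <= 7:
        raise ValueError("an 8-port module supports multiplexing factors 1..7")
    n_modules = ceil(n_inputs / mux_factor)
    ports_per_module = mux_factor + 1
    mezz_per_module = ceil(ports_per_module / PORTS_PER_MEZZANINE)
    n_mezzanines = n_modules * mezz_per_module
    n_carriers = ceil(n_mezzanines / MEZZANINES_PER_CARRIER)
    return {"modules": n_modules, "mezzanines": n_mezzanines, "carriers": n_carriers}


# 25 input links multiplexed 4:1, and 60 input links multiplexed 2:1
stage_a = size_stage(n_inputs=25, mux_factor=4)
stage_b = size_stage(n_inputs=60, mux_factor=2)
print(stage_a)
print(stage_b)
```

With these assumptions, 25 inputs at a factor of 4 need 7 modules with 2 mezzanines each, while 60 inputs at a factor of 2 need 30 single-mezzanine modules that can share 15 carrier boards.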
FEM/RU System for 40 MB/s Output Bandwidth
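| Quantity | Velo | IT | OT | RICH1 | RICH2 | SPD/PS | ECAL | HCAL | Muon | Level-0 | Level-1 | Readout Supervisor | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| L1 Boards | 100 | 108 | 60 | 21 | 34 | 8 | 14 | 4 | 10 | 3 | 1 | 10 | 373 |
| Data Rate/L1 Board [MB/s] | 2.4 | 3.9 | 17 | 11 | 11 | 28 | 9.6 | 17 | 7.2 | 10 | 10 | 10 | |
| Fragment Size/L1 Board [kB] | 0.06 | 0.0975 | 0.425 | 0.275 | 0.275 | 0.7 | 0.24 | 0.425 | 0.18 | 0.25 | 0.25 | 0.25 | |
| Total Rate [MB/s] | 240 | 421.2 | 1020 | 231 | 374 | 224 | 134.4 | 68 | 72 | 30 | 10 | 10 | 2835 |
| Target RU output BW [MB/s] | 40 | 40 | 40 | 40 | 40 | 40 | 40 | 40 | 40 | 40 | 40 | 40 | |
| Target # RU Output Ports | 6 | 10.53 | 25.5 | 5.775 | 9.35 | 5.6 | 3.36 | 1.7 | 1.8 | 0.75 | 0.25 | 0.25 | |
| Target Mux Factor | 16.67 | 10.26 | 2.35 | 3.64 | 3.64 | 1.43 | 4.17 | 2.35 | 5.56 | 4.00 | 4.00 | 40.00 | |
| Mux Factor (FEM) | 4 | 2 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | |
| #FEMs | 25 | 54 | 60 | 21 | 34 | 8 | 7 | 4 | 10 | 3 | 1 | 10 | |
| #Mezzanines (FEMs) | 50 | 54 | 0 | 0 | 0 | 0 | 7 | 0 | 0 | 0 | 0 | 0 | |
| #Carrier Boards (FEM) | 25 | 27 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | |
| Output BW/FEM [MB/s] | 9.6 | 7.8 | 17 | 11 | 11 | 28 | 19.2 | 17 | 7.2 | 10 | 10 | 10 | |
| Mux Factor (RU) | 4 | 5 | 2 | 3 | 3 | 1 | 2 | 2 | 5 | 4 | 4 | 1 | |
| #RU Outputs | 7 | 11 | 30 | 7 | 12 | 8 | 4 | 2 | 2 | 1 | 1 | 10 | 95 |
| #Mezzanines (RUs) | 14 | 22 | 30 | 7 | 12 | 8 | 4 | 2 | 4 | 2 | 2 | 10 | |
| #Carrier Boards (RU) | 7 | 11 | 15 | 4 | 6 | 4 | 2 | 1 | 2 | 1 | 1 | 5 | |
| Output BW/RU [MB/s] | 38.4 | 39 | 34 | 33 | 33 | 28 | 38.4 | 34 | 36 | 40 | 40 | 10 | |
| Total Mux Factor | 16 | 10 | 2 | 3 | 3 | 1 | 4 | 2 | 5 | 4 | 4 | 1 | |
| #Mezzanines | 64 | 76 | 30 | 7 | 12 | 8 | 11 | 2 | 4 | 2 | 2 | 10 | 228 |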
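The three "Target" rows appear to follow from simple arithmetic on the board count and the per-board data rate. The sketch below recomputes them for the first three sub-detectors, using values from the table; the function name `ru_targets` is hypothetical, and the formulas are inferred from the numbers rather than stated in the slides.

```python
def ru_targets(n_boards: int, rate_per_board: float, target_bw: float):
    """Return (total rate, target number of RU output ports, target mux factor).

    Inferred relations (an assumption, not quoted from the slides):
      total rate   = boards * rate per board
      target ports = total rate / target RU output bandwidth
      target mux   = target RU output bandwidth / rate per board
    """
    total_rate = n_boards * rate_per_board
    return total_rate, total_rate / target_bw, target_bw / rate_per_board


TARGET_BW = 40.0  # MB/s per RU output, as in this table

for name, n_boards, rate in [("Velo", 100, 2.4), ("IT", 108, 3.9), ("OT", 60, 17.0)]:
    total, ports, mux = ru_targets(n_boards, rate, TARGET_BW)
    print(f"{name}: total {total:.1f} MB/s, target ports {ports:.2f}, target mux {mux:.2f}")
```

The actual multiplexing factors chosen in the table are integers at or below these targets, split between the FEM and RU stages.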
FEM/RU System for 60 MB/s Output Bandwidth
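| Quantity | Velo | IT | OT | RICH1 | RICH2 | SPD/PS | ECAL | HCAL | Muon | Level-0 | Level-1 | Readout Supervisor | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| L1 Boards | 100 | 108 | 60 | 21 | 34 | 8 | 14 | 4 | 10 | 3 | 1 | 10 | 373 |
| Data Rate/L1 Board [MB/s] | 2.4 | 3.9 | 17 | 11 | 11 | 28 | 9.6 | 17 | 7.2 | 10 | 10 | 10 | |
| Fragment Size/L1 Board [kB] | 0.06 | 0.0975 | 0.425 | 0.275 | 0.275 | 0.7 | 0.24 | 0.425 | 0.18 | 0.25 | 0.25 | 0.25 | |
| Total Rate [MB/s] | 240 | 421.2 | 1020 | 231 | 374 | 224 | 134.4 | 68 | 72 | 30 | 10 | 10 | 2835 |
| Target RU output BW [MB/s] | 60 | 60 | 60 | 60 | 60 | 60 | 60 | 60 | 60 | 60 | 60 | 60 | |
| Target # RU Output Ports | 4.00 | 7.02 | 17.00 | 3.85 | 6.23 | 3.73 | 2.24 | 1.13 | 1.20 | 0.50 | 0.17 | 0.17 | |
| Target Mux Factor | 25.00 | 15.38 | 3.53 | 5.45 | 5.45 | 2.14 | 6.25 | 3.53 | 8.33 | 6.00 | 6.00 | 60.00 | |
| Mux Factor (FEM) | 5 | 5 | 1 | 1 | 1 | 1 | 2 | 1 | 2 | 2 | 1 | 1 | |
| #FEMs | 20 | 22 | 60 | 21 | 34 | 8 | 7 | 4 | 5 | 2 | 1 | 10 | |
| #Mezzanines (FEMs) | 40 | 44 | 0 | 0 | 0 | 0 | 7 | 0 | 5 | 2 | 0 | 0 | |
| #Carrier Boards (FEM) | 20 | 22 | 0 | 0 | 0 | 0 | 4 | 0 | 3 | 1 | 0 | 0 | |
| Output BW/FEM [MB/s] | 12 | 19.5 | 17 | 11 | 11 | 28 | 19.2 | 17 | 14.4 | 20 | 10 | 10 | |
| Mux Factor (RU) | 5 | 3 | 3 | 5 | 5 | 2 | 3 | 3 | 4 | 3 | 1 | 1 | |
| #RU Outputs | 4 | 8 | 20 | 5 | 7 | 4 | 3 | 2 | 2 | 1 | 1 | 10 | 67 |
| #Mezzanines (RUs) | 8 | 8 | 20 | 10 | 14 | 4 | 3 | 2 | 4 | 1 | 1 | 10 | |
| #Carrier Boards (RU) | 4 | 4 | 10 | 5 | 7 | 2 | 2 | 1 | 2 | 1 | 1 | 5 | |
| Output BW/RU [MB/s] | 60 | 58.5 | 51 | 55 | 55 | 56 | 57.6 | 51 | 57.6 | 60 | 10 | 10 | |
| Total Mux Factor | 25 | 15 | 3 | 5 | 5 | 2 | 6 | 3 | 8 | 6 | 1 | 1 | |
| #Mezzanines | 48 | 52 | 20 | 10 | 14 | 4 | 10 | 2 | 9 | 3 | 1 | 10 | 183 |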
RU/FEM System Summary
• 373 L1 boards give a total average data rate of 2835 MB/s at 40 kHz
– average event size 71 kB
• Fixing the average output bandwidth on the RU link to 40 (60) MB/s results in 95 (67) output links to the Readout Network
• This number takes into account
– partitioning at the level of sub-detectors
– multiplexing factors up to 7
– minimisation of NP carrying mezzanine cards, i.e. cost (228/183 in total)
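As a cross-check of these figures, the short sketch below recomputes the average event size and compares the quoted link counts with the ideal minimum that would be reached if all links could be filled exactly to the target bandwidth. The helper names are hypothetical; the input numbers are the ones quoted above.

```python
from math import ceil

TOTAL_RATE_MB_S = 2835.0   # total average data rate from 373 L1 boards
L1_RATE_KHZ = 40.0         # Level-1 accept rate


def average_event_size_kb(total_rate_mb_s: float, trigger_rate_khz: float) -> float:
    # MB/s divided by kHz gives kB per event (both carry a factor of 1000)
    return total_rate_mb_s / trigger_rate_khz


def ideal_min_links(total_rate_mb_s: float, link_bw_mb_s: float) -> int:
    # Lower bound only: ignores partitioning and integer multiplexing factors
    return ceil(total_rate_mb_s / link_bw_mb_s)


event_size = average_event_size_kb(TOTAL_RATE_MB_S, L1_RATE_KHZ)
print(f"average event size: {event_size:.1f} kB")
for bw, quoted in [(40.0, 95), (60.0, 67)]:
    print(f"{bw:.0f} MB/s per link: ideal minimum {ideal_min_links(TOTAL_RATE_MB_S, bw)}, quoted {quoted}")
```

The difference between the ideal minimum and the quoted link counts is the price of the constraints listed above: sub-detectors may not share an RU, and multiplexing factors must be whole numbers no larger than 7.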
Readout Network
The Readout Network
• Must connect ~95 RUs to ~100 SFCs
• Consists of point-to-point GigE links
• Uses a custom light-weight connection-less protocol on top of raw Ethernet frames
• Is asynchronous and relies on back-pressure (via flow-control) to avoid buffer-overflows
• Must be able to perform non-blocking switching at least up to O(10) GB/s
From the RU/SFC point of view, the RN is just a ~100 x 100 port Gigabit Ethernet switch
Building a large GigE Switch
• Monolithic switches of this size are still not very common and very expensive (but they do exist, e.g. from ALCATEL and CISCO)
• Medium size commercial switch (e. g. Foundry FastIron) with 120 GigE ports
• Small switches like our standard NP based module (8 GigE ports)
• Possibly future custom modules based on next generation NPs (up to 20 ports)
• Any building block has to fulfil the basic requirements: non-blocking, flow-control, full line-speed:
– For our NP based module we know that it complies
– For commercial switches this must be / has been tested
• If the requirements are met, the only criterion is the cost per usable port
Topology of the Switching Network
• All numbers in the following are based on a Banyan (= a fully connected, equal-size layer) network topology, assuming a maximum load of 40 MB/s on each output from a RU
• We have seen (J-P. Dufey’s presentation), that one can do better (taking into account the uni-directional data-flow)
Evolution of relative costs
[Figure: relative cost (axis 0 to 4.5) versus required bandwidth w.r.t. 4 GB/s (axis 1 to 3), for four switching elements: 4x4 (2 NPs 1st Generation), 5x5 (1 NP 2nd Generation), 10x10 (2 NPs 2nd Generation), 60x60 (Foundry BigIron)]
Number of elementary switching elements needed to go from 4 to 12 GB/s
[Figure: number of items (axis 0 to 250) versus required bandwidth w.r.t. 4 GB/s (axis 1 to 3), for 4x4 boards, 5x5 boards, 10x10 boards and 60x60 switches; annotations: “200 modules”, “4 switches”]
Cost of usable port
For a Banyan 96 x 96 port system at 40 MB/s load
– using Foundry BigIron: 4 switches needed, 120 x 120 usable ports: 2200$ (copper) or 2500$ (SX fibre) per port (list-price)
– using 4 x 4 NP based module: 92 modules needed, 92 x 92 usable ports: ~2400$ (SX fibre) per port (estimate)
Note however:
• Big Iron does not seem to fulfil our requirements (flow control!)
• NP based system can do final event-building (see later)
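The slides do not state how the element counts were derived. One simple counting model that reproduces them is sketched below, purely as an assumption: a network of k x k elements needs enough layers for k to the power of the layer count to reach the port count, with ceil(N / k) elements in every layer. The function name `banyan_size` is hypothetical.

```python
def banyan_size(n_ports: int, k: int) -> dict:
    """Count elements in an equal-size-layer network of k x k switches.

    Assumed model: layers = smallest L with k**L >= n_ports,
    elements per layer = ceil(n_ports / k).
    """
    layers, reach = 1, k
    while reach < n_ports:
        layers += 1
        reach *= k
    per_layer = -(-n_ports // k)  # integer ceiling division
    return {
        "layers": layers,
        "elements": layers * per_layer,
        "usable_ports": per_layer * k,
    }


big_switch = banyan_size(n_ports=96, k=60)   # 60 x 60 commercial switch
np_module = banyan_size(n_ports=92, k=4)     # 4 x 4 NP based module
print(big_switch)
print(np_module)
```

Under this model, 60 x 60 elements give 4 switches and 120 usable ports for a 96-port system, and 4 x 4 elements give 92 modules when the system is sized for 92 ports. Sized for exactly 96 ports, the same model would ask for 96 modules, so the match with the quoted 92 is only approximate.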
Readout Network Summary
• Fairly large ~ 100 x 100 Gigabit Ethernet Switch – most likely not monolithic
• Need non-blocking, line-speed switching, flow-control and reasonably large buffers
• Optical connectors very much preferred (but price!)
• NP based modules fulfil all requirements
• Optimised topology saves in switch ports
• Ultimate decision will evidently be based on price per usable port (provided other requirements are met)
L2/L3 Farm
Event building and Level 2 & 3 Farm
• Data from the RN are delivered to a specific part of the Level 2 & 3 farm
• The entry point towards the RN is the Subfarm Controller (SFC) (RUs know only about SFCs)
• The SFC is also the gate-way to the Storage Controller(s) (SC).
• Immediately before or after an event enters the SFC, the final event building must be performed
• Load on the farm must be balanced
• The farm nodes must be controlled
Bird’s eye view of a sub-farm
[Figure: sub-farm layout showing the Readout Network Main Switch, Subfarm Controller, Subfarm Aggregation Switch, 10–20 Subfarm nodes, Storage Controller, Controls PC, Controls Network Aggregation Switch, Up-link to CERN and Up-link to Controls Network]
Anatomy of a SFC
[Figure: block diagram of a “server-like” PC. Recoverable labels:]
• Components: CPU, Memory, Local Bus (PCI/Infinibus), two Local Bridges (66/64b and 33/32b), two GigE NICs, one 100BaseT NIC
• Networks: Readout Network, Subfarm Network, Controls Network
• Data rates shown: ~60 MB/s and ~0.5 MB/s
• Annotations: “This NIC could do the final event building”; “Large buffer for load balancing”; “Not critical if Event Building done elsewhere”
A server like this can be bought today for ~5 kCHF
Subfarm Node
• is disk-less, network-booted
• needs 2 network interfaces for controls and data
• needs remote reset facility
• needs lots of memory and CPU power
• must be “cheap” in terms of:
– price per MIPS
– floor-space
– cooling, power, maintenance
• possible physical realizations include:
– rack-mounted (1U) servers
– standard boxes, “pizza-boxes”
– “naked motherboards” on a carrier board, crate based
– micro-server blades
– etc.
Moore’s Law
608 SpecInt2000 (roughly ~ 60 SpecInt95)
Complete system (standard box) ~ 2400 CHF today!
Further Components of the L2/L3 Farm
• Storage Controller: 1 or more multi T-Byte disk servers with connection to the high band-width link to the permanent storage facility
• Controls and Sub-farm aggregation switches: Edge switches with typically 2 1000BaseT up-links and ~20 100BaseT links (these are already almost commodity items)
• Controls PC: Server PC to control an entire sub-farm – will run standard ECS/SCADA system. (if needed for performance reasons several Control PCs can share the control of a sub-farm)
Final Event Building
Concatenation of fragments from RUs to one event:
• Using the SFC CPU (sorting & memory copy)
• Using “smart” = programmable NICs (event-building done during DMA)
• Using a final stage of NP based modules as 4 to 4 event-builders
[Figure: NP-based 8 port module; SFC CPU]
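The first option, sorting and copying in software, can be sketched as follows. This is an illustration only: it assumes every fragment arrives tagged with an event number and a source (RU) identifier, which the slides do not specify, and the class name `EventBuilder` is hypothetical.

```python
from collections import defaultdict
from typing import Optional


class EventBuilder:
    """Collect one fragment per source, then concatenate them in source order."""

    def __init__(self, n_sources: int) -> None:
        self.n_sources = n_sources
        # event_id -> {source_id: payload}; fragments may arrive in any order
        self.pending = defaultdict(dict)

    def add_fragment(self, event_id: int, source_id: int, payload: bytes) -> Optional[bytes]:
        """Store a fragment; return the full event once all sources have reported."""
        fragments = self.pending[event_id]
        fragments[source_id] = payload
        if len(fragments) < self.n_sources:
            return None
        del self.pending[event_id]
        # "sorting & memory copy": order by source, then join into one buffer
        return b"".join(fragments[s] for s in sorted(fragments))


builder = EventBuilder(n_sources=3)
assert builder.add_fragment(7, 2, b"C") is None
assert builder.add_fragment(7, 0, b"A") is None
assert builder.add_fragment(8, 1, b"x") is None   # a different event interleaves
event = builder.add_fragment(7, 1, b"B")
print(event)
```

A real implementation would also need time-outs for incomplete events and bounded buffers; both are left out here to keep the sketch short.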
Size of L2/L3 farm
• Assuming 10000 SI95 for L2, 25000 SI95 for L3 and 50000 SI95 for Reconstruction, spread over ~100 sub-farms, results in ~850 SI95 units per sub-farm
• Assuming 55 SI95 for a farm node: 20 nodes per sub-farm (including a comfortable safety margin)
• 100 SFCs (500 kCHF), 100 edge switches (300 kCHF), 2000 farm-nodes (4000 kCHF) = 4.8 MCHF (total cost of farm) (TODAY!!!)
• These numbers are approximate and the demand for CPU will perhaps be higher but a high performing farm could be built today at reasonable cost
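The arithmetic behind these estimates can be reproduced in a few lines. The unit prices for the edge switch and the farm node are not quoted directly; they are derived here from the slide totals (300 kCHF / 100 and 4000 kCHF / 2000) and should be read as assumptions. All variable names are illustrative.

```python
from math import ceil

CPU_DEMAND_SI95 = {"L2": 10_000, "L3": 25_000, "Reconstruction": 50_000}
N_SUBFARMS = 100            # one sub-farm per SFC
NODE_SI95 = 55              # assumed CPU power of one farm node
NODES_PER_SUBFARM = 20      # chosen value, includes the safety margin

# Unit prices in kCHF; switch and node prices are derived from the totals
UNIT_COST_KCHF = {"SFC": 5, "edge switch": 3, "farm node": 2}

total_si95 = sum(CPU_DEMAND_SI95.values())
si95_per_subfarm = total_si95 / N_SUBFARMS
min_nodes = ceil(si95_per_subfarm / NODE_SI95)   # before any safety margin

total_cost_kchf = (
    N_SUBFARMS * UNIT_COST_KCHF["SFC"]
    + N_SUBFARMS * UNIT_COST_KCHF["edge switch"]
    + N_SUBFARMS * NODES_PER_SUBFARM * UNIT_COST_KCHF["farm node"]
)

print(f"{si95_per_subfarm:.0f} SI95 per sub-farm, at least {min_nodes} nodes")
print(f"total farm cost: {total_cost_kchf / 1000:.1f} MCHF")
```

The bare minimum is 16 nodes per sub-farm, so choosing 20 leaves a margin of about a quarter.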
L2/L3 Farm Summary
• The L2/L3 farm is composed of sub-farms
• It maintains the separation between control and data network
• It consists of ~100 SFCs, strong in I/O, and ~O(2000) nodes, strong in CPU/memory and an aggregation switch per sub-farm
• It is scalable, hierarchically organised, uniform, hence easy to configure, control and monitor
Conclusions (1)
• The data flow system is based on Gigabit Ethernet, most likely over cheap multi-mode fibres (1000BaseSX)
• The data flow system consists of 3 main parts:
1. FEM/RU complex, which consists of NP based modules and multiplexes several L1 links to 1 output link from a Readout Unit
2. A Gigabit Ethernet Switch Fabric, most likely composed of several smaller sub-units
3. A large compute farm, decomposed into sub-farms, load-balanced by Sub-farm Controllers
Conclusions (2)
• FEM/RU complex will be built of NP-based modules, whose performance has been established to be largely sufficient
• The main switching network will be built in an optimised topology, using either commercial switches or NP-based modules, depending on cost and performance
• The Subfarm will be implemented from server PCs as Subfarm-Controllers and PC like farm-nodes, connected by moderate sized edge switches. All these components exist at reasonable prices already today