1
Farm Issues
L1 & HLT Implementation Review
Niko Neufeld, CERN-EP
Tuesday, April 29th
2
Overview
• Requirements
• Architecture
• Protocols & Dataflow
• Latencies
• Implementation
3
Requirements
• Scalable up to several thousand CPUs
• Organised in sub-farms, which perform local load balancing and hide large numbers from the RUs
• Support partitioning (at the sub-farm level)
• Interface to the throttle via the Experiment Control System (ECS)
• Must fit within the cooling and space limits in UX8
• Low latency for data movement and latency control
• Allow concurrent, seamless usage for L1 and HLT algorithms, running a standard OS, while prioritising L1 traffic wherever possible
4
The Event Filter Farm
[Diagram: The Event Filter Farm. Front-end electronics (FE) feed a multiplexing layer of ~30 switches; Level-1 traffic arrives on 125-239 links at 1.1 MHz (8.8-16.9 GB/s) and HLT traffic on 349 links at 40 kHz (2.3 GB/s). Network processors (NPs) form the Event Builder, and the Readout Network delivers assembled events over 50-100 Gb Ethernet links (5.5-10 GB/s) to 50-100 SFCs serving ~1200 farm CPUs. The TFC system, the TRM, the L1-decision sorter and the Storage System are also shown.]
5
Architecture - Farm
[Diagram: farm architecture. The Event Builder (NPs) feeds the Readout Network through a layer of switches; each sub-farm consists of an SFC with its own switch connecting a group of worker CPUs. A Storage Controller and the ECS are also connected.]
6
Structure
• The (initially) 1200 CPUs are distributed over several sub-farms
• To minimise the number of inputs from the event-building network, the number of sub-farms is chosen such that the average link load into a sub-farm is close to 110 MB/s (a rough check follows below)
  – For the minimal system (VELO + TT), this yields ~50 sub-farms
  – It is also advantageous to minimise the number of sub-farms (while keeping the number of CPUs constant) from the point of view of the local load balancing (see later)
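As a rough cross-check, using the aggregate readout-network figures from the farm diagram above: 5.5 GB/s of event-builder output divided by ~110 MB/s per link gives 5.5 GB/s / 0.11 GB/s ≈ 50 links, consistent with the ~50 sub-farms quoted for the minimal system; the 10 GB/s upper figure correspondingly gives ≈ 90.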
7
Anatomy of a sub-farm
• Each sub-farm consists of a gateway to the event builder, the Sub-farm Controller (SFC), and worker CPUs
• Each sub-farm handles an aggregated data stream of approximately two Gigabit/s (one in, one out) (see later)
• The SFC is connected to the worker CPUs by a switch (Ethernet, Layer 2)
• A completely separate network connects the SFC and the worker nodes to the ECS
• The transport protocol is light-weight, sitting directly on top of Ethernet (or raw IP if necessary); no TCP (see the sketch below)
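As an illustration of what such a light-weight transport could look like on a Linux node (a sketch only, not the actual LHCb protocol code; the EtherType 0x88B5, the interface name "eth0" and the destination MAC are placeholder assumptions), one event fragment can be sent as a raw Ethernet frame:

```c
/* Minimal sketch: send one event fragment as a raw Ethernet frame.
 * Hypothetical example only -- EtherType 0x88B5 (local experimental use),
 * interface "eth0" and the destination MAC are placeholders.
 * Requires root or CAP_NET_RAW. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/ioctl.h>
#include <arpa/inet.h>
#include <linux/if_packet.h>
#include <linux/if_ether.h>
#include <net/if.h>

int send_fragment(const unsigned char *payload, size_t len)
{
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (fd < 0) { perror("socket"); return -1; }

    struct ifreq ifr;                               /* look up the interface index */
    memset(&ifr, 0, sizeof ifr);
    strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);
    if (ioctl(fd, SIOCGIFINDEX, &ifr) < 0) { perror("ioctl"); close(fd); return -1; }

    unsigned char frame[ETH_FRAME_LEN];
    struct ethhdr *eh = (struct ethhdr *)frame;
    const unsigned char dst[ETH_ALEN] = {0x02,0x00,0x00,0x00,0x00,0x01}; /* placeholder MAC */
    memcpy(eh->h_dest, dst, ETH_ALEN);
    memset(eh->h_source, 0, ETH_ALEN);              /* a real sender fills in its own MAC */
    eh->h_proto = htons(0x88B5);                    /* experimental EtherType */
    memcpy(frame + ETH_HLEN, payload, len);         /* event data follows the header */

    struct sockaddr_ll addr;
    memset(&addr, 0, sizeof addr);
    addr.sll_family  = AF_PACKET;
    addr.sll_ifindex = ifr.ifr_ifindex;
    addr.sll_halen   = ETH_ALEN;
    memcpy(addr.sll_addr, dst, ETH_ALEN);

    ssize_t n = sendto(fd, frame, ETH_HLEN + len, 0,
                       (struct sockaddr *)&addr, sizeof addr);
    close(fd);
    return n < 0 ? -1 : 0;
}
```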
8
Dataflow
• Completely assembled events are sent to the SFC as (several) raw Ethernet frames
• The SFC keeps a list of idle worker CPUs and forwards the event to a node (see the dispatch sketch below)
  – A node buffers only a single L1 event at any time (latency!). When no node is free, the event is buffered in the SFC and accumulates extra latency
  – A node buffers several HLT events (50 to 100). When all buffers are full, events are buffered in the SFC
• When the high-water marks of the SFC buffer are reached, a throttle signal is issued via the ECS
• The worker CPU processes the event and always sends an answer (= decision)
  – For an L1 event the answer is only yes or no, plus a short summary for the L1 sorter
  – For an HLT event the positive answer contains the raw and reconstructed event data as well
• The SFC forwards L1 decisions to the L1-decision sorter and HLT-accepted events to the storage controller
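A schematic sketch of this forwarding policy in C (every name below, such as worker_t or ecs_throttle, is invented for illustration; this is not the actual SFC software):

```c
/* Schematic sketch of the SFC forwarding policy for L1 events.
 * Illustration only -- all names below are invented for this note. */
#include <stdbool.h>
#include <stddef.h>

#define N_WORKERS       20    /* worker CPUs behind this SFC            */
#define SFC_QUEUE_LEN   1024  /* L1 events that may wait in the SFC     */
#define HIGH_WATER_MARK 900   /* above this, throttle via the ECS       */

typedef struct { int id; bool busy_with_l1; } worker_t;
typedef struct event event_t;                 /* opaque raw event data  */

static worker_t workers[N_WORKERS];
static event_t *sfc_queue[SFC_QUEUE_LEN];
static size_t   queued = 0;

/* Stubs standing in for the real network send and ECS signalling. */
static void send_to_worker(worker_t *w, event_t *ev) { (void)w; (void)ev; }
static void ecs_throttle(bool on)                    { (void)on; }

void dispatch_l1(event_t *ev)
{
    /* An L1 event goes straight to an idle node: each node buffers only
       one L1 event at a time, to keep its latency bounded. */
    for (size_t i = 0; i < N_WORKERS; i++) {
        if (!workers[i].busy_with_l1) {
            workers[i].busy_with_l1 = true;
            send_to_worker(&workers[i], ev);
            return;
        }
    }
    /* No free node: the event waits in the SFC and accumulates latency.
       (A real SFC would use a FIFO; a plain array suffices for the sketch.) */
    if (queued < SFC_QUEUE_LEN)
        sfc_queue[queued++] = ev;
    if (queued >= HIGH_WATER_MARK)
        ecs_throttle(true);            /* ask the TFC/ECS to slow the trigger */
}

/* Called when a worker has returned its L1 decision. */
void on_l1_decision(worker_t *w)
{
    w->busy_with_l1 = false;
    if (queued > 0)
        dispatch_l1(sfc_queue[--queued]);
    if (queued < HIGH_WATER_MARK)
        ecs_throttle(false);
}
```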
9
Latencies
[Diagram: the sub-farm data path again (Readout Network switches, SFCs, sub-farm switches, CPU nodes), annotated with where latency is accumulated]
The latency contributions are:
• Reception of the event and invocation of the trigger algorithm
• Queuing in the SFC ("all nodes are busy with an L1 event")
• Transmitting from the switch to the CPU node via Ethernet
• Forwarding in the sub-farm switch
10
Minimising the Latencies
• Transmitting of events (transport time)
  – use Gigabit Ethernet for the internal sub-farm network (although the links are loaded only to a few %)
• Reception of events and invocation of the trigger algorithm
  – use raw Ethernet/IP, zero-copy sockets, etc.
  – use real-time scheduling, pre-emptive system calls and low context-switching latency (see the sketch below)
• Queuing due to statistical fluctuations in the processing time of earlier events
  – keep the number of nodes in a sub-farm high, i.e. keep the number of sub-farms low
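A minimal sketch of the real-time scheduling point under Linux (standard POSIX API; the priority value 80 and leaving the HLT task at normal priority are illustrative assumptions, not the production configuration):

```c
/* Sketch: run the L1 task under SCHED_FIFO so it pre-empts the HLT task.
 * Needs root (or CAP_SYS_NICE); the priority value 80 is an arbitrary choice. */
#include <stdio.h>
#include <sched.h>

int make_l1_task_realtime(void)
{
    struct sched_param sp = { .sched_priority = 80 };
    if (sched_setscheduler(0 /* this process */, SCHED_FIFO, &sp) != 0) {
        perror("sched_setscheduler");
        return -1;
    }
    /* The HLT task stays in the default SCHED_OTHER class, so it only runs
       while no L1 event is being processed on this node. */
    return 0;
}
```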
11
Context Switching Latency
• What is it?
  – On a multi-tasking OS, whenever the OS switches from one process to another, it needs a certain time to do so
• Why do we worry?
  – Because we run the L1 and the HLT algorithms concurrently on each CPU node
• Why do we want this concurrency?
  – We want to minimise the idle time of the CPUs
  – We cannot use double buffering in L1 (the latency budget would be halved!)
12
Scheduling and Latency
• Using Linux 2.5.55 we have established two facts about the scheduler:
  – Real-time priorities work: the L1 task is never interrupted until it finishes
  – The context-switch latency is low: 10.1 ± 0.2 µs
• These measurements were done on a high-end server (2.4 GHz Pentium 4 Xeon, 400 MHz FSB); we should have machines at least 2x faster in 2007
• Conclusion: the scheme of running both tasks concurrently is sound (an illustrative measurement sketch follows below)
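For illustration, a rough pipe ping-pong benchmark of the kind commonly used to estimate context-switch cost (not the measurement behind the 10.1 µs figure above):

```c
/* Rough context-switch benchmark: two processes ping-pong one byte over a
 * pair of pipes, so each round trip forces at least two context switches.
 * Illustrative only -- not the measurement quoted on the slide. */
#include <stdio.h>
#include <unistd.h>
#include <time.h>

int main(void)
{
    int p2c[2], c2p[2];
    char b = 0;
    const long rounds = 100000;

    if (pipe(p2c) || pipe(c2p)) { perror("pipe"); return 1; }

    if (fork() == 0) {                       /* child: echo every byte back */
        for (long i = 0; i < rounds; i++) {
            if (read(p2c[0], &b, 1) != 1) break;
            write(c2p[1], &b, 1);
        }
        _exit(0);
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < rounds; i++) {      /* parent: send a byte and wait */
        write(p2c[1], &b, 1);
        read(c2p[0], &b, 1);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_nsec - t0.tv_nsec) / 1e3;
    printf("~%.2f us per round trip (>= 2 context switches + pipe overhead)\n",
           us / rounds);
    return 0;
}
```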
13
Latency due to queuing

[Plot: latency distribution; 0.1 % of events have a timeout larger than the 30 ms cut-off]

Ptolemy simulation:
• Processing-time distribution derived from the number of clusters
• Assuming 9 processors and a shared L1 trigger rate of 9 kHz per sub-farm
• 10^6 L0-accepted events, one of 120 sub-farms
14
Beating the statistics of small numbers
Sub-farm now with 18 nodes, sharing ~18 kHz of L1 trigger (one of 60 sub-farms); the total number of CPUs in the system is kept constant.

Now only 0.05 % of events have a timeout larger than 30 ms, hence: minimise the number of sub-farms. (A toy simulation of this pooling effect follows below.)
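A toy Monte Carlo of the queuing described on these two slides (in C; this is not the Ptolemy simulation, and the exponential 0.7 ms mean processing time is purely an assumption of this sketch, so the absolute numbers will not reproduce the plots; it only illustrates the trend that, at fixed total CPU count, the larger pool queues less):

```c
/* Toy queuing model: Poisson L1 arrivals shared by n_cpu worker nodes,
 * FCFS queue in the SFC.  The exponential 0.7 ms mean processing time is an
 * assumption of this sketch, not the measured cluster-based distribution. */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

static double expo(double mean) { return -mean * log(1.0 - drand48()); }

static void simulate(int n_cpu, double rate_per_ms, long n_events)
{
    double free_at[64] = {0};                /* time each node is idle again (ms) */
    double t = 0, wait_sum = 0;
    long   waited = 0;

    for (long i = 0; i < n_events; i++) {
        t += expo(1.0 / rate_per_ms);        /* next Poisson arrival              */
        int best = 0;                        /* earliest-free node (FCFS service) */
        for (int k = 1; k < n_cpu; k++)
            if (free_at[k] < free_at[best]) best = k;
        double wait = free_at[best] > t ? free_at[best] - t : 0;
        if (wait > 0) { waited++; wait_sum += wait; }
        free_at[best] = t + wait + expo(0.7);  /* assumed 0.7 ms mean service     */
    }
    printf("%2d nodes at %4.1f kHz: %5.1f%% of events queue, mean wait %.2f ms\n",
           n_cpu, rate_per_ms, 100.0 * waited / n_events,
           waited ? wait_sum / waited : 0.0);
}

int main(void)
{
    srand48(1);
    simulate( 9,  9.0, 1000000);             /* one of ~120 small sub-farms  */
    simulate(18, 18.0, 1000000);             /* one of  ~60 larger sub-farms */
    return 0;
}
```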
15
Implementation
• The SFC is either a high-performance PC (better than 2 Gigabit/s sustained I/O) or a single NP module
• Farm nodes are disk-less, booted from the network, and run (most likely) Linux
  – rack-mounted PCs (1U or blade servers), single or dual CPU
• The farm will be installed in UX8
  – limits on floor/rack space and cooling power
• Joint studies of rack cooling and the physical realisation (optimal cabling, mechanics, etc.) are ongoing