1
Farm Issues
L1 & HLT Implementation Review
Niko Neufeld, CERN-EP
Tuesday, April 29th
2
Overview
• Requirements
• Architecture
• Protocols & Dataflow
• Latencies
• Implementation
3
Requirements
• Scalable up to several thousand CPUs
• Organised in sub-farms, which perform local load balancing and hide large numbers from the RUs
• Support partitioning (at the sub-farm level)
• Interface to the throttle via the Experiment Control System (ECS)
• Must fit within the cooling and space limits in UX8
• Low latency for data movement and latency control
• Allow concurrent, seamless usage for L1 and HLT algorithms, running a standard OS, while prioritising L1 traffic wherever possible
4
The Event Filter Farm
[Diagram: The Event Filter Farm. Front-end electronics (FE) feed a multiplexing layer of ~30 switches; Level-1 traffic arrives on 125-239 links at 1.1 MHz (8.8-16.9 GB/s) and HLT traffic on 349 links at 40 kHz (2.3 GB/s). Network processors (NPs) form the Event Builder, and the Readout Network delivers assembled events over 50-100 Gb Ethernet links (5.5-10 GB/s) to 50-100 SFCs serving ~1200 farm CPUs. The TFC system, the TRM, the L1-decision sorter and the Storage System are also shown.]
5
Architecture - Farm
[Diagram: farm architecture. The Event Builder (NPs) feeds the Readout Network through a layer of switches; each sub-farm consists of an SFC with its own switch connecting a group of worker CPUs. A Storage Controller and the ECS are also connected.]
6
Structure
• The (initially) 1200 CPUs are distributed over several sub-farms
• To minimise the number of inputs from the event-building network, the number of sub-farms is chosen such that the average link load into a sub-farm is close to 110 MB/s (a rough check follows below)
  – For the minimal system (VELO + TT), this yields ~50 sub-farms
  – It is also advantageous to minimise the number of sub-farms (while keeping the number of CPUs constant) from the point of view of the local load balancing (see later)
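As a rough cross-check, using the aggregate readout-network figures from the farm diagram above: 5.5 GB/s of event-builder output divided by ~110 MB/s per link gives 5.5 GB/s / 0.11 GB/s ≈ 50 links, consistent with the ~50 sub-farms quoted for the minimal system; the 10 GB/s upper figure correspondingly gives ≈ 90.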
7
Anatomy of a sub-farm
• Each sub-farm consists of a gateway to the event builder, the Sub-farm Controller (SFC), and worker CPUs
• Each sub-farm handles an aggregated data stream of approximately two Gigabit/s (one in, one out) (see later)
• The SFC is connected to the worker CPUs by a switch (Ethernet, Layer 2)
• A completely separate network connects the SFC and the worker nodes to the ECS
• The transport protocol is light-weight, sitting directly on top of Ethernet (or raw IP if necessary); no TCP (see the sketch below)
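As an illustration of what such a light-weight transport could look like on a Linux node (a sketch only, not the actual LHCb protocol code; the EtherType 0x88B5, the interface name "eth0" and the destination MAC are placeholder assumptions), one event fragment can be sent as a raw Ethernet frame:

```c
/* Minimal sketch: send one event fragment as a raw Ethernet frame.
 * Hypothetical example only -- EtherType 0x88B5 (local experimental use),
 * interface "eth0" and the destination MAC are placeholders.
 * Requires root or CAP_NET_RAW. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/ioctl.h>
#include <arpa/inet.h>
#include <linux/if_packet.h>
#include <linux/if_ether.h>
#include <net/if.h>

int send_fragment(const unsigned char *payload, size_t len)
{
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (fd < 0) { perror("socket"); return -1; }

    struct ifreq ifr;                               /* look up the interface index */
    memset(&ifr, 0, sizeof ifr);
    strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);
    if (ioctl(fd, SIOCGIFINDEX, &ifr) < 0) { perror("ioctl"); close(fd); return -1; }

    unsigned char frame[ETH_FRAME_LEN];
    struct ethhdr *eh = (struct ethhdr *)frame;
    const unsigned char dst[ETH_ALEN] = {0x02,0x00,0x00,0x00,0x00,0x01}; /* placeholder MAC */
    memcpy(eh->h_dest, dst, ETH_ALEN);
    memset(eh->h_source, 0, ETH_ALEN);              /* a real sender fills in its own MAC */
    eh->h_proto = htons(0x88B5);                    /* experimental EtherType */
    memcpy(frame + ETH_HLEN, payload, len);         /* event data follows the header */

    struct sockaddr_ll addr;
    memset(&addr, 0, sizeof addr);
    addr.sll_family  = AF_PACKET;
    addr.sll_ifindex = ifr.ifr_ifindex;
    addr.sll_halen   = ETH_ALEN;
    memcpy(addr.sll_addr, dst, ETH_ALEN);

    ssize_t n = sendto(fd, frame, ETH_HLEN + len, 0,
                       (struct sockaddr *)&addr, sizeof addr);
    close(fd);
    return n < 0 ? -1 : 0;
}
```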
8
Dataflow
• Completely assembled events are sent to the SFC as (several) raw Ethernet frames
• The SFC keeps a list of idle worker CPUs and forwards the event to a node (see the dispatch sketch below)
  – A node buffers only a single L1 event at any time (latency!). When no node is free, the event is buffered in the SFC and accumulates extra latency
  – A node buffers several HLT events (50 to 100). When all buffers are full, events are buffered in the SFC
• When the high-water marks of the SFC buffer are reached, a throttle signal is issued via the ECS
• The worker CPU processes the event and always sends an answer (= decision)
  – For an L1 event the answer is only yes or no, plus a short summary for the L1 sorter
  – For an HLT event the positive answer contains the raw and reconstructed event data as well
• The SFC forwards L1 decisions to the L1-decision sorter and HLT-accepted events to the storage controller
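A schematic sketch of this forwarding policy in C (every name below, such as worker_t or ecs_throttle, is invented for illustration; this is not the actual SFC software):

```c
/* Schematic sketch of the SFC forwarding policy for L1 events.
 * Illustration only -- all names below are invented for this note. */
#include <stdbool.h>
#include <stddef.h>

#define N_WORKERS       20    /* worker CPUs behind this SFC            */
#define SFC_QUEUE_LEN   1024  /* L1 events that may wait in the SFC     */
#define HIGH_WATER_MARK 900   /* above this, throttle via the ECS       */

typedef struct { int id; bool busy_with_l1; } worker_t;
typedef struct event event_t;                 /* opaque raw event data  */

static worker_t workers[N_WORKERS];
static event_t *sfc_queue[SFC_QUEUE_LEN];
static size_t   queued = 0;

/* Stubs standing in for the real network send and ECS signalling. */
static void send_to_worker(worker_t *w, event_t *ev) { (void)w; (void)ev; }
static void ecs_throttle(bool on)                    { (void)on; }

void dispatch_l1(event_t *ev)
{
    /* An L1 event goes straight to an idle node: each node buffers only
       one L1 event at a time, to keep its latency bounded. */
    for (size_t i = 0; i < N_WORKERS; i++) {
        if (!workers[i].busy_with_l1) {
            workers[i].busy_with_l1 = true;
            send_to_worker(&workers[i], ev);
            return;
        }
    }
    /* No free node: the event waits in the SFC and accumulates latency.
       (A real SFC would use a FIFO; a plain array suffices for the sketch.) */
    if (queued < SFC_QUEUE_LEN)
        sfc_queue[queued++] = ev;
    if (queued >= HIGH_WATER_MARK)
        ecs_throttle(true);            /* ask the TFC/ECS to slow the trigger */
}

/* Called when a worker has returned its L1 decision. */
void on_l1_decision(worker_t *w)
{
    w->busy_with_l1 = false;
    if (queued > 0)
        dispatch_l1(sfc_queue[--queued]);
    if (queued < HIGH_WATER_MARK)
        ecs_throttle(false);
}
```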
9
Latencies
[Diagram: the sub-farm data path again (Readout Network switches, SFCs, sub-farm switches, CPU nodes), annotated with where latency is accumulated]
The latency contributions are:
• Reception of the event and invocation of the trigger algorithm
• Queuing in the SFC ("all nodes are busy with an L1 event")
• Transmitting from the switch to the CPU node via Ethernet
• Forwarding in the sub-farm switch
10
Minimising the Latencies
• Transmitting of events (transport time)
  – use Gigabit Ethernet for the internal sub-farm network (although the links are loaded only to a few %)
• Reception of events and invocation of the trigger algorithm
  – use raw Ethernet/IP, zero-copy sockets, etc.
  – use real-time scheduling, pre-emptive system calls and low context-switching latency (see the sketch below)
• Queuing due to statistical fluctuations in the processing time of earlier events
  – keep the number of nodes in a sub-farm high, i.e. keep the number of sub-farms low
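A minimal sketch of the real-time scheduling point under Linux (standard POSIX API; the priority value 80 and leaving the HLT task at normal priority are illustrative assumptions, not the production configuration):

```c
/* Sketch: run the L1 task under SCHED_FIFO so it pre-empts the HLT task.
 * Needs root (or CAP_SYS_NICE); the priority value 80 is an arbitrary choice. */
#include <stdio.h>
#include <sched.h>

int make_l1_task_realtime(void)
{
    struct sched_param sp = { .sched_priority = 80 };
    if (sched_setscheduler(0 /* this process */, SCHED_FIFO, &sp) != 0) {
        perror("sched_setscheduler");
        return -1;
    }
    /* The HLT task stays in the default SCHED_OTHER class, so it only runs
       while no L1 event is being processed on this node. */
    return 0;
}
```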
11
Context Switching Latency
• What is it?
  – On a multi-tasking OS, whenever the OS switches from one process to another, it needs a certain time to do so
• Why do we worry?
  – Because we run the L1 and the HLT algorithms concurrently on each CPU node
• Why do we want this concurrency?
  – We want to minimise the idle time of the CPUs
  – We cannot use double buffering in L1 (the latency budget would be halved!)
12
Scheduling and Latency
• Using Linux 2.5.55 we have established two facts about the scheduler:
  – Real-time priorities work: the L1 task is never interrupted until it finishes
  – The context-switch latency is low: 10.1 ± 0.2 µs
• These measurements were done on a high-end server (2.4 GHz Pentium 4 Xeon, 400 MHz FSB); we should have machines at least 2x faster in 2007
• Conclusion: the scheme of running both tasks concurrently is sound (an illustrative measurement sketch follows below)
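For illustration, a rough pipe ping-pong benchmark of the kind commonly used to estimate context-switch cost (not the measurement behind the 10.1 µs figure above):

```c
/* Rough context-switch benchmark: two processes ping-pong one byte over a
 * pair of pipes, so each round trip forces at least two context switches.
 * Illustrative only -- not the measurement quoted on the slide. */
#include <stdio.h>
#include <unistd.h>
#include <time.h>

int main(void)
{
    int p2c[2], c2p[2];
    char b = 0;
    const long rounds = 100000;

    if (pipe(p2c) || pipe(c2p)) { perror("pipe"); return 1; }

    if (fork() == 0) {                       /* child: echo every byte back */
        for (long i = 0; i < rounds; i++) {
            if (read(p2c[0], &b, 1) != 1) break;
            write(c2p[1], &b, 1);
        }
        _exit(0);
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < rounds; i++) {      /* parent: send a byte and wait */
        write(p2c[1], &b, 1);
        read(c2p[0], &b, 1);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_nsec - t0.tv_nsec) / 1e3;
    printf("~%.2f us per round trip (>= 2 context switches + pipe overhead)\n",
           us / rounds);
    return 0;
}
```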
13
Latency due to queuing

[Plot: latency distribution; 0.1 % of events have a timeout larger than the 30 ms cut-off]

Ptolemy simulation:
• Processing-time distribution derived from the number of clusters
• Assuming 9 processors and a shared L1 trigger rate of 9 kHz per sub-farm
• 10^6 L0-accepted events, one of 120 sub-farms
14
Beating the statistics of small numbers
Sub-farm now with 18 nodes, sharing ~18 kHz of L1 trigger (one of 60 sub-farms); the total number of CPUs in the system is kept constant.

Now only 0.05 % of events have a timeout larger than 30 ms, hence: minimise the number of sub-farms. (A toy simulation of this pooling effect follows below.)
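A toy Monte Carlo of the queuing described on these two slides (in C; this is not the Ptolemy simulation, and the exponential 0.7 ms mean processing time is purely an assumption of this sketch, so the absolute numbers will not reproduce the plots; it only illustrates the trend that, at fixed total CPU count, the larger pool queues less):

```c
/* Toy queuing model: Poisson L1 arrivals shared by n_cpu worker nodes,
 * FCFS queue in the SFC.  The exponential 0.7 ms mean processing time is an
 * assumption of this sketch, not the measured cluster-based distribution. */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

static double expo(double mean) { return -mean * log(1.0 - drand48()); }

static void simulate(int n_cpu, double rate_per_ms, long n_events)
{
    double free_at[64] = {0};                /* time each node is idle again (ms) */
    double t = 0, wait_sum = 0;
    long   waited = 0;

    for (long i = 0; i < n_events; i++) {
        t += expo(1.0 / rate_per_ms);        /* next Poisson arrival              */
        int best = 0;                        /* earliest-free node (FCFS service) */
        for (int k = 1; k < n_cpu; k++)
            if (free_at[k] < free_at[best]) best = k;
        double wait = free_at[best] > t ? free_at[best] - t : 0;
        if (wait > 0) { waited++; wait_sum += wait; }
        free_at[best] = t + wait + expo(0.7);  /* assumed 0.7 ms mean service     */
    }
    printf("%2d nodes at %4.1f kHz: %5.1f%% of events queue, mean wait %.2f ms\n",
           n_cpu, rate_per_ms, 100.0 * waited / n_events,
           waited ? wait_sum / waited : 0.0);
}

int main(void)
{
    srand48(1);
    simulate( 9,  9.0, 1000000);             /* one of ~120 small sub-farms  */
    simulate(18, 18.0, 1000000);             /* one of  ~60 larger sub-farms */
    return 0;
}
```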
15
Implementation
• The SFC is either a high-performance PC (better than 2 Gigabit/s sustained I/O) or a single NP module
• Farm nodes are disk-less, booted from the network, and run (most likely) Linux
  – rack-mounted PCs (1U or blade servers), single or dual CPU
• The farm will be installed in UX8
  – limits on floor/rack space and cooling power
• Joint studies of rack cooling and the physical realisation (optimal cabling, mechanics, etc.) are ongoing