U. Marconi, INFN Bologna
Overview
■ Introduction
■ The LHCb experiment
■ Computing Activities
●Online
●Offline
Introduction
■ Amongst the LHCb Italian groups, Bologna is the only one involved in computing activities, in both the online and offline worlds
● Project of the online computing farm of the L1 & HLT trigger
● Representative of LHCb in the INFN Tier-1 computing centre at CNAF
● Italian representative in the LHCb National Computing Board
■ We are also working to develop and test the analysis and simulation tools
Beauty Physics at LHCb
■ The aim of LHCb is to study the dynamics of the beauty quark, with the main goal of measuring CP symmetry violation in this sector
■ The LHC proton-proton collisions at 14 TeV, at a frequency of 40 MHz, can be exploited as an intense source of beauty quarks
● The expected rate of beauty production is about 0.1 MHz
● The signal-to-noise ratio at the source is expected to be of the order of 1/160
■ The useful processes are those related to the oscillations and decays of the Bd and Bs neutral mesons
● A "rare decay process" is a decay mode whose probability is of the order of, or below, 10⁻⁴
■ The rates of the interesting processes are therefore expected at or below the level of 0.1 MHz × 10⁻⁴ = 10 Hz
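The back-of-envelope rate above can be checked in a couple of lines (the numbers are the slide's order-of-magnitude values):

```python
# Rates quoted in the text (order-of-magnitude slide values)
beauty_rate_hz = 0.1e6   # expected beauty production rate, ~0.1 MHz
rare_branching = 1e-4    # "rare decay" branching fraction, ~1e-4 or below

signal_rate_hz = beauty_rate_hz * rare_branching
print(f"interesting-event rate: ~{signal_rate_hz:.0f} Hz")  # ~10 Hz
```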
CP Symmetry in (very) few words
CP symmetry works rather well but it is violated
[Diagram: P, C and CP transformations of the K_L → π e ν decay]

rate(K_L → π⁻ e⁺ ν) / rate(K_L → π⁺ e⁻ ν̄) = 1.00648 ± 0.00035
The LHCb Experiment
■ A key feature used to select the Bd and Bs meson decays is their relatively long lifetime (~1.5×10⁻¹² s)
[The neutral B mesons and their quark content: B⁰ = (b̄d), B̄⁰ = (bd̄), B⁰s = (b̄s), B̄⁰s = (bs̄)]
VELO: 21 stations (Rmin = 8 mm), 220 μm Si, R and φ strips
TT: ~1.4 × 1.2 m², Si microstrips
3 tracking stations (~65 m²)
IT: Si strips
OT: straw tubes
Trigger Architecture
■ Level-0 Trigger
● Fixed latency, 4 μs
● Reduces the ~10 MHz visible interaction rate to 1.1 MHz
● Selects:
▪ the highest-ET hadron, electron and photon
▪ the two highest-pT muons
■ Level-1 Trigger
● Variable latency, 58 ms max
● Output rate fixed at 40 kHz
● Decisions are delivered chronologically ordered
● Event selection: B vertex
■ HLT
● Variable latency
● Output rate established at 200 Hz
● Event selection: algorithms for specific decay modes
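The cascade of rates above can be laid out as a small sketch; the rates are the slide's, everything else is scaffolding:

```python
# Three-level trigger cascade with the output rates quoted above
stages = [
    ("L0",  1.1e6),   # fixed 4 us latency, from ~10 MHz visible rate
    ("L1",  40e3),    # variable latency, 58 ms max
    ("HLT", 200.0),   # variable latency, to permanent storage
]

rate_hz = 10e6  # ~10 MHz visible interaction rate entering L0
reductions = {}
for name, out_hz in stages:
    reductions[name] = rate_hz / out_hz
    print(f"{name}: {rate_hz:.3g} Hz -> {out_hz:.3g} Hz")
    rate_hz = out_hz

# Overall reduction from visible interactions to storage
print(f"total reduction: x{10e6 / 200:.0f}")
```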
L1 & HLT Implementation
■ System design ingredients:
● data rates per front-end board (at L1 and HLT) → size of the network
● protocol overheads
● required CPU power → number of CPUs → size of the farm
■ ~300 readout front-end boards to be connected to ~2000 CPUs
■ The system needs to be affordable and scalable
■ We want to use mostly commercial, commodity components
■ Solution: a large Ethernet Local Area Network (LAN) and PCs: cheap, reliable, commercial and (mostly) commodity
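The "required CPU power → number of CPUs" step is essentially Little's law: CPUs ≈ input rate × mean processing time per event. The per-event times below are illustrative assumptions, not LHCb measurements:

```python
# Farm sizing: CPUs ~ rate x per-event processing time (times are assumed)
l1_rate_hz, l1_time_s = 1.1e6, 1e-3     # L1 input rate; assumed ~1 ms/event
hlt_rate_hz, hlt_time_s = 40e3, 20e-3   # HLT input rate; assumed ~20 ms/event

cpus = l1_rate_hz * l1_time_s + hlt_rate_hz * hlt_time_s
print(f"~{cpus:.0f} CPUs needed")  # same order as the ~2000 quoted above
```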
Key Features
■ Push-through protocol: simple, scalable
■ Distributed global flow control (throttle) via the Timing and Fast Control (TFC) system: the trigger is disabled temporarily to avoid buffer overflows
■ Data are sent as IP packets: any standard network equipment can be used
■ Data for several consecutive triggers are packed into Multi Event Packets (MEPs) and sent as a single packet over the network: this reduces the packet rate and the transport overheads
■ The CPU farm is partitioned into sub-farms: this reduces the connectivity problem
■ Sub-farms are assigned centrally by the TFC system: central static load balancing
■ Event building and dynamic load balancing are done by the Subfarm Controller (SFC)
■ A single CPU farm is used for both L1 and HLT; L1 runs as a priority task, HLT in the background on each node: this minimises the L1 latency and the overall idle time, and allows seamless redistribution of computing power
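The MEP idea from the list above can be sketched as follows; the header layout (first event number plus fragment count, then length-prefixed fragments) is invented for illustration and is not the real LHCb MEP format:

```python
import struct

def pack_mep(first_event_id, fragments):
    """Pack consecutive per-trigger fragments into one Multi Event Packet."""
    header = struct.pack("!IH", first_event_id, len(fragments))
    body = b"".join(struct.pack("!H", len(f)) + f for f in fragments)
    return header + body  # one IP packet instead of len(fragments) packets

def unpack_mep(packet):
    """Recover the first event id and the list of fragments."""
    first_id, n = struct.unpack_from("!IH", packet, 0)
    frags, offset = [], 6
    for _ in range(n):
        (size,) = struct.unpack_from("!H", packet, offset)
        frags.append(packet[offset + 2 : offset + 2 + size])
        offset += 2 + size
    return first_id, frags

mep = pack_mep(1000, [b"evt-a", b"evt-b", b"evt-c"])
print(len(mep), unpack_mep(mep)[1])  # 27 [b'evt-a', b'evt-b', b'evt-c']
```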
L1 Event-Building
■ Upon reception of L0 yes, data are stored in the L1-buffers. VELO, TT, L0DU and Calorimeter Selection Crate pack data into a MEP
■ When a MEP is full, the TFC sends the destination address for this MEP. The boards send the MEP as IP packets
■ The packet is routed through the the readout-network to the Subfarm Controller (SFC)
■ The SFC collects all MEPs, assembles the events and sends individual events to a free CPU
■ The CPU reports back to the SFC with a L1 decision and discards the data
■ The SFC sends the decision to the L1 trigger decision sorter
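The SFC side of the steps above can be sketched as a small event-building loop; the source count, the queue handling and all names are assumptions for illustration:

```python
from collections import defaultdict, deque

N_SOURCES = 4                      # readout boards feeding this SFC (assumed)
pending = defaultdict(dict)        # event_id -> {source: fragment}
free_cpus = deque(["cpu-01", "cpu-02", "cpu-03"])
dispatched = []

def on_fragment(event_id, source, data):
    """Collect one fragment; dispatch the event once it is complete."""
    pending[event_id][source] = data
    if len(pending[event_id]) == N_SOURCES:
        frags = pending.pop(event_id)
        event = b"".join(frags[s] for s in sorted(frags))
        cpu = free_cpus.popleft()          # send to a free CPU
        dispatched.append((event_id, cpu, len(event)))

def on_decision(event_id, cpu, decision):
    """CPU reports back: recycle it and forward the decision to the sorter."""
    free_cpus.append(cpu)
    return (event_id, decision)            # would go to the L1 decision sorter

for src in range(N_SOURCES):
    on_fragment(42, src, b"frag")
print(dispatched)  # [(42, 'cpu-01', 16)]
```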
[Diagram: dataflow through the readout system, all links Gb Ethernet. Front-end electronics feed a multiplexing layer of 29 switches; Level-1 traffic: 126 links, 44 kHz, 5.5 GB/s; HLT traffic: 323 links, 4 kHz, 1.6 GB/s; the readout network delivers 94 links at 7.1 GB/s to 94 SFCs; 62 sub-farm switches (64 links, 88 kHz) connect ~1800 CPUs; the TFC system, L1-decision sorter, TRM and storage system complete the picture]
HLT Event-Building
■ The Timing and Fast Control system receives the L1 trigger decisions from the L1 Decision Sorter and broadcasts them to the readout boards
■ All readout boards send their events to the SFC
■ The SFC collects all fragments, assembles the events and distributes them to a CPU
■ The CPU runs the HLT algorithm and reports back with either a negative decision or the event with reconstructed and raw data
■ The SFC forwards the event to permanent storage
How the system works
[Diagram: walk-through of one event (B → ΦKs) through the full dataflow on the same network as in the previous slides: L0 yes → data into the L1 buffers, L1 trigger decision (L1 yes) → event building in the SFCs, HLT yes → event to storage; front-end electronics, readout network, 94 SFCs, ~1800 CPUs]
Main Readout Network
■ A possible solution for the readout network is based on 24-port Gigabit Ethernet switches
■ Its behaviour is being simulated
■ Fully connected
[Diagram: fully connected readout network with 69 input ports and 85 output ports, built from 24-port switches Sw 1-24, each serving a group of destinations (D 1:8, D 9:16, …, D 79:85)]
Testbed Farm Hardware
● 2 Gigabit Ethernet switches
▪ 2 × (3Com 2824), 2 × 24 ports
● 16 1U rack-mounted PCs
▪ Dual Intel Xeon 2.4 GHz with Hyper-Threading
▪ 2 GB of RAM
▪ 160 GB IDE disk (but the machines operate diskless)
▪ 1 Fast Ethernet and 3 Gigabit Ethernet adapters
▪ 64-bit/133 MHz PCI-X bus
● 1 TB RAID5 disk array with an Adaptec RAID controller and Ultra320 SCSI disks
▪ Can be used to feed the SFC with input data at Gigabit rate to perform simulations of the event processing
Farm Configuration
■ 16 nodes running Red Hat 9, with a 2.6.5 kernel
● 1 gateway, acting as bastion host and NAT to the external network
● 1 service PC, providing network boot services, central syslog, time synchronization, NFS exports, etc.
● 1 diskless SFC, with 3 Gigabit links (2 for data and 1 for control traffic)
● 13 diskless SFNs (26 physical, 52 logical processors with HT) with 2 Gigabit links (1 for data and 1 for control traffic)
■ Root fs mounted on a 150 MB RAM disk (the kernel and the compressed RAM disk image are downloaded from the network at boot time)
● The RAM disk is automatically created by a set of scripts on the admin's demand, whenever a change is performed on a development root fs area on the service PC, and put online for subsequent reboots
■ /usr and /home are mounted via NFS from the service PC
● The NFS mount points can provide access to the online application binaries
Monitoring, configuration and control
■ One critical issue in administering the event filter farm is how to monitor, configure, keep up to date and control each node
■ A stringent requirement on such a control system is that it has to be interfaced to the general DAQ framework
■ PVSS provides a runtime DB, automatic archiving of data to permanent storage, alarm generation, easy realization of graphical panels, and various protocols to communicate over the network
PVSS-DIM integration
■ The DIM network communication layer, already integrated with PVSS, is very suitable for our needs
● It is simple and efficient
● It allows bi-directional communication
■ The idea is to run light agents on the farm nodes that provide information to a PVSS project, which publishes it through GUIs; the agents can also receive arbitrarily complex commands to be executed on the farm nodes, passing back the output
[Diagram: PVSS ↔ DIM client ↔ agent on the farm node]
Monitoring
■ All the quantities relevant to diagnosing hardware or configuration problems should be traced
● CPU fans and temperatures
● Memory occupancy
● RAM disk filesystem occupancy
● CPU load
● Network interface statistics, counters, errors
● TCP/IP stack counters
● Status of relevant processes
● Network switch statistics (via the SNMP-PVSS interface)
● … plus many other things to be learnt by experience
■ Information should be viewable as actual values and/or historical trends
■ Alarms should be issued whenever relevant quantities fall outside the allowed ranges
● PVSS naturally allows this, and can even start feedback procedures
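A minimal sketch of the sample-and-check pattern behind this list, assuming a Linux node; the thresholds are illustrative and the PVSS/DIM publishing step is left out:

```python
import os

THRESHOLDS = {"load1": 50.0, "ramdisk_used_frac": 0.90}  # illustrative limits

def sample():
    """Read a couple of the quantities listed above from the OS."""
    st = os.statvfs("/")                  # root fs (the RAM disk on a node)
    used = 1.0 - st.f_bavail / st.f_blocks
    return {"load1": os.getloadavg()[0], "ramdisk_used_frac": used}

def check(values):
    """Return an alarm string for every value outside its allowed range."""
    return [f"{k}={v:.2f} exceeds {THRESHOLDS[k]}"
            for k, v in values.items() if v > THRESHOLDS[k]]

alarms = check(sample())
print(alarms or "all quantities within range")
```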
Configuration and control
■ The idea is to embed in the framework every common operation usually needed by the sysadmin, to be performed by means of GUIs
■ On the service PC side
● Upgrade of operating systems
▪ RAM disk creation, kernel upgrades, etc.
● Upgrade of application software
▪ Putting online new versions of online programs and utilities, upgrading buggy packages, …
● Automatic setup of configuration files
▪ dhcpd table, NFS exports table, etc.
■ On the farm node side
● Inspection and modification of files
● Broadcast of commands to the entire farm (e.g., reboot)
● Fast logon by means of a shell-like environment embedded inside a PVSS GUI (e.g., commands, stdout and stderr passed back and forth by DIM)
● (Re)start of online processes
● …
Datagram Loss over Gigabit Ethernet
■ An IP datagram loss implies an unpleasant Multi Event Packet loss
■ But we cannot use reliable protocols, since re-transmission of data from the readout trigger boards to the filter farm would introduce unpredictable delays
■ LHCb measured a very good BER of 10⁻¹⁴ on copper cables: according to IEEE 802.3, the BER on a 100 m copper cable is about 10⁻¹⁰
■ We also measured
● the datagram loss in the OS IP stack
● the Ethernet frame loss in the layer-2 switches
■ We got the best system performance in a point-to-point transmission using IP datagrams of 4096 B:
● data flux: 999.90 Mb/s
● datagram loss fraction: 7.1×10⁻¹⁰
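A sketch of this kind of point-to-point loss test: numbered 4096-byte UDP datagrams are sent and the receiver counts sequence-number gaps. It runs over loopback here; in the real measurement the sender and receiver sit at opposite ends of the Gigabit link:

```python
import socket
import threading

N, SIZE = 200, 4096
received = set()

rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rx.bind(("127.0.0.1", 0))
rx.settimeout(1.0)                       # stop when the stream dries up

def receiver():
    try:
        while len(received) < N:
            data, _ = rx.recvfrom(SIZE)
            received.add(int.from_bytes(data[:4], "big"))
    except socket.timeout:
        pass                             # remaining datagrams were lost

t = threading.Thread(target=receiver)
t.start()

tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
payload = bytearray(SIZE)
for seq in range(N):                     # sequence number in the first 4 bytes
    payload[:4] = seq.to_bytes(4, "big")
    tx.sendto(bytes(payload), rx.getsockname())
t.join()
print(f"lost {N - len(received)} of {N} datagrams")
```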
Offline Computing
■ Waiting for 2007, offline computing activities mainly consist of the mass production of Monte Carlo events and MC data analysis
■ LHCb is currently performing the 2004 Data Challenge
● It has just started
● The target is to produce about 180M events in a run period of ~2-3 months, to be used for HLT and physics background studies
● The events will subsequently be analysed on-site where they were produced (provided that the sites keep them on local storage)
● A Computing TDR will be written at the beginning of 2005, based on the results of this year
■ ~20 computing centres (including CERN) in various European countries participate in the Data Challenge
● 2500 processors in total are expected to be used this year (~1600 at the moment)
LHCb DC’04 (I)
■ LHCb adopts two ways to produce MC data:
● LHCb has developed its own production system (DIRAC), without using the LCG Grid
▪ DIRAC uses a pull mechanism to fetch jobs from a central server, with agents running on the Computing Elements of the various production centres
▪ DIRAC agents perform unattended automatic installation of specific software and libraries when needed by a job, submit the jobs, send the output data and logs to the Storage Elements, update the bookkeeping databases and replica file catalogues, and perform job monitoring and accounting of the used resources
● LHCb can also submit jobs through LCG-2
▪ In the LHCb DC'04 a first round of production is being performed by DIRAC; then, after a testing phase, LCG will smoothly grow and replace DIRAC for the second part of the data challenge
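The pull mechanism can be sketched like this; the in-memory queue stands in for DIRAC's central server, and every name here is illustrative, not the real DIRAC API:

```python
import queue

central_server = queue.Queue()          # pending production jobs
for i in range(3):
    central_server.put({"id": i, "app": "Gauss", "events": 500})

def agent_cycle():
    """One cycle of a site agent: pull a job, run it, report the output."""
    try:
        job = central_server.get_nowait()  # the agent asks; nothing is pushed
    except queue.Empty:
        return None                        # no work: sleep and retry later
    # ... install software if missing, run the job, upload output to the SE ...
    return f"job {job['id']}: {job['events']} events -> SE + bookkeeping DB"

while (report := agent_cycle()) is not None:
    print(report)
```

Because the agents pull, a slow or busy Computing Element simply fetches jobs less often; the server never needs to know each site's capacity in advance.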
LHCb DC’04 (II)
■ DST files (the last step of the Monte Carlo production chain, to be used for data analysis) are produced in Bologna at a rate of 150 GB/day
■ After job completion, DST data are stored by DIRAC on
● local disk servers (NFS)
● the Tier-1 Castor tape mass storage (RFIO)
● the CERN mass storage (BBFTP or GridFTP)
■ Data produced at the Tier-1 and stored on Castor are made available for external use by a BBFTP server and a GridFTP server
LHCb DC’04 (III)
■ LHCb Italy is participating in the DC with of the order of 400 processors (200k SPECint) at the INFN Tier-1
● At this moment it is the most important regional centre, with an amount of resources comparable to CERN
[Chart: share of DC’04 production resources, Italy vs. CERN]
Some ideas for high throughput analysis
Classic solution: Network Attached Storage
[Diagram: worker nodes WN 1…m (clients) reach a single NAS through an Ethernet switch; the NAS link is the bottleneck]

A more effective solution: Parallel File System
[Diagram: the same clients reach n I/O nodes (ION 1…n) and a management node (MGR) through the Ethernet switch, spreading the load over many servers]
Parallel Virtual File System (PVFS) Performance
■ Using 12 I/O nodes connected through 100Base-T to 100 clients simultaneously reading data, we measured an aggregate I/O of ~100 MB/s. This can be compared to:
● 20-40 MB/s (local disk)
● 5-10 MB/s (NAS, 100Base-T)
● 20-30 MB/s, very optimistically (NAS, 1000Base-T)
■ We successfully used such a system during 2003 for LHCb massive data analysis in Bologna, with outstanding results
● We plan to work in close collaboration with the Tier-1 staff to set up a testbed to compare various parallel filesystem implementations
■ We believe this is a very promising approach to massive data analysis at the LHC scale
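The gain over a single NAS comes from striping: each I/O node serves only its own stripe, so clients read from all nodes at once and the aggregate bandwidth scales with the node count. A toy sketch, with local temp files standing in for the PVFS I/O nodes:

```python
import concurrent.futures
import os
import tempfile

STRIPES, STRIPE_SIZE = 4, 1024
paths = []
for i in range(STRIPES):                 # write one stripe per "I/O node"
    f = tempfile.NamedTemporaryFile(delete=False)
    f.write(bytes([i]) * STRIPE_SIZE)
    f.close()
    paths.append(f.name)

def read_stripe(path):                   # each node serves only its stripe
    with open(path, "rb") as fh:
        return fh.read()

# A client reads all stripes concurrently and reassembles the file,
# instead of pulling the whole file through one NAS link
with concurrent.futures.ThreadPoolExecutor(STRIPES) as pool:
    data = b"".join(pool.map(read_stripe, paths))

print(len(data))  # 4096
for p in paths:
    os.unlink(p)
```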
Acknowledgments
We want to thank the Computing Staff at INFN Bologna for their support in building the L1&HLT event filter farm.

We want to express our sincere thanks to the Tier-1 Management and Staff at CNAF for their precious efforts in providing a high-quality infrastructure and support, fighting every day with plenty of technical issues connected to the construction and maintenance of such a large computing centre, which is emerging as one of the most important HEP-dedicated European computing centres.