Event Filter on SMP architecture

Diana Scannicchio (D.F.N.T. and I.N.F.N. of Pavia) - T/DAQ Workshop - Beatenberg 6-10 Dec 1999

- Subfarm design
- Tests and results
- Conclusions

Andrea Negri

Giacomo Polesello

Diana Scannicchio

Cristian Stanescu

Valerio Vercesi



Symmetric Multi Processor

The SMP architecture offers clear advantages in data sharing and transfer between the different hardware and software components
- all processors have symmetric access to the main memory and to many other system resources through a very high speed system interconnect (system bus, crossbar switch, ...)

In the development phase of an Event Filter subfarm system one should
- avoid as much as possible interference from critical operating system aspects in the subfarm code implementation itself
- obtain better reliability of both hardware and software components

The EF subfarm has been implemented on a commercial SMP with a proprietary operating system

The technical choice has been an HP SMP server running version 11.0 of the HP-UX operating system, which provides kernel-level POSIX threads and is POSIX 1003.1c compliant (draft 10)

After gaining experience with this implementation, the prototype has also been easily ported to a commodity SMP PC running Linux



POSIX compliance allows easy porting of the code to other operating systems that follow the same standard:
- all the EF code has been written according to POSIX
- the subfarm has already been ported to other environments (Solaris, Tru64-Unix, Linux)

To better exploit the hardware architecture
- all the subfarm components have been implemented within a single multi-threaded process
- every component is assigned a thread scheduled directly by the OS kernel (“1x1” scheduling model: each user thread corresponds to one kernel thread)

One obvious by-product is that load balancing, a critical parameter in subfarm operation, is provided automatically by the OS scheduler

The multi-threaded implementation was chosen because it eases, in particular, communication and synchronisation among the different subfarm components



The subfarm implementation has been tested on

HP K220: 4 CPU, PA-7200 120 MHz, 1+1 MB L2 cache, 512 MB RAM
HP N4000: 8 CPU, PA-8500 440 MHz, 0.5+1 MB L1 cache on chip, 4 GB/s system bandwidth shared across two system buses, 8 GB RAM
HP Exemplar V2500: 20 CPU, PA-8500 440 MHz, 0.5+1 MB L1 cache on chip, 15 GB/s 8x8 crossbar hyperplane, 16 GB RAM
COMPAQ ProLiant 5500: 4 CPU, PII XEON 400 MHz, 512 KB L2 cache, 512 MB RAM



                  K220    N4000   V2500   ProLiant 5500
SPECint95          6.4     34.0    34.0    15.3
SPECfp95           9.1     51.4    51.4    11
SPECint_rate95     228     2403    5300    594
SPECfp_rate95      275     2075

We thank CILEA (Consorzio Interuniversitario Lombardo per l’Elaborazione Automatica, a computer centre located near Milan) for making the N4000 and V2500 servers available to us for the necessary tests



Subfarm Design

[Figure: subfarm data flow. SFI -> Distributor FIFO (physics) and Distributor FIFO (calibration) -> PT and PT cal. threads -> Collector FIFO -> SFO -> Storage. Other labels in the diagram: Event backup, delete event backup, Supervisor.]



The Distributor and the Collector are implemented as FIFOs whose availability (provided by semaphores built on mutexes and condition variables) regulates the flow of events in the subfarm

The SFI, or injector, thread stores the events in the DGB, selects them according to their type (e.g. physics, calibration) and fills the two Distributor FIFOs

The PT threads get the events, process them and fill the Collector FIFO with the filtered ones; the Collector FIFO is eventually emptied by the SFO thread
- the “physics” PT runs the Calorec++ ATLAS EM Calorimeter reconstruction software (developed by C. Meessen)
- the “calibration” PT consumes CPU with mathematical operations; its processing time is set to ~10% of that of the Calorec++ PT

The DGB has to ensure that events are not lost during their passage through the subfarm. It has been implemented as a disk partition on which the events are stored as separate files, which are removed after the event has been rejected by the PT or disposed of by the SFO
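The flow control described on this slide can be illustrated with a minimal sketch. It is written in Python rather than with the POSIX thread library used by the subfarm, and it is not the authors' code: the class `BoundedFifo`, the function `run_pipeline`, the `STOP` marker and the toy `accept` filter are all hypothetical names introduced here. The sketch shows one mutex and two condition variables making a FIFO block when it is full or empty, which is what throttles the SFI, PT and SFO threads.

```python
import threading
from collections import deque


class BoundedFifo:
    """Fixed-capacity FIFO guarded by one mutex and two condition variables."""

    def __init__(self, capacity):
        self._items = deque()
        self._capacity = capacity
        self._mutex = threading.Lock()
        # Both condition variables share the same mutex.
        self._not_full = threading.Condition(self._mutex)
        self._not_empty = threading.Condition(self._mutex)

    def put(self, item):
        with self._not_full:
            while len(self._items) >= self._capacity:
                self._not_full.wait()      # producer blocks while the FIFO is full
            self._items.append(item)
            self._not_empty.notify()

    def get(self):
        with self._not_empty:
            while not self._items:
                self._not_empty.wait()     # consumer blocks while the FIFO is empty
            item = self._items.popleft()
            self._not_full.notify()
            return item


STOP = None  # hypothetical end-of-run marker, not part of the original design


def run_pipeline(events, n_pt=4, capacity=8, accept=lambda ev: ev % 2 == 0):
    """Toy SFI -> Distributor -> PTs -> Collector -> SFO chain; returns stored events."""
    distributor = BoundedFifo(capacity)
    collector = BoundedFifo(capacity)
    stored = []

    def sfi():
        for ev in events:
            distributor.put(ev)
        for _ in range(n_pt):              # one stop marker per PT thread
            distributor.put(STOP)

    def pt():
        while True:
            ev = distributor.get()
            if ev is STOP:
                collector.put(STOP)
                return
            if accept(ev):                 # only filtered events reach the Collector
                collector.put(ev)

    def sfo():
        finished = 0
        while finished < n_pt:
            ev = collector.get()
            if ev is STOP:
                finished += 1
            else:
                stored.append(ev)

    threads = [threading.Thread(target=sfi), threading.Thread(target=sfo)]
    threads += [threading.Thread(target=pt) for _ in range(n_pt)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return stored
```

Calling `run_pipeline(range(100))` pushes 100 integer "events" through four PT threads; the order in which they reach the SFO depends on thread scheduling, so only the set of stored events is deterministic.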



This is the prototype design compliant with Version 3 of the subfarm
- only one process, with every component assigned a thread

In this implementation the control mechanism relies on the POSIX thread library, which provides several system functions for managing the thread associated with each component
- the component statistics are visible to the whole subfarm (global variables)



Error handling and recovery

The error handling has been implemented using all the means provided by the POSIX thread libraries

Some tests to check the error handling have been performed

Simulating a system crash
- the process has been killed
- the events that were still to be processed have been found in the DGB
- the recovery system embedded in the multi-threaded implementation ensures that, when the subfarm restarts, they are processed first, before new events are accepted from the Distributor: no event is lost

Causing errors in the PT threads
- the threads have been killed
- the crashed thread is identified and deleted and a new one is created
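The crash-recovery behaviour can be sketched as follows, assuming only what the slides state: events are kept as separate files and removed once rejected or disposed of. The class `DiskBackup`, the `.evt` file naming and the function `simulate_crash_and_restart` are hypothetical and chosen for illustration; the real DGB layout is not described in the talk.

```python
import os
import tempfile


class DiskBackup:
    """Keeps one file per in-flight event; leftover files are unprocessed events."""

    def __init__(self, directory):
        self.directory = directory
        os.makedirs(directory, exist_ok=True)

    def _path(self, event_id):
        # Hypothetical naming scheme: zero-padded event number plus ".evt".
        return os.path.join(self.directory, f"{event_id:010d}.evt")

    def store(self, event_id, payload):
        with open(self._path(event_id), "wb") as f:
            f.write(payload)

    def remove(self, event_id):
        # Called once the event has been rejected by a PT or written out by the SFO.
        os.remove(self._path(event_id))

    def pending(self):
        """Event ids still on disk, oldest first; these are replayed on restart."""
        names = [n for n in os.listdir(self.directory) if n.endswith(".evt")]
        return sorted(int(n[:-4]) for n in names)


def simulate_crash_and_restart():
    with tempfile.TemporaryDirectory() as directory:
        backup = DiskBackup(directory)
        for event_id in range(5):
            backup.store(event_id, b"x" * 16)
        backup.remove(0)                   # events 0 and 1 completed before the crash
        backup.remove(1)
        # "Crash": all in-memory state is lost, only the files survive.
        restarted = DiskBackup(directory)
        return restarted.pending()         # to be processed before any new event
```

In this sketch a restart only has to list the directory to know which events must be replayed before new ones are accepted.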



Conditions of the Tests

All tests have been performed on different machines and platforms
- 4 CPU HP K220 (PA-7200, 120 MHz) running HP-UX 11
- 8 CPU HP N4000 (PA-8500, 440 MHz) running HP-UX 11
- 20 CPU HP V2500 (PA-8500, 440 MHz) running HP-UX 11
- 4 CPU COMPAQ ProLiant 5500 (PII XEON, 400 MHz) running Linux

We have performed tests using different initial conditions:
- number of CPUs
- number of PTs (“physics” and “calibration”)
- event size
- processing time (looping many times over the reconstruction software to simulate different realistic values)
- use of the DGB

The size of the ATLAS EM Calorimeter MC events (~50 KB) is padded to 250 KB or to 1 MB to simulate the realistic size of an ATLAS event, and to 100 KB to simulate the “calibration” events
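The event padding used in the tests amounts to appending filler bytes until the target size is reached. A minimal sketch follows; the function `pad_event`, the zero fill byte and the choice of 1 KB = 1024 bytes are assumptions, since the talk does not say how the padding was done.

```python
KB = 1024  # assumption: the slides do not say whether KB means 1000 or 1024 bytes


def pad_event(payload: bytes, target_size: int, fill: bytes = b"\0") -> bytes:
    """Return payload extended with fill bytes up to target_size bytes."""
    if len(fill) != 1:
        raise ValueError("fill must be a single byte")
    if len(payload) > target_size:
        raise ValueError("payload is already larger than the target size")
    return payload + fill * (target_size - len(payload))


# A ~50 KB Monte Carlo event padded to the two "physics" sizes and the "calibration" size.
mc_event = b"\x01" * (50 * KB)
physics_small = pad_event(mc_event, 250 * KB)
physics_large = pad_event(mc_event, 1024 * KB)
calibration = pad_event(mc_event, 100 * KB)
```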



Results

The tests aimed to
- measure the performance
- prove the absence of bottlenecks due to software or hardware
- test the scalability

All results show that the software and hardware architectures do not limit the behaviour of the subfarm

The global throughput
- is independent of the number of PTs (from 4 to 400)
- is inversely proportional to the processing time
- is independent of the event size
- scales with the number of CPUs (up to 20 on V2500)

Running the two different types of PTs (“physics” and “calibration”) concurrently does not change the load balancing:
- each PT is balanced with the others of the same type
- the relative composition in the number of PTs does not affect the previous result
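These observations are what one would expect from a purely CPU-bound system. As a hedged illustration, the idealised model below reproduces the same qualitative behaviour; it is an assumption made here, not a fit to the measured data, and the function name `ideal_throughput` is hypothetical. It ignores event-size and DGB costs entirely.

```python
def ideal_throughput(n_cpu: int, n_pt: int, t_proc_s: float) -> float:
    """Events per second for a CPU-bound farm with no other bottleneck.

    At most min(n_cpu, n_pt) events are processed at once, each taking t_proc_s seconds.
    """
    if n_cpu < 1 or n_pt < 1 or t_proc_s <= 0:
        raise ValueError("n_cpu and n_pt must be >= 1 and t_proc_s must be > 0")
    return min(n_cpu, n_pt) / t_proc_s


# Independent of the number of PTs once there is at least one PT per CPU.
same = ideal_throughput(4, 4, 0.25) == ideal_throughput(4, 400, 0.25)

# Inversely proportional to the processing time.
halved = ideal_throughput(4, 8, 0.5) / ideal_throughput(4, 8, 0.25)

# Proportional to the number of CPUs.
doubled = ideal_throughput(20, 40, 0.5) / ideal_throughput(10, 40, 0.5)
```

With the ~220 msec processing time unit quoted for the K220, this model gives an upper bound of about 4 / 0.22 ≈ 18 events/s for one unit of processing on 4 CPUs; the measured values are in the plots, not reproduced here.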



To better simulate realistic, variable processing times, the tests have also been performed with a random number of loops of the reconstruction software (from 0.2 s to 4 s on K220)
- the system scheduler still balances the processing tasks

The use of the DGB reduces the global throughput, as expected, by an amount that depends on the hardware used to implement it (FW SCSI disk, FiberChannel array, ...)

The effect of the DGB becomes negligible with increasing processing time

The tests performed on an Intel-based SMP (PII XEON) running Linux (RedHat 5.3, kernel 2.2.6) give the same results as those obtained with HP, proving the platform independence and opening the way to a low-cost, high-performance implementation

We have also performed a long-term reliability test: the subfarm processed more than 4 million events in 3 days without any problem
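As a rough cross-check of the long-term test, the quoted figures imply a lower bound on the average event rate. This assumes the run lasted the full 72 hours, which the slides do not state precisely:

```latex
\bar{R} \;>\; \frac{4 \times 10^{6}\ \text{events}}{3 \times 86400\ \text{s}} \approx 15.4\ \text{events/s}
```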



[Figure: Throughput vs. number of PTs]



[Figure: Throughput (HP K220)]

One processing time unit is ~220 msec



Scalability

[Figure: series of runs performed while increasing the number of active CPUs in the V2500 under different conditions (event size and DGB)]



Load Balancing

[Figure: 8 “physics” PT and 8 “calibration” PT]



Load Balancing

[Figure: 16 “physics” PT and 4 “calibration” PT]



Load Balancing

[Figure, K220: 8 “physics” PT and 8 “calibration” PT]
[Figure, N4000: 400 “physics” PT]



Load Balancing

[Figure, N4000: 24 PT; other labels in the plot: Main thread, SFI thread, SFO thread]



Throughput and DGB usage

One processing time unit is ~40 msec on N4000 and ~220 msec on K220



[Figure: Throughput (ProLiant 5500)]



Conclusions

The Event Filter subfarm has been successfully implemented on different SMP machines

The POSIX standard ensures easy porting of the code (only recompilation / linking is needed for a different platform)

The software robustness has been checked by proving that the global throughput
- is independent of the number of PTs
- is inversely proportional to the processing time
- is independent of the event size up to 1 MB
- scales almost perfectly with the number of CPUs
- is independent of the platform

The hardware robustness has been proved by testing the thread error recovery and the functionality of the DGB, and by performing long-term reliability tests



Outlook

Complete the implementation of the Version 3 (Object Oriented) design of the subfarm

Perform further studies on error handling and on different communication mechanisms

Since the results obtained show that the SMP implementation of the subfarm is independent of the platform used, we will perform further tests on 4 and 8 CPU Intel-based boards
