EGO Computing Center site report

EGO - Via E. Amaldi 56021 S. Stefano a Macerata - Cascina (PI)

Stefano Cortese

INFN Computing Workshop – 26-05-2004



Virgo-EGO computing areas

Areas: Interferometer Real Time Domain; Monitoring and Control; On-line Processing / In-Time Domain; Off-line Computing; Office Services; Internet Services; Software integration testing, archiving and installation.

Equipment:

• Alpha/OSF workstations: 15 control, 10 processing, 2 servers

• x86/Linux nodes: 12 processing, 4 servers, 4 control

• Linux farm: 16 nodes (54 Gflop), 2 servers

• 13 storage nodes

• 25 Linux PCs

• 60 LynxOS CPUs

• 10 OS9 CPUs

• Office services: 50 Windows PCs

• 150 users

Data flows and capacities:

• DAQ: 6 MB/s

• On-line buffers: 5 TB

• On-line analysis: 5 Gflops (target: 300)

• Tape backup: 6 TB

• Disk storage: 70 TB (> 6 MB/s)

• Users computing and data access; Bologna and Lyon repositories reached over a 34 Mbps link (target: 155 Mbps)


Virgo-EGO LANs

• Firewall: CheckPoint over Nokia, with WAN link (34 Mbps) and DMZ

• Virgo Interferometer network (> 50 switches), extending along the two 3 km arms

• Data Analysis network (7 switches)

• General Windows PCs network (30 switches) for the offices

• UPS and generators


Storage

On-line buffer and backup:

• HP/Compaq MA8000 (Fibre Channel-to-SCSI): 4 TB, 1 week of buffering

• Accusys (SCSI-to-SCSI): 1 TB

• Data rate: 6 MB/s

• Backup via Legato to an HP LTO Ultrium-1 tape library: 6 TB near-line

• Cataloguing and MD5sum of Virgo data

• Redundant stream

• Migration to mass storage
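A quick consistency check on the "1 week of buffering" figure: dividing buffer capacity by the incoming rate gives the retention time. This sketch assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and a constant 6 MB/s stream; `buffer_days` is a hypothetical helper, not part of any EGO tool.

```python
SECONDS_PER_DAY = 86_400
TB = 10**12  # decimal units assumed
MB = 10**6


def buffer_days(capacity_bytes: int, rate_bytes_per_s: int) -> float:
    """Days of data a buffer can hold when written at a constant rate."""
    return capacity_bytes / rate_bytes_per_s / SECONDS_PER_DAY


# 4 TB MA8000 buffer fed by the 6 MB/s data stream
print(f"{buffer_days(4 * TB, 6 * MB):.1f} days")  # 7.7 days
# All 5 TB of on-line buffers at the same rate
print(f"{buffer_days(5 * TB, 6 * MB):.1f} days")  # 9.6 days
```

So 4 TB at 6 MB/s holds a little over a week of data, in line with the slide.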

Storage farm: 13 nodes with 70 TB of net RAID5 space

• 25 Infortrend FC-to-IDE arrays

• 4 Fibrenetix Zero-d SCSI-to-IDE arrays

• some 3ware controllers

• mainly 250 GB / 7200 rpm WD disks

Everything runs Linux RH9 with LVM 1.0.x.
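In a RAID5 set, one disk's worth of capacity holds parity, so net space is (n - 1) × disk size. The helper below is a sketch with a hypothetical name. The 5-disk example layout is also hypothetical; it was chosen because 5 × 250 GB disks give exactly the 1 TB net set size used in the performance tests. A RAID5 set survives only one failure, so a second bad block or disk failure in the same set loses data.

```python
GB = 10**9  # decimal units assumed


def raid5_net_bytes(disks_per_set: int, disk_bytes: int) -> int:
    """Usable capacity of one RAID5 set: one disk's worth of space holds parity."""
    if disks_per_set < 3:
        raise ValueError("RAID5 needs at least 3 disks")
    return (disks_per_set - 1) * disk_bytes


# Hypothetical set of 5 x 250 GB disks -> 1000 GB (1 TB) net
print(raid5_net_bytes(5, 250 * GB) // GB, "GB net")
```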


Storage practices with IDE-based arrays

Performance is good: 50-60 MB/s over a 1 TB RAID5 set (single array, 5400/7200 rpm disks).

The quality of the first releases of these products is very poor. The main causes are firmware bugs or hardware tolerances that ultimately lead to hidden data corruption, meaning corruption undetected by the storage controller or the operating system.

We developed a procedure for storage acceptance:

• minimal performance requirements, set according to a market survey and demo testing

• tenders are split into 2 lots: the first is used for the acceptance test, and the second is accepted only after the first lot has been validated

Acceptance test:

• performance tested with “bonnie”

• data integrity checked with a continuous benchmark that reads, writes and deletes data, with 128-bit MD5 verification after each operation
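The integrity benchmark can be sketched as a loop of write / read-back / delete cycles, each checked against an MD5 digest taken before the write. This is a minimal illustration under stated assumptions, not the EGO benchmark itself. The function names are hypothetical. The read-back may be served from the operating system's page cache rather than the array, so a real acceptance run would need to defeat caching, for example by cycling through data sets larger than the server's RAM.

```python
import hashlib
import os
import tempfile


def md5_of_file(path: str, chunk: int = 1 << 20) -> str:
    """128-bit MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            digest.update(block)
    return digest.hexdigest()


def integrity_cycle(directory: str, size: int) -> bool:
    """One write / read-back / delete cycle, verified after each step."""
    payload = os.urandom(size)
    expected = hashlib.md5(payload).hexdigest()
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
            f.flush()              # flush Python's buffer
            os.fsync(f.fileno())   # ask the OS to write the data to the device
        # NOTE: this read may be served from the page cache, not the array
        if md5_of_file(path) != expected:
            return False
    finally:
        os.remove(path)
    return not os.path.exists(path)  # the delete must really have happened


def run_benchmark(directory: str, size: int, cycles: int) -> int:
    """Run a fixed number of cycles and return the number of failures."""
    return sum(not integrity_cycle(directory, size) for _ in range(cycles))
```

Run continuously over the whole array, any nonzero failure count would reject the configuration.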


Storage practices: data integrity

Data integrity test:

• The data integrity test ends after processing about 30 TB (about 10 days), giving confidence that the BER is below 3×10^-14

• In our experience, errors may appear even after 1 week of processing, and we rejected many configurations

The test must be repeated at each new firmware release, even if the release only introduces new features.

All this is not enough:

• Some firmware functions may only be executed after the system has been running for a long time. This is the case for block remapping after bad blocks occur on the disks, which could only be tested using genuinely bad disks

Therefore:

• The storage must be periodically monitored for data integrity

• The firmware must provide on-line, low-level media verification, executed periodically to avoid double bad-block or bad-block + disk-failure cases
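The BER figure follows from observing no errors over about 30 TB. Here is a minimal sketch of the arithmetic, assuming 1 TB = 10^12 bytes and counting bits. The naive bound is 1/N. The "rule of three" gives an approximate 95% upper bound of 3/N when zero errors are seen. Both are below the 3×10^-14 quoted above. `ber_upper_bound` is a hypothetical helper.

```python
import math

TB = 10**12  # decimal units assumed


def ber_upper_bound(bits_checked: int, confidence: float = 0.95) -> float:
    """Upper bound on the bit error rate after observing zero errors.

    P(no errors in N bits) = (1 - p)**N ~ exp(-p * N). Requiring this to be
    at least 1 - confidence gives p <= -ln(1 - confidence) / N, i.e. about
    3 / N at 95% confidence (the "rule of three").
    """
    return -math.log(1.0 - confidence) / bits_checked


bits = 30 * TB * 8
print(f"naive 1/N bound: {1 / bits:.2e}")                 # 4.17e-15
print(f"95% bound:       {ber_upper_bound(bits):.2e}")    # 1.25e-14
print(f"sustained rate:  {30 * TB / (10 * 86_400) / 1e6:.1f} MB/s")  # 34.7 MB/s
```

The last line shows the test must sustain roughly 35 MB/s to process 30 TB in 10 days.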


Storage Conclusions

IDE-based storage systems at 5000 €/TB are good for mass storage. Compared with near-line disk-cache/tape systems, they offer fast access and high density, but availability is not guaranteed at all times.

They do not offer the same reliability for critical tasks as more expensive disk-based storage, so duplication or tape backup is still needed.

Direct-attached arrays are preferable to NAS storage because they allow tests to run independently of the network. We also prefer arrays connected via standard buses (e.g. SCSI or FC) over "on-server" controllers, to avoid intermixing OS, driver and array problems.

LVM and an automounter are required for mounting and serving about 100 file systems. We currently use amd and plan to move to autofs on Linux.
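With about 100 file systems spread across the storage nodes, automounter maps are easier to generate than to write by hand. This is a sketch, assuming each file system is exported over NFS as host:/path. The host names, export paths, volume counts and mount options below are all hypothetical. The output follows the autofs indirect-map format `key [-options] host:/path`.

```python
def autofs_indirect_map(exports, options="-ro,hard,intr"):
    """Render (key, host, path) tuples as autofs indirect-map lines."""
    lines = [f"{key}\t{options}\t{host}:{path}" for key, host, path in sorted(exports)]
    return "\n".join(lines) + "\n"


# Hypothetical layout: 13 storage nodes exporting 8 volumes each (104 file systems)
exports = [
    (f"node{n:02d}_vol{v}", f"storage{n:02d}", f"/export/vol{v}")
    for n in range(1, 14)
    for v in range(1, 9)
]
map_text = autofs_indirect_map(exports)
print(map_text, end="")
```

You could save this output as, for example, /etc/auto.data and reference it from auto.master with a line such as `/data /etc/auto.data --timeout=300`. Each volume would then be mounted on demand under /data/<key>.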


On-line Computing

Virgo detection channels are extracted from the raw data and processed to obtain the h-reconstructed signal, in which the gravitational signal must be found.

• 8 bi-processor Intel Xeon 2.66 GHz nodes

• 2 bi-processor Intel Xeon 2.0 GHz nodes

The h-reconstructed signal (16-600 KB/s) is fed to the computing farms for on-line search.

Small Scale Test System (2002):

• 16 bi-processor Compaq W6000, 1.7 GHz, PC800 RDRAM

• 2 front-ends

• 2 standard Gigabit Ethernet LANs (inter-node and storage)
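Naive sizing by CPU power multiplies node count, CPUs per node, clock rate and floating-point operations per cycle. Under one assumption (ours, not stated on the slide) of one operation per cycle per CPU, the Small Scale Test System comes out at 54.4 Gflop. That would match the 54 Gflop quoted for the 16-node Linux farm in the computing-areas overview. This kind of estimate ignores memory, which matters for the matched-filter search. `peak_gflops` is a hypothetical helper.

```python
def peak_gflops(nodes: int, cpus_per_node: int, ghz: float, flops_per_cycle: int) -> float:
    """Theoretical peak: nodes x CPUs x clock (GHz) x floating-point ops per cycle."""
    return nodes * cpus_per_node * ghz * flops_per_cycle


# Small Scale Test System, assuming 1 flop per cycle per CPU (our assumption)
print(f"{peak_gflops(16, 2, 1.7, 1):.1f} Gflop")  # 54.4 Gflop
```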


On-line Computing:

Virgo estimated that "in-time" detection of coalescing binaries requires a 300 GFlop system.

A flat search with matched filtering via FFT, using templates of various lengths, depends strongly on the amount of RAM available for storing the templates. Naive sizing by CPU power alone is therefore not enough.

A Virgo/EGO benchmarking workgroup has been working since the beginning of the year on more precise specifications. The Perugia group provided the benchmark, and EGO performed the tests.
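Matched filtering via FFT correlates the data stream with each template in the frequency domain. The sketch below shows the core step for a single template in pure Python, using a radix-2 FFT so it needs no third-party packages. It assumes the data are already whitened, so it omits the noise power-spectrum weighting a real search applies. It also assumes the data length is a power of two. The function names are ours, not those of the Perugia benchmark. Every data segment is correlated against every stored template, which is why the template bank has to stay resident in RAM.

```python
import cmath
import math
import random


def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def ifft(spectrum):
    """Inverse FFT via the identity ifft(X) = conj(fft(conj(X))) / n."""
    n = len(spectrum)
    return [v.conjugate() / n for v in fft([s.conjugate() for s in spectrum])]


def matched_filter(data, template):
    """Circular cross-correlation of data with a unit-energy template.

    Entry k is the filter output for the template starting at sample k.
    """
    n = len(data)
    if n & (n - 1) or len(template) > n:
        raise ValueError("data length must be a power of two >= template length")
    norm = math.sqrt(sum(t * t for t in template))
    padded = [t / norm for t in template] + [0.0] * (n - len(template))
    data_f = fft(data)
    tmpl_f = fft(padded)
    # Multiplying by the conjugate template spectrum = correlation in time
    return [c.real for c in ifft([d * t.conjugate() for d, t in zip(data_f, tmpl_f)])]


# Usage: inject a chirp-like template at sample 300 into weak white noise
random.seed(0)
template = [math.sin(2 * math.pi * (0.01 * t + 0.0002 * t * t)) for t in range(256)]
data = [0.1 * random.gauss(0, 1) for _ in range(1024)]
for i, v in enumerate(template):
    data[300 + i] += v
out = matched_filter(data, template)
peak = max(range(len(out)), key=out.__getitem__)
print(peak)
```

A full search repeats this for every template in the bank, which is where the memory and throughput figures on the next slide come from.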


Overall problem

Opteron shows the best speedup for SIMD problems where data are partitioned among processors: up to 60 MB/s of template floats processed per CPU in the Virgo benchmark.

The maximum RAM supported by the platform affects the number of CPUs needed.

The overall Virgo problem is a space of 200,000 templates (1.6 TB of RAM) to be processed in 256 s. It would require about 200 Opterons with 8 GB/CPU, or 130 Itaniums with 12 GB/CPU.

Opteron has higher performance per rack unit.

The current tender is for 64 CPUs.
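The CPU counts above are memory-bound. 1.6 TB / 8 GB gives 200 Opterons. 1.6 TB / 12 GB is about 133, so 134 Itaniums when rounding up (the slide's "about 130"). For comparison, streaming the whole bank once per 256 s window at 60 MB/s per CPU would need only 105 CPUs. This sketch reflects our reading of how the numbers fit together, not the workgroup's model. It assumes decimal units and that each template byte is processed once per window. No Itanium throughput is given, so only its RAM bound is computed. The helper names are hypothetical.

```python
GB = 10**9  # decimal units assumed
MB = 10**6


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division."""
    return -(-a // b)


def cpus_for_ram(bank_bytes: int, ram_per_cpu: int) -> int:
    """CPUs needed just to keep the whole template bank resident in RAM."""
    return ceil_div(bank_bytes, ram_per_cpu)


def cpus_for_throughput(bank_bytes: int, deadline_s: int, bytes_per_s_per_cpu: int) -> int:
    """CPUs needed to stream the whole bank once within the deadline."""
    return ceil_div(bank_bytes, deadline_s * bytes_per_s_per_cpu)


bank = 1_600 * GB  # 200,000 templates -> 8 MB (2M single-precision floats) each
opteron = max(cpus_for_ram(bank, 8 * GB), cpus_for_throughput(bank, 256, 60 * MB))
itanium_ram = cpus_for_ram(bank, 12 * GB)
print(opteron, itanium_ram)  # 200 134
```

The RAM bound (200) exceeds the throughput bound (105), which illustrates why naive sizing by CPU power alone is not enough.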