RHIC, STAR computing: towards distributed computing on the Open Science Grid
Jérôme LAURET, RHIC/STAR
IWLSC, Kolkata, India, 2006
Outline
- The RHIC program, complex, and experiments
- An overview of the RHIC Computing Facility: expansion model; local resources, remote usage
- Disk storage, a "distributed" paradigm: PHENIX and STAR
- STAR Grid program & tools: SRM / DataMover, SUMS, GridCollector
- Brief overview of the Open Science Grid: STAR on OSG
The Relativistic Heavy Ion Collider (RHIC) complex & experiments
- A world-leading scientific program in heavy-ion and spin physics
- The largest running nuclear physics experiment, located on Long Island, New York, USA
- Flexibility is key to understanding complicated systems:
  - Polarized protons, sqrt(s) = 50-500 GeV
  - Nuclei from d to Au, sqrt(s_NN) = 20-200 GeV
- Physics runs to date:
  - Au+Au @ 20, 62, 130, 200 GeV
  - Polarized p+p @ 62, 200 GeV
  - d+Au @ 200 GeV

RHIC is becoming the world leader in the scientific quest toward understanding how mass and spin combine into a coherent picture of the fundamental building blocks nature uses for atomic nuclei. It is also providing unique insight into how quarks and gluons behaved collectively at the very first moments after our universe was born.
Complementary experiments: discovery and characterization of the QGP
[Figure: the 1.2 km RHIC ring and the locations of its experiments: STAR, PHENIX, PHOBOS, BRAHMS & PP2PP.]
The RHIC Computing Facility (RCF)
- The RCF at BNL is the Tier0 for the RHIC program:
  - Online recording of raw data
  - Production reconstruction of all (most) raw data
  - Facility for data selection (mining) and analysis
  - Long-term archiving and serving of all data
  - ... but not sized for Monte Carlo generation
- Equipment refresh funding (~25% annual replacement):
  - Addresses obsolescence
  - Results in important collateral capacity growth
Tier1, Tier2, … remote facilities
- Remote facilities are the primary source of Monte Carlo data, with significant analysis activity (equal to the Tier0's in the case of STAR). Such sites are operational; the top three:
  - STAR: NERSC/PDSF (LBNL), Wayne State University, São Paulo
  - PHENIX: RIKEN (Japan), Center for High Performance Computing (University of New Mexico), VAMPIRE cluster (Vanderbilt University)
- Grid computing is a promising new direction in remote (distributed) computing; STAR and, to a lesser extent, PHENIX are now active in Grid computing.
Key sub-systems
- Mass Storage System: hierarchical storage management by HPSS; 4 StorageTek robotic tape silos, ~4.5 PBytes; 40 StorageTek 9940b tape drives, ~1.2 GB/sec; change of technology to LTO drives this year
- CPU: racked Intel/Linux dual-processor systems; ~2300 CPUs for ~1800 kSPECint2000; mix of Condor- and LSF-based LRMS
- Central disk: 170 TBytes of RAID 5 storage (other storage solutions: PANASAS, …); 32 Sun/Solaris SMP NFS servers, ~1.3 GByte/sec
- Distributed disk: ~400 TBytes, x2.3 more than centralized storage!
What does it look like?
Not like these… although…
MSS, CPUs, central store: …but like these, or similar (the chairs do not seem more comfortable).
Data recording rates
Run 4 set a first record: both STAR and PHENIX recorded at 120 MBytes/sec.
Comparative DAQ rates
A very good talk by Martin Purschke at CHEP04: "Concepts and technologies used in contemporary DAQ systems".
[Chart: approximate DAQ rates, all in MB/sec, for CDF, LHCb, ALICE, CMS, and ATLAS; the values shown include ~25, ~40, ~100, 150, ~300, and ~1250 MB/sec.]
Heavy-ion experiments are in the >100 MB/sec range, and STAR is moving to x10 capability in the outer years (2008+).
Mid- to long-term computing needs
- Computing projection model:
  - Goal: estimate CPU, disk, mass-storage, and network capacities
  - Model based on raw-data scaling; Moore's law used for cost recession
- Feedback from the experimental groups:
  - Annual meetings; the model is refined if necessary (it has been stable for a while)
  - Estimates based on beam-use plans: these may be offset by experiment or by year, but the integral is kept consistent
  - Maturity factor for the codes, number of reconstruction passes, and a "richness" factor for the data (density of interesting events)
A rough sketch of this projection logic appears below.
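To make the projection logic concrete, here is a minimal sketch in Python. Every number and coefficient below is illustrative only (assumed, not the actual RCF model inputs): the CPU requirement scales with raw events, reconstruction passes, a code-maturity factor, and the "richness" of the data, while Moore's law drives the unit cost down over time.

    def cpu_need_ksi2k(raw_events, passes, maturity, richness, ksi2k_sec_per_event):
        # Total reconstruction work for one year's raw data, spread over a year.
        seconds_per_year = 3.15e7
        total = raw_events * passes * maturity * richness * ksi2k_sec_per_event
        return total / seconds_per_year

    def unit_cost(base_cost_per_ksi2k, years_out, halving_years=1.5):
        # Moore's-law "cost recession": price per kSI2k halves every ~1.5 years.
        return base_cost_per_ksi2k * 0.5 ** (years_out / halving_years)

    # Illustrative inputs: 2e9 events, 2 passes, a 20% maturity overhead,
    # richness 1.0, 10 kSI2k.sec per event -> roughly 1500 kSI2k.
    need = cpu_need_ksi2k(2e9, 2, 1.2, 1.0, 10.0)
    print(round(need), "kSI2k; cost in 3 years ~", round(need * unit_cost(100.0, 3)))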
Projected needs

                               2005   2006   2007   2008   2009   2010   2011   2012
    CPU (kSI2k)                1800   2000   3700   8000  17000  27500  37700  62200
    Distributed Disk (TBytes)   350    440    680   2000   3500   5600   7300   8100
    Central Disk (TBytes)       170    170    210    340    610    930   1200   1400
    Disk (GBytes/sec)            25     24     32     53     74     84     82     85
    Tape (PBytes)                 5      5      8     13     20     30     43     55
    Tape (MBytes/sec)          1400   1900   2800   3300   5200   6700   8000   8600

[Chart: tape volume (TB) per year, FY05-FY12, split between STAR and PHENIX.]
[Chart: RCF capacity profile: disk storage (TB) by running year, 2004-2012, central vs. distributed disk.]
Discussion of the model
- The data volume is accurate to within ~20%, i.e., the model was adjusted to the 20% lower end; the upper end has a larger impact in the outer years. DAQ1000 for STAR, enabling billion-event capabilities, is a major (cost) factor driven by physics demand.
- Cost will be beyond current provision: tough years start as soon as 2008; it gets better in the outer years (Moore's law catches up); uncertainties grow with time, however.
- Cost versus Moore's law implies "aggressive" technology upgrades (HPSS, for example) and a strategy heavily based on low-cost distributed disk (cheap, CE-attached).
Disk storage: a "distributed" paradigm
- The ratio is striking: x2.3 now, moving to x6 in the outer years (see the RCF capacity profile chart above). This requires an SE strategy.
- CPU shortfall: Tier1 use (PHENIX, STAR); Tier2 user analysis and data on demand (STAR).
PHENIX: dCache model
[Diagram: the Tier0 (BNL) and the Tier1 (CC-J at RIKEN) each run dCache (an admin node plus worker nodes), linked by GridFTP; on the Tier0 side, the MSS is HPSS plus the central stores.]
The Tier0-Tier1 model provides scalability for centralized storage and a smooth(er) distributed disk model.
PHENIX: data transfer to RIKEN
Network transfer rates of 700-750 Mbits/sec could be achieved (i.e., ~90 MBytes/sec).
STAR: SRM, GridCollector, Xrootd
- A different approach: a large (early) pool of distributed disks and early adoption of the dd (distributed-disk) model. The dd model was too home-grown and did not scale well when mixing dd and central disks.
- Tier0-TierX (X = 1 or 2) model: we need something easy to deploy and easy to maintain.
- Leveraging our SRM experience: data on demand; embryonic event-level access (GridCollector); Xrootd could benefit from an SRM back-end.
[Diagram: the Xrootd/olbd hierarchy: a manager (head node), supervisors (intermediate nodes), and data servers (leaf nodes), each node running xrootd and olbd.]
STAR dd evolution: from this…
Where does this data go?
[Diagram: control nodes run the DataCarousel; a client script adds restore records; files land via pftp on the local disks; FileCatalog management updates file locations, marks files {un-}available, and spiders the disks to update.]
VERY HOMEMADE, VERY "STATIC".
STAR dd evolution: …to that…
- The entire cataloguing layer is gone, and so is the layer restoring from MSS to dd; files still land via pftp on local disk: DATA ON DEMAND.
- XROOTD provides load balancing, possibly scalability, and a way to avoid LFN/PFN translation...
- But it does NOT fit within our SRM-invested directions...
- AND IS IT REALLY SUFFICIENT!?
Coordination of requests needed
- Un-coordinated requests to the MSS are a disaster. This applies to ANY SE-related tool, and it gets worse if the environment combines technologies (shared infrastructure).
- The effect on performance is drastic. A sketch of the kind of coordination needed follows.
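A minimal sketch, in Python, of the coordination a DataCarousel-style layer provides: group pending restore requests by tape cartridge and read each cartridge sequentially, instead of letting every client trigger its own mounts. The data structures are hypothetical, not the actual STAR implementation.

    from collections import defaultdict

    def order_requests(requests):
        """requests: list of (file, tape_id, offset) from many clients.
        Returns a staging plan with one mount per tape and sequential
        reads within it, avoiding the mount/dismount thrashing that
        un-coordinated access produces."""
        by_tape = defaultdict(list)
        for f, tape, offset in requests:
            by_tape[tape].append((offset, f))
        plan = []
        for tape in sorted(by_tape):
            plan += [f for _, f in sorted(by_tape[tape])]
        return plan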
STAR Grid program: motivation
- Tier0 production: ALL event files are copied to HPSS at the end of a production job; data reduction goes from DAQ to Event to micro-DST (MuDST).
- All MuDST are on "disks": one copy temporarily on centralized storage (NFS), one permanently in HPSS. A script checks consistency (job status, presence of files in one and the other); if the "sanity" checks pass (integrity/checksum), the files are registered in the Catalog (sketched below).
- Re-distribution: once registered, MuDST may be "distributed" to distributed disk on Tier0 sites, to Tier1 (LBNL), and to Tier2 sites ("private" resources for now).
- SRM has been in use since 2003...
- The strategy implies IMMEDIATE dataset replication: it allows balancing of analysis between Tier0 and Tier1, and data on demand enables Tier2 sites with capabilities.
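In outline, the sanity-check-then-register step might look like the Python below. The helper arguments (the HPSS presence flag, the catalog object and its register() method) are hypothetical stand-ins; the real STAR scripts differ.

    import hashlib, os

    def register_if_sane(nfs_path, in_hpss, expected_md5, catalog):
        # Register a produced MuDST only if both copies exist and the
        # disk copy's checksum matches the production record.
        if not (in_hpss and os.path.exists(nfs_path)):
            return False                      # a copy is missing: do not register
        with open(nfs_path, "rb") as f:
            if hashlib.md5(f.read()).hexdigest() != expected_md5:
                return False                  # integrity check failed
        catalog.register(nfs_path)            # now eligible for re-distribution
        return True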
Needed for immediate exploitation of resources
- Short/medium-term strategy: distribute the data, and take advantage of its static placement (schedulers, workflow, …).
- Advanced strategies: data on demand (planner, dataset balancing, data placement, …); selection of sub-sets of data (datasets of datasets, …); a consistent strategy (interoperability? publishing?).
- Less naïve considerations: job tracking, packaging, automatic error recovery, help desk, networking, advanced workflow, …
- The tools: SRM / DataMover; the STAR Unified Meta-Scheduler; Xrootd, …; GridCollector, whose SRM back-ends would enable Xrootd with objects on demand. We will leverage existing middleware where it exists, or address the needs one by one…
SRM / DataMover
SRMs are middleware components whose function is to provide dynamic space allocation and file management of shared storage components on the Grid.
[Diagram: users and applications go through Grid middleware to a uniform SRM interface sitting in front of heterogeneous storage back-ends: dCache, Castor, Enstore, JASMine, Unix-based disks, and the CCLRC RAL SE.]
http://osg-docdb.opensciencegrid.org/0002/000299/001/GSM-WG-GGF15-SRM.ppt
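The value of that picture is the uniform interface; a minimal sketch in Python (the method names are illustrative, not the actual GGF SRM specification):

    from abc import ABC, abstractmethod

    class SRM(ABC):
        """One interface, many back-ends: dCache, Castor, Enstore,
        JASMine, plain Unix disks, ..."""
        @abstractmethod
        def reserve_space(self, n_bytes, lifetime_s): ...
        @abstractmethod
        def get(self, lfn): ...               # stage the file, return a transfer URL
        @abstractmethod
        def put(self, source_url, lfn): ...   # accept a file into managed space

    def replicate(src, dst, lfn):
        # Client code never needs to know which storage sits behind either end.
        dst.put(src.get(lfn), lfn)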
SRM / DataMover
- A layer on top of SRM, in use for BNL-LBNL data transfers for years: all MuDST are moved to Tier1 this way.
- Extremely reliable: "Set it, and forget it!" Several tens of thousands of files transferred, multiple TB over days, with no losses.
- The project was (IS) extremely useful and in production usage in STAR: data are available at the remote site as they are produced. We need this NOW: faster analysis means better science, sooner, plus data safety.
- Caveat/addition in STAR: RRS (Replica Registration Service). 250k files and 25 TB transferred AND catalogued; 100% reliability; project deliverables on time.
SRM / DataMover: flow diagram
[Diagram: the DataMover (command-line interface) gets the list of files from a directory and issues one SRM-COPY for thousands of files. The HRM at BNL performs the reads, staging each file from HPSS into its disk cache; SRM-GET is issued one file at a time; a GridFTP GET (pull mode) carries the network transfer into the disk cache of the HRM at LBNL, which performs the writes and archives the files. RRS reads the BNL File Catalog and writes the new replicas into the LBNL File Catalog (MySQL, with a mirror at each end).] A pseudocode rendering follows.
NEW: being deployed at Wayne State University and São Paulo. The DRM is used in the data-analysis scenario as a lightweight SE service (deployable on the fly): all the benefits of SRM (advance reservation, …); if we know there IS storage space, we can take it; no heavy-duty SE deployment.
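Reduced to pseudocode, the pull-mode flow above is a single coordinated loop. All names here are hypothetical stand-ins for the SRM-COPY / SRM-GET / GridFTP / RRS calls in the diagram, not a real API:

    def datamover_pull(directory, source_hrm, target_hrm, rrs):
        # One srm-copy of a directory = many coordinated single-file transfers.
        for f in source_hrm.list(directory):   # get the list of files from the directory
            source_hrm.stage(f)                # srm-get: HPSS -> source disk cache
            target_hrm.gridftp_get(f)          # GridFTP GET (pull mode) into the target cache
            target_hrm.archive(f)              # migrate into the target MSS
            rrs.register(f)                    # RRS catalogues the new replica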
CE/SE decoupling
- srm-copy runs from the execution-site DRM back to the submission site: the submission site's DRM is called from the worker node (WN) at the execution site.
- Requires an outgoing, but not an incoming, connection on the WN.
- The srm-copy callback is disabled (asynchronous transfer), so the batch slot is released immediately after the srm-copy call; a sketch of the WN side follows below.
- The final destination of the files is HPSS or disk, owned by the user.
[Diagram: at the job-execution site, clients write to /scratch and the local DRM cache; files then flow asynchronously to the DRM cache at the submission site.]
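The worker-node side of this decoupling is essentially fire-and-forget. A sketch under stated assumptions: the srm-copy flag shown is schematic (the slide only says the callback is disabled), not the exact DRM client syntax.

    import subprocess, sys

    def ship_results_and_exit(files, submission_drm_url):
        # Start asynchronous srm-copy transfers back to the submission
        # site's DRM, then release the batch slot immediately. Only an
        # outgoing connection from the WN is required.
        for f in files:
            subprocess.Popen(["srm-copy", "--no-callback",   # schematic flag
                              "file://" + f, submission_drm_url])
        sys.exit(0)   # slot freed; the transfers complete in the background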
SUMS: the STAR Unified Meta-Scheduler
- Gateway to user batch-mode analysis: the user writes an abstract job description; the scheduler submits where the files are, where the CPU is, ...; it collects usage statistics; users DO NOT need to know about the RMS layer.
- Dispatcher and policy engines: dataset-driven, with a full catalog implementation, and Grid-aware; throttles IO resources, avoids contention, optimizes on CPU.

Job description, test.xml (note that no data locations are specified):

<?xml version="1.0" encoding="utf-8" ?>
<job maxFilesPerProcess="500">
  <command>root4star -q -b rootMacros/numberOfEventsList.C\(\"$FILELIST\"\)</command>
  <stdout URL="file:/star/u/xxx/scheduler/out/$JOBID.out" />
  <input URL="catalog:star.bnl.gov?production=P02gd,filetype=daq_reco_mudst" preferStorage="local" nFiles="all"/>
  <output fromScratch="*.root" toURL="file:/star/u/xxx/scheduler/out/" />
</job>

[Diagram: query/wildcard resolution turns the catalog query into per-process file lists and scripts, e.g. sched1043250413862_0.list / .csh, _1, _2, ..., each listing files such as /star/data09/reco/productionCentral/FullFie...]
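For usage: if memory serves, a description like test.xml is handed to SUMS with a single command (treat the exact command name as an assumption); SUMS then resolves the catalog query, splits the matching files into the sched*.list files shown above, and fills in $FILELIST and $JOBID for each sub-job before dispatching to the local RMS or the Grid.

    star-submit test.xml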
SUMS: the STAR Unified Meta-Scheduler (effect of IO throttling)
[Charts: farm IO before and after SUMS throttling. BEFORE: very choppy, as NFS contention would impact computational performance. AFTER: smoother, modulo remaining farm instability.]
SUMS: the next generation
New features:
- RDL in addition to the U-JDL.
- Testing of grid submission is OVER: SUMS is ready for production and user analysis. The lightweight SRM is helping tremendously; a scalability test is still needed.
- Aware of multiple packaging methods (from ZIP archives to PACMAN); tested for simple analysis; mixing archiving technologies still needs finalizing (a detail).
- Versatile configuration: a site can "plug and play"; possibility of multi-VO support within ONE install.
Scalability matters: we already run multi-10k jobs/day, with spikes at 100k (valid) jobs from nervous users…
GridCollector: using an event catalog to speed up user analysis in a distributed environment
- STAR event catalog: based on the TAGS produced at reconstruction time; rests on the now well-tested and robust SRMs (DRM+HRM) deployed in STAR anyhow; immediate access and a managed SE; files are moved transparently, by delegation to the SRM service, BEHIND THE SCENES; easier to maintain, and the prospects are enormous.
- "Smart" IO-related improvements and home-made formats are (a priori) no faster than using GridCollector: physicists could get back to physics, and STAR technical personnel would be better off supporting GC.
- It is a WORKING prototype of a Grid interactive analysis framework, and VERY POWERFUL: event-"server" based (no longer file based).
[Plot: speedup vs. selectivity (0.01 to 1), for both elapsed time and CPU; the gain is ALWAYS > 1, regardless of selectivity.]
root4star -b -q doEvents.C'(25,"select MuDst where Production=P04ie \
    and trgSetupName=production62GeV and magScale=ReversedFullField \
    and chargedMultiplicity>3300 and NV0>200", "gc,dbon")'
GridCollector: the next step
- Functionality can be pushed "down": index bitmap technology in the ROOT framework; make "a" coordinator "aware" of events (i.e., of objects). Xrootd is a good candidate, the ROOT framework is preferred; either would serve as a demonstrator (immediate benefit to a few experiments…). A toy illustration of the bitmap-index idea follows this list.
- Object-on-Demand, from files to object management: a Science Application Partnership (SAP) under SciDAC-II, in the OSG program of work as a leveraged technology to achieve its goals.
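To make the bitmap-index idea concrete, here is a toy equality-encoded bitmap index in Python. GridCollector itself rests on compressed bitmap (FastBit) technology; this sketch ignores compression and is purely illustrative.

    def build_bitmaps(values):
        # One bit-vector per distinct attribute value: bit i is set
        # iff event i carries that value.
        bitmaps = {}
        for i, v in enumerate(values):
            bitmaps[v] = bitmaps.get(v, 0) | (1 << i)
        return bitmaps

    def select(bitmaps, predicate):
        # OR together the bitmaps of all matching values; AND-ing several
        # such masks answers multi-attribute queries (like the
        # chargedMultiplicity/NV0 selection above) without reading events.
        mask = 0
        for v, bm in bitmaps.items():
            if predicate(v):
                mask |= bm
        return mask

    mult = [1200, 3400, 3350, 900]
    mask = select(build_bitmaps(mult), lambda m: m > 3300)
    print([i for i in range(len(mult)) if mask >> i & 1])   # -> [1, 2]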
The Open Science Grid
- In the US, the Grid effort is moving to the Open Science Grid: an interesting adventure comparable to similar European efforts, with EGEE interoperability at its heart.
- The character of OSG:
  - Distributed ownership of resources: local facility policies, priorities, and capabilities need to be supported.
  - A mix of agreed-upon performance expectations and opportunistic resource use.
  - Infrastructure deployment based on the Virtual Data Toolkit.
  - Will incrementally scale the infrastructure, with milestones, to support stable running of a mix of increasingly complex jobs and data management.
  - Peer collaboration of computer and application scientists, facility, technology, and resource providers: an "end-to-end approach".
  - Support for many VOs, from the large (thousands of members) to the very small and dynamic (down to the single researcher and the high-school class).
  - A loosely coupled, consistent infrastructure: a "Grid of Grids".
STAR and the OSG
- STAR could not run on Grid3: it was running at PDSF, a Grid3 site set up in collaboration with our resources.
- STAR on OSG is a big improvement: OSG is for open science, not as strongly an LHC-only focus; expanding to other sciences means a revisit of needs and requirements; more resources; greater stability.
- Currently: we run MC on a regular basis (nightly tests, standard MC); recently focused on user analysis (lightweight SRM); helped other sites deploy the OSG stack.
- And it shows… the FIRST functional site in Brazil, the Universidade de São Paulo, a STAR institution…
  http://www.interactions.org/sgtw/2005/0727/star_saopaulo_more.html
Summary
- The RHIC computing facility provides adequate resources in the short term, but the model is imperfect for long-term projections: problematic years start in 2008, driven by high data throughput and physics demands. This mid-term issue will impact Tier1 sites as well, assuming a refresh and planning along the same model. Out-sourcing?
- Under data "stress" and increasing complexity, the RHIC experiments have integrated, at one level or another, distributed computing principles: data distribution and management; job scheduling, selectivity, …
- STAR intends to take full advantage of the OSG, help bring more institutions into the OSG, and address the issue of batch-oriented user analysis (opportunistic use, …).