RHIC, STAR computing: towards distributed computing on the Open Science Grid
Jérôme LAURET, RHIC/STAR
IWLSC, Kolkata, India, 2006
Outline
- The RHIC program, complex, and experiments
- An overview of the RHIC Computing Facility: expansion model; local resources, remote usage
- Disk storage, a "distributed" paradigm: PHENIX and STAR
- STAR Grid program & tools: SRM / DataMover, SUMS, GridCollector
- Brief overview of the Open Science Grid: STAR on OSG
The Relativistic Heavy Ion Collider (RHIC) complex & experiments
- A world-leading scientific program in heavy-ion and spin physics
- The largest running nuclear physics experiment, located on Long Island, New York, USA
- Flexibility is key to understanding complicated systems:
  - Polarized protons, sqrt(s) = 50-500 GeV
  - Nuclei from d to Au, sqrt(s_NN) = 20-200 GeV
- Physics runs to date:
  - Au+Au @ 20, 62, 130, 200 GeV
  - Polarized p+p @ 62, 200 GeV
  - d+Au @ 200 GeV

RHIC is becoming the world leader in the scientific quest toward understanding how mass and spin combine into a coherent picture of the fundamental building blocks nature uses for atomic nuclei. It is also providing unique insight into how quarks and gluons behaved collectively at the very first moments after our universe was born.
Complementary experiments: discovery and characterization of the QGP
[Figure: the 1.2 km RHIC ring and the locations of its experiments: STAR, PHENIX, PHOBOS, BRAHMS & PP2PP.]
The RHIC Computing Facility (RCF)
- The RCF at BNL is the Tier0 for the RHIC program:
  - Online recording of raw data
  - Production reconstruction of all (most) raw data
  - Facility for data selection (mining) and analysis
  - Long-term archiving and serving of all data
  - ... but not sized for Monte Carlo generation
- Equipment refresh funding (~25% annual replacement):
  - Addresses obsolescence
  - Results in important collateral capacity growth
Tier1, Tier2, … remote facilities
- Remote facilities are the primary source of Monte Carlo data, with significant analysis activity (equal to the Tier0's in the case of STAR). Such sites are operational; the top three:
  - STAR: NERSC/PDSF (LBNL), Wayne State University, São Paulo
  - PHENIX: RIKEN (Japan), Center for High Performance Computing (University of New Mexico), VAMPIRE cluster (Vanderbilt University)
- Grid computing is a promising new direction in remote (distributed) computing; STAR and, to a lesser extent, PHENIX are now active in Grid computing.
Key sub-systems
- Mass Storage System: hierarchical storage management by HPSS; 4 StorageTek robotic tape silos, ~4.5 PBytes; 40 StorageTek 9940b tape drives, ~1.2 GB/sec; change of technology to LTO drives this year
- CPU: racked Intel/Linux dual-processor systems; ~2300 CPUs for ~1800 kSPECint2000; mix of Condor- and LSF-based LRMS
- Central disk: 170 TBytes of RAID 5 storage (other storage solutions: PANASAS, …); 32 Sun/Solaris SMP NFS servers, ~1.3 GByte/sec
- Distributed disk: ~400 TBytes, x2.3 more than centralized storage!
What does it look like?
Not like these… although…
MSS, CPUs, central store: …but like these, or similar (the chairs do not seem more comfortable).
Data recording rates
Run 4 set a first record: both STAR and PHENIX recorded at 120 MBytes/sec.
Comparative DAQ rates
A very good talk by Martin Purschke at CHEP04: "Concepts and technologies used in contemporary DAQ systems".
[Chart: approximate DAQ rates, all in MB/sec, for CDF, LHCb, ALICE, CMS, and ATLAS; the values shown include ~25, ~40, ~100, 150, ~300, and ~1250 MB/sec.]
Heavy-ion experiments are in the >100 MB/sec range, and STAR is moving to x10 capability in the outer years (2008+).
Mid- to long-term computing needs
- Computing projection model:
  - Goal: estimate CPU, disk, mass-storage, and network capacities
  - Model based on raw-data scaling; Moore's law used for cost recession
- Feedback from the experimental groups:
  - Annual meetings; the model is refined if necessary (it has been stable for a while)
  - Estimates based on beam-use plans: these may be offset by experiment or by year, but the integral is kept consistent
  - Maturity factor for the codes, number of reconstruction passes, and a "richness" factor for the data (density of interesting events)
A rough sketch of this projection logic appears below.
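To make the projection logic concrete, here is a minimal sketch in Python. Every number and coefficient below is illustrative only (assumed, not the actual RCF model inputs): the CPU requirement scales with raw events, reconstruction passes, a code-maturity factor, and the "richness" of the data, while Moore's law drives the unit cost down over time.

    def cpu_need_ksi2k(raw_events, passes, maturity, richness, ksi2k_sec_per_event):
        # Total reconstruction work for one year's raw data, spread over a year.
        seconds_per_year = 3.15e7
        total = raw_events * passes * maturity * richness * ksi2k_sec_per_event
        return total / seconds_per_year

    def unit_cost(base_cost_per_ksi2k, years_out, halving_years=1.5):
        # Moore's-law "cost recession": price per kSI2k halves every ~1.5 years.
        return base_cost_per_ksi2k * 0.5 ** (years_out / halving_years)

    # Illustrative inputs: 2e9 events, 2 passes, a 20% maturity overhead,
    # richness 1.0, 10 kSI2k.sec per event -> roughly 1500 kSI2k.
    need = cpu_need_ksi2k(2e9, 2, 1.2, 1.0, 10.0)
    print(round(need), "kSI2k; cost in 3 years ~", round(need * unit_cost(100.0, 3)))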
Projected needs

                               2005   2006   2007   2008   2009   2010   2011   2012
    CPU (kSI2k)                1800   2000   3700   8000  17000  27500  37700  62200
    Distributed Disk (TBytes)   350    440    680   2000   3500   5600   7300   8100
    Central Disk (TBytes)       170    170    210    340    610    930   1200   1400
    Disk (GBytes/sec)            25     24     32     53     74     84     82     85
    Tape (PBytes)                 5      5      8     13     20     30     43     55
    Tape (MBytes/sec)          1400   1900   2800   3300   5200   6700   8000   8600

[Chart: tape volume (TB) per year, FY05-FY12, split between STAR and PHENIX.]
[Chart: RCF capacity profile: disk storage (TB) by running year, 2004-2012, central vs. distributed disk.]
Discussion of the model
- The data volume is accurate to within ~20%, i.e., the model was adjusted to the 20% lower end; the upper end has a larger impact in the outer years. DAQ1000 for STAR, enabling billion-event capabilities, is a major (cost) factor driven by physics demand.
- Cost will be beyond current provision: tough years start as soon as 2008; it gets better in the outer years (Moore's law catches up); uncertainties grow with time, however.
- Cost versus Moore's law implies "aggressive" technology upgrades (HPSS, for example) and a strategy heavily based on low-cost distributed disk (cheap, CE-attached).
Disk storage: a "distributed" paradigm
- The ratio is striking: x2.3 now, moving to x6 in the outer years (see the RCF capacity profile chart above). This requires an SE strategy.
- CPU shortfall: Tier1 use (PHENIX, STAR); Tier2 user analysis and data on demand (STAR).
PHENIX: dCache model
[Diagram: the Tier0 (BNL) and the Tier1 (CC-J at RIKEN) each run dCache (an admin node plus worker nodes), linked by GridFTP; on the Tier0 side, the MSS is HPSS plus the central stores.]
The Tier0-Tier1 model provides scalability for centralized storage and a smooth(er) distributed disk model.
PHENIX: data transfer to RIKEN
Network transfer rates of 700-750 Mbits/sec could be achieved (i.e., ~90 MBytes/sec).
STAR: SRM, GridCollector, Xrootd
- A different approach: a large (early) pool of distributed disks and early adoption of the dd (distributed-disk) model. The dd model was too home-grown and did not scale well when mixing dd and central disks.
- Tier0-TierX (X = 1 or 2) model: we need something easy to deploy and easy to maintain.
- Leveraging our SRM experience: data on demand; embryonic event-level access (GridCollector); Xrootd could benefit from an SRM back-end.
[Diagram: the Xrootd/olbd hierarchy: a manager (head node), supervisors (intermediate nodes), and data servers (leaf nodes), each node running xrootd and olbd.]
STAR dd evolution: from this…
Where does this data go?
[Diagram: control nodes run the DataCarousel; a client script adds restore records; files land via pftp on the local disks; FileCatalog management updates file locations, marks files {un-}available, and spiders the disks to update.]
VERY HOMEMADE, VERY "STATIC".
STAR dd evolution: …to that…
- The entire cataloguing layer is gone, and so is the layer restoring from MSS to dd; files still land via pftp on local disk: DATA ON DEMAND.
- XROOTD provides load balancing, possibly scalability, and a way to avoid LFN/PFN translation...
- But it does NOT fit within our SRM-invested directions...
- AND IS IT REALLY SUFFICIENT!?
Coordination of requests needed
- Un-coordinated requests to the MSS are a disaster. This applies to ANY SE-related tool, and it gets worse if the environment combines technologies (shared infrastructure).
- The effect on performance is drastic. A sketch of the kind of coordination needed follows.
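A minimal sketch, in Python, of the coordination a DataCarousel-style layer provides: group pending restore requests by tape cartridge and read each cartridge sequentially, instead of letting every client trigger its own mounts. The data structures are hypothetical, not the actual STAR implementation.

    from collections import defaultdict

    def order_requests(requests):
        """requests: list of (file, tape_id, offset) from many clients.
        Returns a staging plan with one mount per tape and sequential
        reads within it, avoiding the mount/dismount thrashing that
        un-coordinated access produces."""
        by_tape = defaultdict(list)
        for f, tape, offset in requests:
            by_tape[tape].append((offset, f))
        plan = []
        for tape in sorted(by_tape):
            plan += [f for _, f in sorted(by_tape[tape])]
        return plan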
STAR Grid program: motivation
- Tier0 production: ALL event files are copied to HPSS at the end of a production job; data reduction goes from DAQ to Event to micro-DST (MuDST).
- All MuDST are on "disks": one copy temporarily on centralized storage (NFS), one permanently in HPSS. A script checks consistency (job status, presence of files in one and the other); if the "sanity" checks pass (integrity/checksum), the files are registered in the Catalog (sketched below).
- Re-distribution: once registered, MuDST may be "distributed" to distributed disk on Tier0 sites, to Tier1 (LBNL), and to Tier2 sites ("private" resources for now).
- SRM has been in use since 2003...
- The strategy implies IMMEDIATE dataset replication: it allows balancing of analysis between Tier0 and Tier1, and data on demand enables Tier2 sites with capabilities.
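In outline, the sanity-check-then-register step might look like the Python below. The helper arguments (the HPSS presence flag, the catalog object and its register() method) are hypothetical stand-ins; the real STAR scripts differ.

    import hashlib, os

    def register_if_sane(nfs_path, in_hpss, expected_md5, catalog):
        # Register a produced MuDST only if both copies exist and the
        # disk copy's checksum matches the production record.
        if not (in_hpss and os.path.exists(nfs_path)):
            return False                      # a copy is missing: do not register
        with open(nfs_path, "rb") as f:
            if hashlib.md5(f.read()).hexdigest() != expected_md5:
                return False                  # integrity check failed
        catalog.register(nfs_path)            # now eligible for re-distribution
        return True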
Needed for immediate exploitation of resources
- Short/medium-term strategy: distribute the data, and take advantage of its static placement (schedulers, workflow, …).
- Advanced strategies: data on demand (planner, dataset balancing, data placement, …); selection of sub-sets of data (datasets of datasets, …); a consistent strategy (interoperability? publishing?).
- Less naïve considerations: job tracking, packaging, automatic error recovery, help desk, networking, advanced workflow, …
- The tools: SRM / DataMover; the STAR Unified Meta-Scheduler; Xrootd, …; GridCollector, whose SRM back-ends would enable Xrootd with objects on demand. We will leverage existing middleware where it exists, or address the needs one by one…
SRM / DataMover
SRMs are middleware components whose function is to provide dynamic space allocation and file management of shared storage components on the Grid.
[Diagram: users and applications go through Grid middleware to a uniform SRM interface sitting in front of heterogeneous storage back-ends: dCache, Castor, Enstore, JASMine, Unix-based disks, and the CCLRC RAL SE.]
http://osg-docdb.opensciencegrid.org/0002/000299/001/GSM-WG-GGF15-SRM.ppt
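The value of that picture is the uniform interface; a minimal sketch in Python (the method names are illustrative, not the actual GGF SRM specification):

    from abc import ABC, abstractmethod

    class SRM(ABC):
        """One interface, many back-ends: dCache, Castor, Enstore,
        JASMine, plain Unix disks, ..."""
        @abstractmethod
        def reserve_space(self, n_bytes, lifetime_s): ...
        @abstractmethod
        def get(self, lfn): ...               # stage the file, return a transfer URL
        @abstractmethod
        def put(self, source_url, lfn): ...   # accept a file into managed space

    def replicate(src, dst, lfn):
        # Client code never needs to know which storage sits behind either end.
        dst.put(src.get(lfn), lfn)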
SRM / DataMover
- A layer on top of SRM, in use for BNL-LBNL data transfers for years: all MuDST are moved to Tier1 this way.
- Extremely reliable: "Set it, and forget it!" Several tens of thousands of files transferred, multiple TB over days, with no losses.
- The project was (IS) extremely useful and in production usage in STAR: data are available at the remote site as they are produced. We need this NOW: faster analysis means better science, sooner, plus data safety.
- Caveat/addition in STAR: RRS (Replica Registration Service). 250k files and 25 TB transferred AND catalogued; 100% reliability; project deliverables on time.
SRM / DataMover: flow diagram
[Diagram: the DataMover (command-line interface) gets the list of files from a directory and issues one SRM-COPY for thousands of files. The HRM at BNL performs the reads, staging each file from HPSS into its disk cache; SRM-GET is issued one file at a time; a GridFTP GET (pull mode) carries the network transfer into the disk cache of the HRM at LBNL, which performs the writes and archives the files. RRS reads the BNL File Catalog and writes the new replicas into the LBNL File Catalog (MySQL, with a mirror at each end).] A pseudocode rendering follows.
NEW: being deployed at Wayne State University and São Paulo. The DRM is used in the data-analysis scenario as a lightweight SE service (deployable on the fly): all the benefits of SRM (advance reservation, …); if we know there IS storage space, we can take it; no heavy-duty SE deployment.
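Reduced to pseudocode, the pull-mode flow above is a single coordinated loop. All names here are hypothetical stand-ins for the SRM-COPY / SRM-GET / GridFTP / RRS calls in the diagram, not a real API:

    def datamover_pull(directory, source_hrm, target_hrm, rrs):
        # One srm-copy of a directory = many coordinated single-file transfers.
        for f in source_hrm.list(directory):   # get the list of files from the directory
            source_hrm.stage(f)                # srm-get: HPSS -> source disk cache
            target_hrm.gridftp_get(f)          # GridFTP GET (pull mode) into the target cache
            target_hrm.archive(f)              # migrate into the target MSS
            rrs.register(f)                    # RRS catalogues the new replica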
CE/SE decoupling
- srm-copy runs from the execution-site DRM back to the submission site: the submission site's DRM is called from the worker node (WN) at the execution site.
- Requires an outgoing, but not an incoming, connection on the WN.
- The srm-copy callback is disabled (asynchronous transfer), so the batch slot is released immediately after the srm-copy call; a sketch of the WN side follows below.
- The final destination of the files is HPSS or disk, owned by the user.
[Diagram: at the job-execution site, clients write to /scratch and the local DRM cache; files then flow asynchronously to the DRM cache at the submission site.]
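The worker-node side of this decoupling is essentially fire-and-forget. A sketch under stated assumptions: the srm-copy flag shown is schematic (the slide only says the callback is disabled), not the exact DRM client syntax.

    import subprocess, sys

    def ship_results_and_exit(files, submission_drm_url):
        # Start asynchronous srm-copy transfers back to the submission
        # site's DRM, then release the batch slot immediately. Only an
        # outgoing connection from the WN is required.
        for f in files:
            subprocess.Popen(["srm-copy", "--no-callback",   # schematic flag
                              "file://" + f, submission_drm_url])
        sys.exit(0)   # slot freed; the transfers complete in the background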
SUMS: the STAR Unified Meta-Scheduler
- Gateway to user batch-mode analysis: the user writes an abstract job description; the scheduler submits where the files are, where the CPU is, ...; it collects usage statistics; users DO NOT need to know about the RMS layer.
- Dispatcher and policy engines: dataset-driven, with a full catalog implementation, and Grid-aware; throttles IO resources, avoids contention, optimizes on CPU.

Job description, test.xml (note that no data locations are specified):

<?xml version="1.0" encoding="utf-8" ?>
<job maxFilesPerProcess="500">
  <command>root4star -q -b rootMacros/numberOfEventsList.C\(\"$FILELIST\"\)</command>
  <stdout URL="file:/star/u/xxx/scheduler/out/$JOBID.out" />
  <input URL="catalog:star.bnl.gov?production=P02gd,filetype=daq_reco_mudst" preferStorage="local" nFiles="all"/>
  <output fromScratch="*.root" toURL="file:/star/u/xxx/scheduler/out/" />
</job>

[Diagram: query/wildcard resolution turns the catalog query into per-process file lists and scripts, e.g. sched1043250413862_0.list / .csh, _1, _2, ..., each listing files such as /star/data09/reco/productionCentral/FullFie...]
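For usage: if memory serves, a description like test.xml is handed to SUMS with a single command (treat the exact command name as an assumption); SUMS then resolves the catalog query, splits the matching files into the sched*.list files shown above, and fills in $FILELIST and $JOBID for each sub-job before dispatching to the local RMS or the Grid.

    star-submit test.xml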
SUMS: the STAR Unified Meta-Scheduler (effect of IO throttling)
[Charts: farm IO before and after SUMS throttling. BEFORE: very choppy, as NFS contention would impact computational performance. AFTER: smoother, modulo remaining farm instability.]
SUMS: the next generation
New features:
- RDL in addition to the U-JDL.
- Testing of grid submission is OVER: SUMS is ready for production and user analysis. The lightweight SRM is helping tremendously; a scalability test is still needed.
- Aware of multiple packaging methods (from ZIP archives to PACMAN); tested for simple analysis; mixing archiving technologies still needs finalizing (a detail).
- Versatile configuration: a site can "plug and play"; possibility of multi-VO support within ONE install.
Scalability matters: we already run multi-10k jobs/day, with spikes at 100k (valid) jobs from nervous users…
GridCollector: using an event catalog to speed up user analysis in a distributed environment
- STAR event catalog: based on the TAGS produced at reconstruction time; rests on the now well-tested and robust SRMs (DRM+HRM) deployed in STAR anyhow; immediate access and a managed SE; files are moved transparently, by delegation to the SRM service, BEHIND THE SCENES; easier to maintain, and the prospects are enormous.
- "Smart" IO-related improvements and home-made formats are (a priori) no faster than using GridCollector: physicists could get back to physics, and STAR technical personnel would be better off supporting GC.
- It is a WORKING prototype of a Grid interactive analysis framework, and VERY POWERFUL: event-"server" based (no longer file based).
[Plot: speedup vs. selectivity (0.01 to 1), for both elapsed time and CPU; the gain is ALWAYS > 1, regardless of selectivity.]
root4star -b -q doEvents.C'(25,"select MuDst where Production=P04ie \
    and trgSetupName=production62GeV and magScale=ReversedFullField \
    and chargedMultiplicity>3300 and NV0>200", "gc,dbon")'
GridCollector: the next step
- Functionality can be pushed "down": index bitmap technology in the ROOT framework; make "a" coordinator "aware" of events (i.e., of objects). Xrootd is a good candidate, the ROOT framework is preferred; either would serve as a demonstrator (immediate benefit to a few experiments…). A toy illustration of the bitmap-index idea follows this list.
- Object-on-Demand, from files to object management: a Science Application Partnership (SAP) under SciDAC-II, in the OSG program of work as a leveraged technology to achieve its goals.
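To make the bitmap-index idea concrete, here is a toy equality-encoded bitmap index in Python. GridCollector itself rests on compressed bitmap (FastBit) technology; this sketch ignores compression and is purely illustrative.

    def build_bitmaps(values):
        # One bit-vector per distinct attribute value: bit i is set
        # iff event i carries that value.
        bitmaps = {}
        for i, v in enumerate(values):
            bitmaps[v] = bitmaps.get(v, 0) | (1 << i)
        return bitmaps

    def select(bitmaps, predicate):
        # OR together the bitmaps of all matching values; AND-ing several
        # such masks answers multi-attribute queries (like the
        # chargedMultiplicity/NV0 selection above) without reading events.
        mask = 0
        for v, bm in bitmaps.items():
            if predicate(v):
                mask |= bm
        return mask

    mult = [1200, 3400, 3350, 900]
    mask = select(build_bitmaps(mult), lambda m: m > 3300)
    print([i for i in range(len(mult)) if mask >> i & 1])   # -> [1, 2]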
The Open Science Grid
- In the US, the Grid effort is moving to the Open Science Grid: an interesting adventure comparable to similar European efforts, with EGEE interoperability at its heart.
- The character of OSG:
  - Distributed ownership of resources: local facility policies, priorities, and capabilities need to be supported.
  - A mix of agreed-upon performance expectations and opportunistic resource use.
  - Infrastructure deployment based on the Virtual Data Toolkit.
  - Will incrementally scale the infrastructure, with milestones, to support stable running of a mix of increasingly complex jobs and data management.
  - Peer collaboration of computer and application scientists, facility, technology, and resource providers: an "end-to-end approach".
  - Support for many VOs, from the large (thousands of members) to the very small and dynamic (down to the single researcher and the high-school class).
  - A loosely coupled, consistent infrastructure: a "Grid of Grids".
STAR and the OSG
- STAR could not run on Grid3: it was running at PDSF, a Grid3 site set up in collaboration with our resources.
- STAR on OSG is a big improvement: OSG is for open science, not as strongly an LHC-only focus; expanding to other sciences means a revisit of needs and requirements; more resources; greater stability.
- Currently: we run MC on a regular basis (nightly tests, standard MC); recently focused on user analysis (lightweight SRM); helped other sites deploy the OSG stack.
- And it shows… the FIRST functional site in Brazil, the Universidade de São Paulo, a STAR institution…
  http://www.interactions.org/sgtw/2005/0727/star_saopaulo_more.html
Summary
- The RHIC computing facility provides adequate resources in the short term, but the model is imperfect for long-term projections: problematic years start in 2008, driven by high data throughput and physics demands. This mid-term issue will impact Tier1 sites as well, assuming a refresh and planning along the same model. Out-sourcing?
- Under data "stress" and increasing complexity, the RHIC experiments have integrated, at one level or another, distributed computing principles: data distribution and management; job scheduling, selectivity, …
- STAR intends to take full advantage of the OSG, help bring more institutions into the OSG, and address the issue of batch-oriented user analysis (opportunistic use, …).