Computing for Hall D
Ian Bird
Hall D Collaboration Meeting, March 22, 2002
Data Volume per experiment per year (Raw data, in units of 10^9 bytes)

[Figure: log-scale plot of raw data volume, 100 to 1,000,000 (× 10^9 bytes), versus year, 1980–2010. Experiments shown: E691, E665, E769, E791, E831, ALEPH, ZEUS, NA48, KTeV, E871, CDF/D0, BABAR, STAR/PHENIX, JLAB, CMS/ATLAS.]
But: collaboration sizes!
Technologies
• Technologies are advancing rapidly
– Compute power
– Storage – tape and disk
– Networking
• What will be available 5 years from now?
– Difficult to predict – but it will not be a problem to provide any of the resources that Hall D will need…
– E.g. computing:
Intel Linux Farm:
– First purchases: 9 duals per 24” rack
– FY00: 16 duals (2U) + 500 GB cache (8U) per 19” rack
– FY01: 4 CPUs per 1U
– Recently: 5 TB IDE cache disk (5 × 8U) per 19” rack
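To see how fast farm density grew across these generations, a quick back-of-the-envelope comparison in Python (a sketch: the 42U rack height is an assumption, and each “dual” is taken to be a dual-CPU server):

# CPUs per rack across the farm generations listed above.
# Assumptions: a full-height 42U rack; a "dual" is a dual-CPU server.
RACK_U = 42

first_purchase_cpus = 9 * 2    # 9 duals per rack   -> 18 CPUs
fy00_cpus = 16 * 2             # 16 duals (+ cache) -> 32 CPUs
fy01_cpus = 4 * RACK_U         # 4 CPUs per 1U      -> 168 CPUs

print(first_purchase_cpus, fy00_cpus, fy01_cpus)  # 18 32 168
# Roughly a 9x CPU-density improvement in about two years.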
Compute power
• Blades
– Low power chips (Transmeta, Intel)
– Hundreds in a single rack
• “An RLX System 300ex chassis holds twenty-four ServerBlade 800i units in a single 3U chassis. This density achievement packs 336 independent servers into a single 42U rack, delivering 268,800 MHz, over 27 terabytes of disk storage, and a whopping 366 gigabytes of DDR memory.”
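The quoted density figures are internally consistent; a quick arithmetic check in Python (assuming a full 42U rack and the 800 MHz clock implied by the “800i” model name):

# Check the blade-density arithmetic in the RLX quote above.
CHASSIS_U = 3            # one System 300ex chassis occupies 3U
BLADES_PER_CHASSIS = 24  # ServerBlade 800i units per chassis
RACK_U = 42
BLADE_MHZ = 800          # assumed clock, from the "800i" model name

chassis_per_rack = RACK_U // CHASSIS_U                   # 14 chassis
blades_per_rack = chassis_per_rack * BLADES_PER_CHASSIS  # 336 servers
total_mhz = blades_per_rack * BLADE_MHZ                  # 268,800 MHz

print(chassis_per_rack, blades_per_rack, total_mhz)  # 14 336 268800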
Technologies
• As well as computing, storage and networking technologies will also advance rapidly
• Grid computing techniques will bring these technologies together
• Facilities – new Computer Center planned
• Issues will not be technology, but:
– How to use them intelligently
– Hall D computing model
– People
– Treating computing seriously enough to assign sufficient resources
(Data-) Grid Computing
Particle Physics Data Grid Collaboratory Pilot

Who we are: four leading Grid computer science projects and six international high energy and nuclear physics collaborations

What we do: develop and deploy Grid services for our experiment collaborators, and promote and provide common Grid software and standards

The problem at hand today: petabytes of storage, teraops/s of computing, thousands of users, hundreds of institutions, 10+ years of analysis ahead
PPDG Experiments
ATLAS – A Toroidal LHC ApparatuS at CERN. Runs 2006 on.
Goals: TeV physics – the Higgs and the origin of mass…
http://atlasinfo.cern.ch/Atlas/Welcome.html

BaBar – at the Stanford Linear Accelerator Center. Running now.
Goals: study CP violation and more.
http://www.slac.stanford.edu/BFROOT/

CMS – the Compact Muon Solenoid detector at CERN. Runs 2006 on.
Goals: TeV physics – the Higgs and the origin of mass…
http://cmsinfo.cern.ch/Welcome.html/

D0 – at the D0 colliding beam interaction region at Fermilab. Runs soon.
Goals: learn more about the top quark, supersymmetry, and the Higgs.
http://www-d0.fnal.gov/

STAR – Solenoidal Tracker At RHIC at BNL. Running now.
Goals: quark-gluon plasma…
http://www.star.bnl.gov/

JLAB – Thomas Jefferson National Laboratory. Running now.
Goals: understanding the nucleus using electron beams…
http://www.jlab.org/
PPDG Computer Science Groups
Condor – develop, implement, deploy, and evaluate mechanisms and policies that support High Throughput Computing on large collections of computing resources with distributed ownership.
http://www.cs.wisc.edu/condor/
Globus - developing fundamental technologies needed to build persistent environments that enable software applications to integrate instruments, displays, computational and information resources that are managed by diverse organizations in widespread locations
http://www.globus.org/
SDM - Scientific Data Management Research Group – optimized and standardized access to storage systems
http://gizmo.lbl.gov/DM.html
Storage Resource Broker - client-server middleware that provides a uniform interface for connecting to heterogeneous data resources over a network and cataloging/accessing replicated data sets.
http://www.npaci.edu/DICE/SRB/index.html
Delivery of End-to-End Applications & Integrated Production Systems
to allow thousands of physicists to share data & computing resources for scientific processing and analyses
PPDG Focus:
- Robust Data Replication
- Intelligent Job Placement and Scheduling
- Management of Storage Resources
- Monitoring and Information of Global Services
Relies on Grid infrastructure:
- Security & Policy
- High Speed Data Transfer
- Network management
Resources: Computers, Storage, Networks
Operators & Users
Project Activities, End-to-End Applications and Cross-Cut Pilots

Project Activities are focused Experiment–Computer Science collaborative developments:
- Replicated data sets for science analysis – BaBar, CMS, STAR
- Distributed Monte Carlo production services – ATLAS, D0, CMS
- Common storage management and interfaces – STAR, JLAB
End-to-End Applications used in Experiment data handling systems to give real-world requirements, testing and feedback:
- Error reporting and response
- Fault tolerant integration of complex components

Cross-Cut Pilots for common services and policies:
- Certificate Authority policy and authentication
- File transfer standards and protocols
- Resource monitoring – networks, computers, storage
Year 0.5-1 Milestones (1)
Align milestones to Experiment data challenges:
– ATLAS – production distributed data service – 6/1/02
– BaBar – analysis across partitioned dataset storage – 5/1/02
– CMS – Distributed simulation production – 1/1/02
– D0 – distributed analyses across multiple workgroup clusters – 4/1/02
– STAR – automated dataset replication – 12/1/01
– JLAB – policy driven file migration – 2/1/02
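The JLAB milestone above, policy-driven file migration, is straightforward to illustrate. A minimal sketch in Python (hypothetical cache and staging paths and a simple age-based policy, not JLAB's actual implementation):

import os
import shutil
import time

# Sketch of policy-driven file migration (hypothetical policy and paths,
# not JLAB's actual system): files in a disk cache that have not been
# accessed within MAX_AGE_SECONDS are moved to a tape staging area.
CACHE_DIR = "/cache"       # hypothetical disk cache
TAPE_DIR = "/tape-stage"   # hypothetical tape staging area
MAX_AGE_SECONDS = 7 * 24 * 3600

def migrate_by_policy():
    now = time.time()
    for name in os.listdir(CACHE_DIR):
        path = os.path.join(CACHE_DIR, name)
        if not os.path.isfile(path):
            continue
        # Policy: migrate files not accessed within the age limit.
        if now - os.path.getatime(path) > MAX_AGE_SECONDS:
            shutil.move(path, os.path.join(TAPE_DIR, name))

if __name__ == "__main__":
    migrate_by_policy()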
Year 0.5-1 Milestones (2)
Common milestones with EDG:
GDMP – robust file replication layer – Joint Project with EDG Work Package (WP) 2 (Data Access)
Support of Project Month (PM) 9 WP6 TestBed Milestone. Will participate in integration fest at CERN - 10/1/01
Collaborate on PM21 design for WP2 - 1/1/02
Proposed WP8 Application tests using PM9 testbed – 3/1/02
Collaboration with GriPhyN:
SC2001 demos will use common resources, infrastructure and presentations – 11/16/01
Common, GriPhyN-led grid architecture
Joint work on monitoring proposed
Year ~0.5-1 “Cross-cuts”
• Grid File Replication Services used by >2 experiments (a replication sketch follows this list):
– GridFTP – production releases
• Integrate with D0-SAM, STAR replication
• Interfaced through SRB for BaBar, JLAB
• Layered use by GDMP for CMS, ATLAS
– SRB and Globus Replication Services
• Include robustness features
• Common catalog features and API
– GDMP/Data Access layer continues to be shared between EDG and PPDG
• Distributed Job Scheduling and Management used by >1 experiment (a scheduling sketch follows this list):
– Condor-G, DAGman, Grid-Scheduler for D0-SAM, CMS
– Job specification language interfaces to distributed schedulers – D0-SAM, CMS, JLAB
• Storage Resource Interface and Management
– Consensus on API between EDG, SRM, and PPDG
– Disk cache management integrated with data replication services
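To make the robust replication idea concrete: the core pattern is transfer, verify by checksum, retry on failure. A minimal sketch in Python, with a local copy standing in for a GridFTP transfer (the helper names are hypothetical, not GDMP's or SRB's actual API):

import hashlib
import shutil

def checksum(path, algo="md5"):
    """Checksum a file in chunks so multi-GB physics files fit in memory."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def replicate(source, destination, retries=3):
    """Copy source to destination, verify by checksum, retry on failure.

    shutil.copy stands in for a GridFTP transfer; the retry-and-verify
    loop, not the transport, is the point of the sketch.
    """
    want = checksum(source)
    for attempt in range(1, retries + 1):
        try:
            shutil.copy(source, destination)
            if checksum(destination) == want:
                return True
        except OSError as err:
            print(f"attempt {attempt} failed: {err}")
    return False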
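Similarly, the DAGman-style job management listed above boils down to submitting jobs in dependency order. A minimal sketch (hypothetical job names; a plain topological sort stands in for Condor's machinery, which also handles retries and remote submission):

from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical analysis DAG: each job lists the jobs it depends on.
dag = {
    "simulate": set(),
    "reconstruct": {"simulate"},
    "analyze": {"reconstruct"},
}

def submit(job):
    print(f"submitting {job} to a (stand-in) batch scheduler")

# Run jobs in an order that respects the dependencies, DAGman-style.
for job in TopologicalSorter(dag).static_order():
    submit(job)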
Year ~1 other goals:
• Transatlantic Application Demonstrators:
– BaBar data replication between SLAC and IN2P3
– D0 Monte Carlo job execution between Fermilab and NIKHEF
– CMS & ATLAS simulation production between Europe and the US
• Certificate exchange and authorization
– DOE Science Grid as CA?
• Robust data replication
– fault tolerant
– between heterogeneous storage resources
• Monitoring Services (a sketch follows this list)
– MDS2 (Metacomputing Directory Service)?
– common framework
– network, compute and storage information made available to scheduling and resource management
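As an illustration of the kind of host information such a monitoring framework would feed to schedulers, a minimal sketch using only local OS calls (the record format is hypothetical; MDS2 itself is a directory service, not shown):

import os
import socket
import time

def local_resource_report(path="/"):
    """Gather the kind of host data a monitoring service would publish
    for schedulers: load average, free disk, and a timestamp.
    Hypothetical record format; Unix-only OS calls."""
    load1, _, _ = os.getloadavg()
    disk = os.statvfs(path)
    return {
        "host": socket.gethostname(),
        "timestamp": time.time(),
        "load_1min": load1,
        "disk_free_gb": disk.f_bavail * disk.f_frsize / 1e9,
    }

print(local_resource_report())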
PPDG activities as part of the Global Grid Community
Coordination with other Grid Projects in our field:
- GriPhyN – Grid Physics Network
- European DataGrid
- Storage Resource Management collaboratory
- HENP Data Grid Coordination Committee

Participation in Experiment and Grid deployments in our field:
- ATLAS, BaBar, CMS, D0, STAR, JLAB experiment data handling systems
- iVDGL/DataTAG – International Virtual Data Grid Laboratory
- Use DTF computational facilities?

Active in Standards Committees:
- Internet2 HENP Working Group
- Global Grid Forum
What should happen now?
• Collaboration needs to define its computing model
– It really will be distributed – grid based
– Although the compute resources can be provided, it is not obvious that the vast quantities of data can really be analyzed efficiently by a small group
• Do not underestimate the task
– The computing model will define requirements for computing – some of which may require some lead time
• Ensure software and computing is managed as a project equivalent in scope to the entire detector
– It has to last at least as long, and it runs 24x365
– The complete software system is more complex than the detector, even for Hall D where the reconstruction is relatively straightforward
– It will be used by everyone
• Find and empower a computing project manager now