


BaBar and DØ Experiment Reports

DOE Review of PPDG

January 28-29, 2003

Lee Lueking

Fermilab Computing Division

D0 liaison to PPDG


Introduction: BaBar and DØ

BaBar's PPDG effort is concentrating on:

– Data Distribution on the Grid (SRB, BdbServer++).

– Job submission on the Grid (EDG, LCG).

People involved:

– Tim Adye (RAL)

– Andy Hanushevsky (SLAC)

– Adil Hasan (SLAC)

– Wilko Kroeger (SLAC)

Interactions with other Grid efforts that are part of BaBar:

– GridPP (UK), EDG (Europe, through Dominique Boutigny), GridKA, Italian Grid groups, etc.

BaBar Grid applications are being designed to be data-format neutral

– BaBar's new computing model should have little impact on the apps.

DØ’s PPDG effort is concentrating on:

– Data Distribution on the Grid (SAM).

– Job submission on the Grid (JIM with Condor-G and Globus).

People involved:

– Igor Terekhov (FNAL; JIM Team Lead)

– Gabriele Garzoglio (FNAL)

– Andrew Baranovski (FNAL)

– Parag Mhashilkar & Vijay Murthi (via contract with UTA CSE)

– Lee Lueking (FNAL; D0 Liaison to PPDG)

Interactions with other Grid efforts that are part of D0:

– GridPP (UK), GridKA (DE), NIKHEF (NL), CCIN2P3 (FR)

Working very closely with the Condor team to achieve:

– Grid Job & Resource Matchmaking service

– Other robustness and usability features


Overview of BaBar and DØ Data Handling

Both experiments have extensive distributed computing and data handling systems. Significant amounts of data are processed at remote sites in the US and Europe.

[Map: DØ SAM deployment, regional centers and analysis sites]
[Map: BaBar deployment, Tier A centers and Monte Carlo production sites]
[Chart: DØ integrated files consumed, Mar '02 to Mar '03: 4.0 M files]
[Chart: DØ integrated data consumed, Mar '02 to Mar '03: 1.2 PB]
[Chart: BaBar database growth, Jan '02 to Dec '02: 730 TB]
[Chart: BaBar analysis jobs at SLAC, Apr '02 to Mar '03: 140k jobs]


BaBar Bulk Data Distribution – SRB

Storage Resource Broker (SRB) from SDSC is being used to test data distribution from Tier A to Tier A, with a view to production this summer.

Two successful demos so far: Supercomputing 2001 (SLAC to SLAC) and Supercomputing 2002 (SLAC to ccin2p3).

Testing SRB V2 (released Feb 2003); new features include bulk registration in the RDBMS and parallel-stream file replication.

Busy incorporating the newly designed BaBar metadata tables into SRB's RDBMS tables. Looking to improve file replication performance (tuning streams, etc.). A minimal usage sketch follows.
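For flavor, a minimal session with SRB's Scommands client might look like the sketch below; the collection path is hypothetical, and exact command usage varied across SRB releases.

    # Illustrative SRB Scommands session (hypothetical collection path).
    Sinit                        # authenticate and open an SRB session
    Scd /home/babar.prod/skims   # move into the target SRB collection
    Sput events.root             # register a local file into the collection
    Sls                          # list the collection to confirm registration
    Sget events.root             # (at another site) retrieve a copy
    Sexit                        # close the session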


BaBar User-driven data distribution: BdbServer++

Attempts to address the use case where a user wants to copy a collection of sparse events with little space overhead (mainly Tier A to Tier C).

BdbServer++ is essentially a set of scripts that (see the sketch below):

– Submit a job to the Grid to make a deep copy of the sparse collection (i.e., copy objects only for the events of interest).

– Then copy the files back to the user's institution through the Grid (can use globus-url-copy).

Poster at CHEP 2003. Deep copy through the grid has been tested using EDG and pure Globus; just completed a test of extracting data using globus-url-copy (a pure Globus request).

To do: incorporate with BaBar bookkeeping; robustness and reliability tests; production-level scripts for submission and copying.
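A rough sketch of that two-step flow under pure Globus, assuming hypothetical host names and a hypothetical deep-copy helper (the real BdbServer++ scripts differ):

    # Step 1: run the deep copy at the remote Tier A site.
    # "bdb-deep-copy" and all hosts/paths are hypothetical placeholders.
    grid-proxy-init
    globus-job-run tiera.example.org/jobmanager-lsf \
        /bin/sh -c "bdb-deep-copy /collections/my-skim /scratch/user/deepcopy"

    # Step 2: pull the resulting files back to the home institution over GridFTP.
    globus-url-copy gsiftp://tiera.example.org/scratch/user/deepcopy/skim.root \
        file:///home/user/data/skim.root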


BaBar Job Submission on the Grid

Many production-like activities could take advantage of compute resources at more than one site:

– Analysis Production: ccin2p3 (France), UK, SLAC, using EDG installations.

– Simulation Production: Ferrara (Italy) Grid Group, Ohio, using EDG and VDT installations.

– Also very useful for data distribution (BdbServer++): ccin2p3 (France), SLAC. (See the EDG submission sketch below.)
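As a sketch of what an EDG submission looks like: the job is described in a JDL file and handed to the workload management system. The file names here are hypothetical, and the CLI name varied across EDG releases (dg-job-submit early on, edg-job-submit later).

    # Hypothetical JDL for a simulation job, then submission via the EDG WMS.
    cat > simu.jdl <<'EOF'
    Executable    = "run_simulation.sh";
    StdOutput     = "simu.out";
    StdError      = "simu.err";
    InputSandbox  = {"run_simulation.sh"};
    OutputSandbox = {"simu.out", "simu.err"};
    EOF
    edg-job-submit simu.jdl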

[Diagram: Proposed BaBar Grid Architecture]


BaBar Job Submission on the Grid

There was a CHEP 2003 talk and poster, a grid demo was set up in the UK (running BaBar jobs on the UK grid), and Simulation Production and data distribution tests have been run on the Grid.

Plan: test new EDG2/LCG installations, increase users as releases stabilize.

BbgUtils.pl: a Perl script to allow easier client-side installation of Globus plus CAs (currently works for Sun, Linux).

– The script copies all the tar files, signing policies, etc., necessary for a client installation for that experiment (see the sketch after this list).

– Can be readily extended to include SRB client-side installation, EDG/LCG client side installation, etc.
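For flavor, the kind of steps such an installer automates might look like the following shell sketch; the URLs and file names are hypothetical, but the certificate layout is the standard Globus one.

    # What a client-side Globus + CA install automates (hypothetical URLs/paths).
    wget http://babar.example.org/grid/globus-client.tar.gz
    tar -xzf globus-client.tar.gz -C /opt/globus

    # Install the CA certificates and their signing policies where Globus
    # expects to find them.
    mkdir -p /etc/grid-security/certificates
    cp ca-certs/*.0 ca-certs/*.signing_policy /etc/grid-security/certificates/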


DØ Objectives of SAMGrid

Bring standard grid technologies (including Globus and Condor) to the Run II experiments.

Enable globally distributed computing for DØ and CDF.

JIM (Job and Information Management) complements SAM by adding job management and monitoring to data handling. Together, JIM + SAM = SAMGrid.
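Since JIM builds its job management on Condor-G and Globus, a grid job at this layer is described by a Condor-G submit file; a minimal sketch, with a hypothetical gatekeeper host and file names:

    # Minimal Condor-G submission: "universe = globus" routes the job
    # through Condor-G to a Globus gatekeeper (host name hypothetical).
    cat > grid_job.submit <<'EOF'
    universe        = globus
    globusscheduler = gatekeeper.example.org/jobmanager-pbs
    executable      = run_analysis.sh
    output          = job.out
    error           = job.err
    log             = job.log
    queue
    EOF
    condor_submit grid_job.submit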

[Diagram: SAMGrid architecture. Users work through a Submission Client / User Interface; jobs go to a Broker / Match Making Service, which consults an Information Collector fed by Grid Sensors at each site. Each Execution Site (#1 ... #n) provides a Queuing System, Computing Elements, and Storage Elements tied into the Data Handling System. JIM provides the job management layer; SAM provides the data handling.]


DØ JIM Deployment

[Diagram: DØ's hierarchical analysis model: the Central Analysis Center (CAC) at the top, Regional Analysis Centers (RACs), Institutional Analysis Centers (IACs), and Desktop Analysis Stations (DASs), connected by normal and occasional interaction communication paths.]

A site can join SAMGrid with any combination of services:

– Monitoring, and/or

– Execution, and/or

– Submission

May 2003: Expect 5 initial execution sites for SAMGrid deployment, and 20 submission sites.

– GridKa (Karlsruhe): analysis site

– Imperial College and Lancaster: MC sites

– U. Michigan (NPACI): reconstruction center

– FNAL: ClueD0 as a submission site

Summer 2003: Continue to add execution and submission sites. The second round of execution site deployments includes Lyon (ccin2p3), Manchester, MSU, Princeton, UTA, and the FNAL CAB system.

Hope to grow to dozens of execution sites and hundreds of submission sites over the next year(s).

Use grid middleware for job submission within a site too!

– Administrators will have general ways of managing resources.

– Users will use common tools for submitting and monitoring jobs everywhere.


What’s Next for SAMGrid? (After JIM version 1)

Improve job scheduling and decision making.

More comprehensive monitoring that is easier to navigate.

Execution of structured jobs.

Simplify packaging and deployment.

Extend the configuration and advertising features of the uniform XML-based framework built for JIM.

CDF is adopting SAM and SAMGrid for its data handling and job submission, and has also asked to join PPDG.

Interoperability, interoperability, interoperability

– Working with EDG and LCG to move in common directions

– Moving to Web services, Globus V3, and all the good things OGSA will provide. In particular, interoperability by expressing SAM and JIM as a collection of services, and mixing and matching with other Grids.


Challenges

Meeting the challenges of real data handling and job submission: BaBar and DØ have confronted real-life issues, including:

– File replication integrity

– Preemptive distributed caching

– Private networks

– Routing data in a worldwide system

– Reliable network file transfers, timeouts, and retries (see the retry sketch below)

– Simplifying complex installation procedures

– Username clashing issues; moving to GSI and Grid certificates

– Interoperability with many mass storage systems (MSS)

– Security issues, firewalls, site policies

– Robust job submission on the grid

Troubleshooting is an important and time-consuming activity in distributed computing environments, and many tools are needed to do it effectively.

Operating these distributed systems on a 24/7 basis involves coordination, training, and worldwide effort.

Standard middleware is still hard to use, and requires significant integration, testing, and debugging.
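On the transfer-reliability point, the usual defensive pattern is a bounded retry loop around each copy; a minimal sketch with globus-url-copy and hypothetical URLs:

    # Bounded retry around a GridFTP transfer (URLs are hypothetical).
    SRC=gsiftp://remote.example.org/data/file01.root
    DST=file:///local/cache/file01.root
    for attempt in 1 2 3; do
        if globus-url-copy "$SRC" "$DST"; then
            break                  # transfer succeeded
        fi
        echo "attempt $attempt failed, retrying in 60s" >&2
        sleep 60
    done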


PPDG Benefits to BaBar and DØ

PPDG has provided very useful collaboration with, and feedback to, other Grid and Computer Science Groups.

Development of tools and middleware that should be of general interest to the Grid community, e.g.

– BbgUtils.pl

– Condor-G enhancements

Deploying and testing grid middleware under the battlefield conditions of operational experiments hardens the software and helps the CS groups learn what is needed.

The CS groups enable the experiments to examine problems in new, innovative ways, and provide important new technologies for solving them.


The End