R. Pordes, Atlas Software Mtg, May 6th 2002
A snapshot of PPDG Status and Plans
Ruth Pordes, PPDG and iVDGL Coordinator, Computing Division, Fermilab
PPDG “Vertical Integration” of Grid middleware components into HENP experiments’ ongoing work
Laboratory for “experimental computer science”
Common and Reused Components
A snapshot of PPDG
Summary
Status
Slow ramp-up of new activity: CS-11, the Analysis Tools cross-cut activity
Other near term plans
(With thanks to many people who gave the information.)
The running experiments are very cautious about introducing anything new.
Experiment          Location     # physicists   Time scale
BaBar               SLAC         800            1999 - 2010
STAR                BNL / RHIC   450            2000 - 2010
JLab/CLAS           JLAB         200            2000 - 2010
JLab/QCD (theory)   JLAB         30             2003?
D0-Run2             FNAL         800            2001 - 2010
ATLAS               CERN         2000           2007 - 2016
CMS                 CERN         2000           2007 - 2016
A Summary and a Reminder:
PPDG experiments include those taking data now as well as the LHC experiments
CS-10: Experiment Production Grids

Experiment Data Processing Applications / User Analysis Programs

Application Grid Infrastructure:
  CS-1: Job Definition Language and Interface
  CS-11: Analysis Tools
  CS-9: Virtual Organization framework
  Data Delivery and Access framework (CS-12)
  Experiment Catalogs (CS-12)
  Error and Diagnosis framework (CS-13)
  Data Definition and Management (CS-12)
  CS-2: Workload Management

Grid Middleware:
  CS-5: Reliable File Transfer
  CS-2: Job Scheduling
  CS-6: Data Replication Services
  CS-3: Monitoring Framework
  CS-9: Authentication and Authorization

Fabric:
  CS-4: Storage nodes
  CS-3: Monitoring Information Providers
  Databases/Objects, Compute nodes, Networks
  CS-9: Security

Cross-cutting: Monitors, Reporters, Diagnostics; System Managers, Controllers
SciDAC: encouragement to collaborate

SciDAC projects and their connections to PPDG:
- Earth System Grid II
- Collaboratory for Multi-Scale Chemical Science
- National Fusion Collaboratory
- DOE Science Grid: collaborative development of CA and RA policies, and continued work on the PMA etc.
- Pervasive Collaborative Computing Environment and Reliable and Secure Group Communication
- A High-Performance Data Grid Toolkit: many Globus (ANL and ISI) developments being used by PPDG.
- CoG Middleware: application developers starting to show interest in higher-level language interfaces to Globus.
- Scientific Annotation Middleware
- Storage Resource Management for Data Grid Applications: co-collaborators with PPDG; development of SRM interfaces done in collaboration.
- Middleware to Support Group to Group Collaboration
- Distributed Security Architectures
- Security and Policy for Group Collaboration: SiteAAA working closely with Globus CAS.
- Scientific Data Management: an application scientist from STAR is working with the collaboration; expect more interaction on Analysis Tools.
- Middleware Technology to Support Science Portals: in prototype use by Grappa?
- Optimizing Performance and Enhancing Functionality of Distributed Applications Using Logistical Networking: exploring ways to work together, but to date no mutually beneficial task found.
- Bandwidth Estimation: Measurement Methodologies and Applications: endorsed PPDG collaboration with SLAC.
- Advanced Computing for 21st Century Accelerator Science and Technology
- A National Computational Infrastructure for Lattice Gauge Theory: JLAB collaborators on PPDG are delivering prototype Grid applications for their users.
- Shedding New Light on Exploding Stars: Terascale Simulations of Neutrino-Driven Supernovae and their Nucleosynthesis
- SciDAC Center for Supernova Research
SciDAC CoG Kits

Impact and Connections

IMPACT: Allow application developers to make use of Grid services from higher-level frameworks such as Java and Python. Easier development of advanced Grid services. Easier and more rapid application development. Encourage code reuse and avoid duplication of effort amongst the collaboratory projects. Encourage the reuse of Web Services as part of the Grids.

CONNECTIONS: We are working closely with, or as part of, the Globus research project; we work with a variety of major funded applications through SciDAC, NSF, and EU grants, e.g. DOE Science Grid, Earth Systems Grid, Supernova Factory, NASA IPG.

The Novel Ideas:
• Develop a common set of reusable components for accessing Grid services.
• Focus on supporting the rapid development of Science Portals, Problem Solving Environments, and science applications that access Grid resources.
• Develop and deploy a set of "Web Services" that access underlying Grid services.
• Integrate the Grid Security Infrastructure (GSI) into the "Web Services" model. Provide access to higher-level Grid services that are language independent and are described via commodity Web technologies such as WSDL.

Principal Investigators: Gregor von Laszewski, ANL; Keith Jackson, LBL (09/07/2001)
MICS Program Manager: Mary Ann Scott
[Architecture figure: the Java CoG Toolkit and Python CoG Toolkit sit between the Globus Toolkit and commodity Java/Python tools and services, supporting Java-based Grid portals and applications (a High Energy Physics portal, a Biology PSE, a Chemistry Python IDE, an Earth Science Java IDE), a Java distributed programming framework, and a Java CoG Globus service.]
Milestones/Dates/Status: The main goal of this project is to create Software Development Kits in both Java and Python that allow easy access to Grid services.

Provide access to basic Grid services (by year):
- GRAM, MDS, Security, GridFTP (year 1)
- Replica Catalog, co-scheduling (years 1-2)
Composable Components:
- Develop guidelines for component development (year 1)
- Design and implement component hierarchies (years 1-2)
- Develop a component repository (years 2-3)
Web Services:
- Integrate GSI (year 1)
- Develop an initial set of useful web services (years 1-2)
Composable CoG Components
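The "composable components" idea can be caricatured in a few lines of Python: small wrappers around individual Grid services that a script chains together. This is an illustrative sketch only; the class names, the toy replica catalog, and the pipeline interface are invented here and are not the actual CoG Kit API.

```python
# Hypothetical sketch of composable Grid-service components.
# Names and interfaces are illustrative, not the real CoG Kit.

class GridService:
    """Base component: each subclass wraps one underlying service."""
    def run(self, data):
        raise NotImplementedError

class ReplicaLookup(GridService):
    def __init__(self, catalog):
        self.catalog = catalog          # logical name -> replica URLs
    def run(self, logical_name):
        return self.catalog.get(logical_name, [])

class TransferPlanner(GridService):
    def __init__(self, preferred_site):
        self.preferred_site = preferred_site
    def run(self, replicas):
        # Prefer a replica at the local site, else take the first one.
        local = [r for r in replicas if self.preferred_site in r]
        return (local or replicas or [None])[0]

class Pipeline(GridService):
    """Compose components by feeding each one's output to the next."""
    def __init__(self, *stages):
        self.stages = stages
    def run(self, data):
        for stage in self.stages:
            data = stage.run(data)
        return data

catalog = {"run42/event.dat": ["gsiftp://cern/run42/event.dat",
                               "gsiftp://fnal/run42/event.dat"]}
locate = Pipeline(ReplicaLookup(catalog), TransferPlanner("fnal"))
print(locate.run("run42/event.dat"))    # picks the FNAL replica
```

The point of the component hierarchy milestone above is exactly this kind of reuse: the same `TransferPlanner` stage could sit behind a portal, a PSE, or a command-line script.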
PPDG Status: Experiment End-to-End Grids + Common Services to date:
ATLAS – MAGDA - GSI, MDS, (GDMP)
BaBar – Babar data handling system - SRB in prototype
CMS – IMPALA/MOP – GSI, Condor-G, Gram, DAGMAN, MDS
D0 – SAM – GridFTP, (GSI), (MDS), (Condor-G)
JLAB – JASMINE, Replica Catalog Portal
STAR – STACS – HRM, GridFTP
SiteAAA – working to ensure CAS can be used by all experiments. Currently PPDG using EDG VO mechanisms.
All experiments expect to demo at SC2002, which makes a good milestone! These demos are a valuable part of PPDG: they enable the results of the work to be not only demonstrated but introduced into the actual experiment running systems.
CMS IMPALA-MOP production with Condor-G and DAGMan

[Workflow figure: IMPALA-MOP jobs run under Condor-G/DAGMan: stage in the DAR and the cmkin/cmsim wrapper scripts, declare and connect to the CERN RefDB, run the wrapper script, have GDMP publish and transfer the data files, then run an error filter that updates the RefDB.]

Step 1: submit/install DAR file to remote sites
Step 2: submit all CMKIN jobs
Step 3: submit all CMSIM jobs

Assigned 200K events to test MOP; finished the CMKIN part; started the CMSIM part.
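The three-step dependency structure here (CMKIN only after the DAR install, CMSIM only after CMKIN) is what DAGMan manages for MOP. The following is an illustrative toy executor, not MOP or DAGMan code; the job names and the callable interface are made up for the example.

```python
# Toy DAGMan-style executor for the three MOP steps above.
# Illustrative only: real MOP submits each job via Condor-G.

def run_dag(jobs, deps):
    """jobs: name -> callable; deps: name -> prerequisite names.
    Runs every job whose prerequisites have finished, in waves."""
    done, order = set(), []
    while len(done) < len(jobs):
        ready = [j for j in jobs if j not in done
                 and all(d in done for d in deps.get(j, []))]
        if not ready:
            raise RuntimeError("cycle or unsatisfiable dependency")
        for j in sorted(ready):
            jobs[j]()               # here: run locally; in MOP: submit
            done.add(j)
            order.append(j)
    return order

log = []
jobs = {"install_dar": lambda: log.append("install_dar"),
        "cmkin_1": lambda: log.append("cmkin_1"),
        "cmsim_1": lambda: log.append("cmsim_1")}
deps = {"cmkin_1": ["install_dar"], "cmsim_1": ["cmkin_1"]}
print(run_dag(jobs, deps))   # ['install_dar', 'cmkin_1', 'cmsim_1']
```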
Lessons Learned to date (5-site production grid). We need:
— "grid-wide" debugging: the ability to log into a remote site and talk to the System Manager over the phone proved vital, but remote logins and telephone calls are not a scalable solution!
— site configuration monitoring: how are Globus, Condor, etc. configured? What does the GDMP export/import catalog say? Florida and Fermilab currently post this info on the web; should it be monitored by standard monitoring tools?
— programmers to write very robust code!
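The site-configuration check wished for here could look something like this minimal sketch: each site publishes a small attribute dictionary and a central tool flags drift from a baseline. The attribute names and version numbers below are invented for illustration.

```python
# Hypothetical site-configuration drift check; names are illustrative.

EXPECTED = {"globus_version": "2.0", "condor_version": "6.3",
            "gdmp_export_catalog": "enabled"}

def check_site(site_name, reported):
    """Return a list of human-readable mismatches for one site."""
    problems = []
    for key, want in EXPECTED.items():
        got = reported.get(key, "<missing>")
        if got != want:
            problems.append(f"{site_name}: {key} is {got}, expected {want}")
    return problems

florida = {"globus_version": "2.0", "condor_version": "6.2",
           "gdmp_export_catalog": "enabled"}
print(check_site("florida", florida))   # flags the condor_version mismatch
```

Feeding such reports into a standard monitoring framework (CS-3) would replace the hand-maintained web pages mentioned above.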
PPDG Status: Computer Science groups' software extensions and integration:
Globus – GSI, GridFTP, Replica Catalog to support GDMP, CAS modifications
Condor – Classads call outs, Matchmaking at Condor-G level
SRB – extension of MCAT, interface to GridFTP, support for multiple catalogs
SRM – Extensions to HRM/DRM for control and error returns.
Soon ready to test: the Reliable File Transfer layer, the new Replica Location Services, and the Glue Schema (plus hardening as software is used under new conditions and stress).
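The ClassAd matchmaking that Condor extends for Condor-G can be caricatured in a few lines: jobs and machines each advertise attributes plus a Requirements predicate over the other side's ad, and a match requires both predicates to hold. This is an illustrative toy, not the real ClassAd language or any Condor API.

```python
# Toy two-sided ClassAd-style matchmaking; illustrative only.

def match(job_ad, machine_ad):
    """Both ads must satisfy the other side's Requirements."""
    return (job_ad["Requirements"](machine_ad)
            and machine_ad["Requirements"](job_ad))

job = {"Owner": "d0prod", "ImageSize": 300,
       "Requirements": lambda m: m["Memory"] >= 512 and m["Arch"] == "INTEL"}

machines = [
    {"Name": "node1", "Memory": 256, "Arch": "INTEL",
     "Requirements": lambda j: True},
    {"Name": "node2", "Memory": 1024, "Arch": "INTEL",
     "Requirements": lambda j: j["Owner"].startswith("d0")},
]

eligible = [m["Name"] for m in machines if match(job, m)]
print(eligible)   # node1 fails the job's memory requirement
```

The symmetric predicate is the key design point: resource owners keep control over who runs on their machines while jobs express what they need.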
End of PPDG First Year – Internal Reviews
Projects reviewed: GDMP, D0 Job Management, JLAB Replication, MAGDA, STAR-DDM, CMS-MOP, BaBar Database Replication
Questions asked:
• What are the deliverables of your project activity, how has the project met the deliverables to date, and what effort has been contributing to the project?
• What is the deployment plan for your project activity and what is the state of that deployment?
• Has the project benefited from being part of the PPDG work and if so how?
• Has the project been hindered by being part of the PPDG work and if so how?
• What collaborations does your project activity rely on and/or contribute to? Have these been of benefit or a hindrance?
• What is your assessment of the potential for adapting the s/w from this project to other experiments?
• What do you see as the future needs, deliverables and effort needed for the Project Activity
Reviewees answer the questions; reviewers write a short report. The actual reviews are by phone and quite "informal". The goal is input to next year's planning. Two reviews done; five more this week.
CS-11 Analysis Tools
“interface and integrate interactive data analysis tools with the grid and to identify common components and services.”
First:
— identify appropriate individuals to participate in this area, within and from outside of PPDG: several identified from each experiment
— assemble a list of references to white papers, publications, tools and related activities: available at http://www.ppdg.net/pa/ppdg-pa/idat/related-info.html
— produce a white paper style requirements document as an initial view of a coherent approach to this topic: draft circulated by June
— develop a roadmap for the future of this activity: at/after the face-to-face meeting
Generic data flow in HENP?

[Figure: the generic HENP data flow as a chain of filtering steps ("skims", "microDST production", ...), each chosen to make its output a convenient size. The annotated scales run from the full experiment ($100M, 10 yr, 100 people) through 10 yr / 20 people and 1 yr / 50 people / 5x per yr down to 1 mo / 1 person / 100x per yr.]

What's going on in this box? Is this picture anywhere close to reality? Many groups are grappling with the requirements now.
Analysis of large datasets over the Grid
• Dataset does not fit on disk: need access software coupled with the processing; distributed management implementing global experiment and local site policies.
• Demand significantly exceeds available resources: queues are always full. When and how to move the job and/or the data? Global optimization of (or at least not totally random) total system throughput, without too many local constraints (e.g. single points of failure).
• Data and job definition in physicist terminology. For D0-SAM, a web + command-line interface specifies Datasets and Dataset Snapshots, saved in an RDBMS for tracking and reuse. Many "dimensions" or attributes can be combined to define a dataset; definitions can be iterative and extended; new versions are defined as of a specific date.
• Distributed processing and control: schedule, control and monitor access to shared resources (CPU, disk, network). E.g. all D0-SAM job executions pass through a SAM wrapper and are tracked in the database for monitoring and analysis.
• Faults of all kinds occur: preemption, exceptions, resource unavailability, crashes; checkpointing and restart; workflow management to complete failed tasks; error reporting and diagnosis.
• Chaotic and large spikes in load: analysis needs vary widely and are difficult to predict, especially at a sniff of a new discovery.
• Estimation, prediction, planning, partial results: GriPhyN research areas.
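The SAM-style dataset definition described above (attribute dimensions, reusable definitions, dated snapshots) can be sketched as follows. This is a hypothetical illustration, not SAM code; the catalog fields and function names are invented.

```python
# Hypothetical sketch of dataset definition by attribute dimensions,
# with dated snapshots. Not the real SAM schema or API.

from datetime import date

FILE_CATALOG = [
    {"name": "raw_001.dat", "trigger": "dimuon", "added": date(2002, 3, 1)},
    {"name": "raw_002.dat", "trigger": "dimuon", "added": date(2002, 4, 20)},
    {"name": "raw_003.dat", "trigger": "jet",    "added": date(2002, 4, 25)},
]

def define_dataset(**dims):
    """A dataset definition is just its attribute constraints (reusable)."""
    return dims

def snapshot(definition, as_of):
    """Freeze the set of files matching the definition on a given date."""
    return [f["name"] for f in FILE_CATALOG
            if f["added"] <= as_of
            and all(f.get(k) == v for k, v in definition.items())]

dimuon = define_dataset(trigger="dimuon")
print(snapshot(dimuon, date(2002, 3, 15)))   # only raw_001.dat so far
print(snapshot(dimuon, date(2002, 5, 1)))    # raw_001.dat and raw_002.dat
```

The separation matters: the definition is stable and reusable, while each snapshot is a reproducible, dated materialization of it.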
Use Cases

• D0 has use cases and SAM support for some aspects, e.g. submit and execute an analysis job at a site temporarily isolated from the rest of the D0 Grid / the FNAL site. If part of the dataset is not available locally, the system retries until the network is restored, or fails and reports the amount of data unavailable for delivery and processing. Critical for sites with unstable network connectivity; important for all other sites during times of mission-critical analysis. Any output files are catalogued and stored at least locally.
• CMS has use cases in documents from Koen.
• Atlas use cases will be discussed later in this workshop
• Expect also to benefit from the RTAG looking at experiment use cases.
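The D0 use case above (retry while isolated, then report how much data could not be delivered) reduces to a simple pattern, sketched here. This is illustrative only, not SAM code; the file names, sizes, and the fetch interface are made up.

```python
# Illustrative retry-and-report delivery loop for an isolated site.
# Not SAM code; names and interfaces are hypothetical.

def deliver(files, fetch, max_retries=3):
    """files: name -> size in MB; fetch(name) -> True on success.
    Returns (delivered_names, undelivered_mb) so the amount of data
    unavailable for processing can be reported."""
    delivered, undelivered_mb = [], 0
    for name, size in files.items():
        if any(fetch(name) for _ in range(max_retries)):
            delivered.append(name)
        else:
            undelivered_mb += size
    return delivered, undelivered_mb

# Simulated outage: one file cannot transfer while the link is down.
def flaky_fetch(name):
    return name != "remote_only.dat"

files = {"local_cache.dat": 500, "remote_only.dat": 750}
got, missing = deliver(files, flaky_fetch)
print(got, missing)   # reports 750 MB unavailable for processing
```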
e.g. Analysis needs depend on the life-stage of the experiment: the 5 seasons of SAM (DRAFT)

1. Design + Commissioning: Monte Carlo, and test raw data/processing.
2. Early Data Processing: This is the most chaotic of all data handling periods. The data being taken is of inconsistent quality and the off-line processing extremely immature. Some subsets of this data are reconstructed many times. Data selection strategies are put into place and many need to be modified or re-executed. Much of the emphasis is on the RAW data. Integrated luminosity is low.
3. Mid-term: Around the middle of the running period, the reconstruction algorithm begins to stabilize, with new versions needed only every month or two. Much of the early data is reprocessed to provide complete and consistent data sets for physics analysis. The accelerator luminosity reaches new highs. Individual events are selected from raw data and cached at about the 10% level.
4. Late-term Steady-state: By the last third to quarter of the run, the inertia against changing the reconstruction program becomes very large: data accumulates quickly and enthusiasm for change fades. The experiment enters a steady state, and the chaos is at a low. Only partial processing of data, or "fixing", is attempted due to the long lead-times caused by I/O and processing overheads. Raw events continue to be cached at the 10% level. Record luminosities are recorded.
5. Post-run: Some processing is done after the data taking period ends. No new data is added to the input repository. The caches are built and access to the raw data diminishes rapidly.
STACS
http://sdm.lbl.gov/projectindividual.php?ProjectID=STACS
References supplied by PPDG participants to date
• Proposal to NSF for CMS Analysis: an Interactive Grid-Enabled Environment (CAIGEE) - Julian Bunn, Caltech
• Grid Analysis Environment work at Caltech, April 2002 - Julian Bunn, Caltech
• Views of CMS Event Data - Koen Holtman, Caltech
• ATLAS Athena & Grid - Craig Tull, LBNL
• CMS Distributed analysis workshop, April 2001 - Koen Holtman, Caltech
• PPDG-8, Comparison of datagrid tools capabilities - Reagan Moore, SDSC
• Interactivity in a Batched Grid Environment - David Liu, UCB
• Deliverables document from Crossgrid WP4

Portals, UI examples, etc.:
• GENIUS: Grid Enabled web eNvironment for site Independent User job Submission - Roberto Barbera, INFN
• SciDAC CoG Kit (Commodity Grid Kit)
• ATLAS Grid Access Portal for Physics Applications: XCAT, a Common Component Architecture implementation
Tools etc.
• Java Analysis Studio (JAS) - Tony Johnson, SLAC
• Distributed computing with JAS (prototype) - Tony Johnson, SLAC
• Abstract Interfaces for Data Analysis (AIDA) - Tony Johnson, SLAC
• BlueOx: Distributed Analysis with Java - Jeremiah Mans, Princeton
• Parallel ROOT Facility (PROOF) - Fons Rademakers, CERN
• Integration of ROOT and SAM - Gabriele Garzoglio, FNAL
• Clarens Remote Analysis - Conrad Steenberg, Caltech
• IMW: Interactive Master-Worker Style Parallel Data Analysis Tool on the Grid - Miron Livny, Wisconsin
• SC2001 demo of Bandwidth Greedy Grid-enabled Object Collection Analysis for Particle Physics - Koen Holtman, Caltech
CS-11: Short-term Status

• The requirements document is now in the process of being outlined (Joseph Perl, Doug Olson), based on posted contributions.
• A workshop is being planned to bring people together at LBL in mid June (18?19?). We won't know more specifics until after the meeting. Clearly experiments are starting to think about remote analysis (D0), analysis for Grid simulation production (CMS), and ATLAS/ALICE.
• Many experiments (will) use ROOT (& Carrot? PROOF?). In conjunction with a Run2 visit to Fermilab, Rene will have discussions with PPDG and CS groups in the last week of May.
• Need to identify the narrow band in which PPDG can be a contributor rather than just adding to the meeting load: keep to our mission of using/extending existing tools "for real" over the short/medium term (but encourage and do not derail needed longer-term development work!)
Other Near Term Plans for PPDG
• Job Management and Scheduling Workshop. Common components proposed to date are GRAM, ClassAds, GSI, DAGMAN:
— Review the model of Grid Job and Data Distribution and Scheduling.
— Review experiment technical requirements.
— Understand if cross-cut activities are appropriate.
• VO Policies and Procedures: work with SiteAAA, CAS, DOE Science Grid and the experiments to put in place a US VO process and support; expecting the security people to call a phone meeting here.
• Extend contributions to and use of Glue and VDT.
• Continue and extend collaboration as part of US Physics Grid Projects and international grid projects serving HENP experiments.
• Write Year 2 Plan.
• Look towards SC2002 demos and experiment data challenges as practical milestones.