David Colling GridPP Edinburgh 6th November 2001
SAM ... an overview
(Many thanks to Vicky White, Lee Lueking and Rod Walker)
http://d0db.fnal.gov/sam
SAM stands for “Sequential Access to Data via Metadata”, where “sequential” refers to the events stored within files.
The current SAM development team includes: Lauri Loebel-Carpenter, Lee Lueking*, Carmenita Moore, Igor Terekhov, Julie Trumbo, Sinisa Veseli, Matthew Vranicar, Stephen P. White, Victoria White*. (*project leaders)
Recently there has also been some work in the UK by Rod Walker.
History of SAM
Project started in 1997
Built for the DØ “virtual organisation” (~500 physicists, 72 institutions, 18 countries)
SAM’s objectives are:
• to provide a worldwide system of shareable computing and storage resources, and so a solution to the common problem of extracting physics results from about a Petabyte of data (c. 2003)
• to provide a large degree of transparency to the user, who makes requests for datasets, submits jobs and stores files (together with extensive metadata about the processing steps etc.)
Currently SAM’s storage and delivery of data are far more advanced than its job submission.
SAM is an operational prototype of many of the concepts being developed for Grid computing.
Overview of SAM
[Diagram components: Database Server(s) (Central Database), Name Server, Global Resource Manager(s), Log Server, Station 1 Servers, Station 2 Servers, Station 3 Servers ... Station n Servers, Mass Storage System(s). Legend: Shared Globally, Local, Shared Locally. Arrows indicate control and data flow.]
The Name Server allows all components to find each other by name.
The Database Server has numerous methods which process transactions and retrieve information from the central database.
The Resource Manager controls efficient use of resources such as tape stores.
The Log Server gathers information from the entire system for monitoring and debugging.
All communication is via CORBA.
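The idea of components finding each other by name can be sketched as a small registry. This is only an illustration in plain Python: the real system uses CORBA, and the class name, component names and addresses below are all hypothetical.

```python
class NameServer:
    """Toy in-memory registry mapping a component name to an address.

    Illustration only: SAM itself communicates via CORBA.
    """

    def __init__(self):
        self._registry = {}

    def register(self, name, address):
        # A later registration under the same name replaces the earlier one.
        self._registry[name] = address

    def resolve(self, name):
        try:
            return self._registry[name]
        except KeyError:
            raise LookupError(f"no component registered as {name!r}") from None


# Hypothetical component names and addresses.
ns = NameServer()
ns.register("db-server", "host-a:9001")
ns.register("log-server", "host-b:9002")
log_address = ns.resolve("log-server")
```

A component only needs to know the name server and the name of its peer; the peer's location can change without the caller being reconfigured.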
The SAM station
A SAM station is deployed on local processing platforms
A station’s set of CPU and disk resources is not shared outside that station.
Stations can communicate directly with each other, and data cached at one station can be replicated at other stations on demand.
Groups of stations at one physical site can share a locally available mass storage system (e.g. Fermilab).
The SAM station
The station’s responsibilities include:
Storing and retrieving data files from mass storage and other stations.
Managing data stored on cache disk.
Launching Project managers which oversee the processing of data requests by consumers in well defined projects.
All these functions are provided by the servers within a station (see next slide).
The SAM Station
[Diagram components: Station & Cache Manager, File Stager(s), File Storage Server, File Storage Clients, Project Managers, Producers/Consumers, eworkers, Cache Disk, Temp Disk, MSS or Other Station (shown twice). Legend: data flow, control.]
The SAM Station
The Station Manager oversees the removal of files cached on disk, and instructs the File Stager to add new files.
All processing projects are started through the Station Server, which starts Project Managers.
Files are added to the system through the File Storage Server (FSS), which uses the Stagers to initiate transfers to the available MSS or another station.
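The interplay between removing cached files and asking a stager for new ones can be sketched as follows. The slides do not say which replacement policy SAM uses; least-recently-used eviction is assumed here purely for illustration, and the class, its parameters and the file names are hypothetical.

```python
from collections import OrderedDict


class CacheManager:
    """Toy disk cache: evicts least recently used files to make room.

    LRU is an assumption for this sketch, not SAM's documented policy.
    """

    def __init__(self, capacity_kb, stage_file):
        self.capacity_kb = capacity_kb
        self.stage_file = stage_file   # callable standing in for the File Stager
        self.files = OrderedDict()     # file name -> size in KB, oldest first
        self.used_kb = 0

    def request(self, name, size_kb):
        if name in self.files:
            self.files.move_to_end(name)   # mark as recently used
            return "hit"
        if size_kb > self.capacity_kb:
            raise ValueError(f"{name} is larger than the whole cache")
        # Remove old files until the new one fits.
        while self.used_kb + size_kb > self.capacity_kb:
            _, old_size = self.files.popitem(last=False)
            self.used_kb -= old_size
        self.stage_file(name)              # ask the stager to fetch the file
        self.files[name] = size_kb
        self.used_kb += size_kb
        return "staged"


staged = []                                # records what the stager was asked for
cache = CacheManager(capacity_kb=100, stage_file=staged.append)
results = [
    cache.request("a", 60),
    cache.request("b", 30),
    cache.request("a", 60),                # already cached
    cache.request("c", 40),                # forces one eviction
]
```

In this run file "b" is evicted rather than "a", because "a" was requested again more recently.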
The SAM Station
A Station Job Manager provides services to execute a user application, script, or series of jobs, potentially as parallel processes, either interactively or by use of a local batch system.
LSF and FBS are currently supported; Condor and PBS adapters are under construction and being tested.
The station Cache Manager and Job Manager are implemented as a single “Station Master” server.
Job submission, and synchronization between job execution and data delivery, are currently part of SAM. Jobs are put on hold in batch system queues until data files are available to the job. At present, jobs submitted at one station may only be run using the batch system(s) available at that station.
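Holding jobs until their data files are available can be sketched as below. This is a minimal illustration, not SAM's batch adapter code: the class, function, job names and file names are hypothetical, and a real batch system would be told to release the job rather than just reporting its name.

```python
class HeldJob:
    """A job that stays on hold while any of its input files is missing."""

    def __init__(self, name, needed_files):
        self.name = name
        self.waiting_for = set(needed_files)

    @property
    def state(self):
        return "held" if self.waiting_for else "released"


def deliver(jobs, file_name):
    """Record that a file has arrived; return names of jobs released by it."""
    released = []
    for job in jobs:
        if file_name in job.waiting_for:
            job.waiting_for.discard(file_name)
            if not job.waiting_for:
                released.append(job.name)
    return released


jobs = [HeldJob("reco", ["f1", "f2"]), HeldJob("skim", ["f2"])]
first = deliver(jobs, "f2")    # "skim" has everything it needs
second = deliver(jobs, "f1")   # now "reco" does too
```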
The User Interface
UIs are provided to add data, access data, set configuration parameters and monitor the system.
These take the form of a Unix command line, Web GUIs and a Python API. There is also a C++ interface for accessing data through a standard DØ framework package.
Defining a dataset
Examining a predefined dataset
Querying Cached Files
The SAM station
Real Data files from FNAL
MC files from NIKHEF
The SAM station
The SAM submit command
sam submit --defname=run129194_reco --cpu-per-event=2m --group=dzero --batch-system-flags="--universe=vanilla --output=condor.out --log=condor.log --error=condor.error --initialdir=/home/walker/TestSam/blife/BLifetime_x-run13264x_reco_p1004 --arguments='-rcp framework.rcp -input_file SAMInput: -output_file outputfile -out BLifetime_x.out -log BLifetime_x.log -time -mem'" --framework-exe=./BLifetime_x
Starts a project and submits the job to the Condor batch system.
The DØ SAM World
[Map labels: Fermilab, MSU, Columbia, UTA (64), Lyon/IN2P3 (100), Prague (32), Imperial College, Lancaster (200), NIKHEF (50). Networks shown: SuperJanet, SURFnet, ESnet, Abilene. A legend symbol marks MC production centers.]
Also a UCL-CDF-test station
SAM Works now!
#Transfers initiated between 9:30 and 12:30 (Thursday 25 Oct 2001)
+---------------------+--------------------------+--------+---------------+
| from station        | to station               | #files | tot_size (KB) |
+---------------------+--------------------------+--------+---------------+
| ccin2p3-analysis    | central-analysis         |     51 |       4694053 |
| central-analysis    | clued0                   |     43 |       4970595 |
| central-analysis    | enstore                  |    138 |      35715952 |
| central-analysis    | imperial-test            |     19 |       6499500 |
| datalogger-d0olb    | enstore                  |     54 |      21833665 |
| datalogger-d0olc    | enstore                  |     34 |       5638370 |
| enstore             | central-analysis         |     20 |       2836508 |
| enstore             | clued0                   |     20 |       5290084 |
| enstore             | linux-analysis-cluster-1 |     27 |       8207554 |
| hoeve               | central-analysis         |     67 |      25890902 |
| lancs               | central-analysis         |     21 |       5588544 |
| prague-test-station | central-analysis         |      2 |       1530437 |
| uta-hep             | central-analysis         |      5 |       1165404 |
+---------------------+--------------------------+--------+---------------+
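To get a feel for the volume in this three-hour window, the rows can be summed. The sketch below simply copies the figures from the table into Python; the variable names are illustrative.

```python
from collections import defaultdict

# (from station, to station, number of files, total size in KB), as in the table
transfers = [
    ("ccin2p3-analysis", "central-analysis", 51, 4694053),
    ("central-analysis", "clued0", 43, 4970595),
    ("central-analysis", "enstore", 138, 35715952),
    ("central-analysis", "imperial-test", 19, 6499500),
    ("datalogger-d0olb", "enstore", 54, 21833665),
    ("datalogger-d0olc", "enstore", 34, 5638370),
    ("enstore", "central-analysis", 20, 2836508),
    ("enstore", "clued0", 20, 5290084),
    ("enstore", "linux-analysis-cluster-1", 27, 8207554),
    ("hoeve", "central-analysis", 67, 25890902),
    ("lancs", "central-analysis", 21, 5588544),
    ("prague-test-station", "central-analysis", 2, 1530437),
    ("uta-hep", "central-analysis", 5, 1165404),
]

total_files = sum(n_files for _, _, n_files, _ in transfers)
total_kb = sum(size_kb for _, _, _, size_kb in transfers)

# Files received per destination station.
files_to = defaultdict(int)
for _, destination, n_files, _ in transfers:
    files_to[destination] += n_files

print(total_files, total_kb)
```

For the table above this gives 501 files and 129,861,568 KB in total, with most files going to central-analysis and enstore.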
The Fabric
Compute systems and storage systems are located in the US (Fermilab, UTA, Columbia, MSU), France (Lyon-IN2P3), the UK (Lancaster and Imperial College), the Netherlands (NIKHEF) and the Czech Republic (Prague).
Many other sites are expected to provide additional compute and storage resources when the experiment moves from commissioning to physics data taking.
Storage systems consist of disk storage elements at all locations and robotically controlled tape libraries at Fermilab, Lyon, NIKHEF and (almost) Lancaster.
All storage elements support the basic functions of storing or retrieving a file. Some support parallel transfer protocols, currently via bbftp.
The underlying storage management systems for tape storage elements are different at Fermilab, Lyon and NIKHEF. Fermilab’s tape storage management system, Enstore, provides the ability to assign priorities and file placement instructions to file requests, and provides reports about placement of data on tape, queue wait time, transfer time and other information that can be used for resource management.
Interim Conclusions
SAM is a sophisticated tool for data transfer, and a less sophisticated tool for job submission.
SAM works now, and has real users!
SAM is an operational prototype of many of the concepts being developed for Grid computing.
Interim Conclusions
However, significant parts of SAM will have to be enhanced (or replaced) before it can truly claim to be a data grid. This work will happen as part of the Particle Physics Data Grid (PPDG) project.
Current status will be in black, planned enhancements will be in bold red. The following slides are extracts from Vicky White’s talk “SAM and PPDG” at CHEP 2001.
[Layered architecture diagram. Layers named on the slide: Client Applications, Collective Services, Connectivity and Resource, Fabric. A name in “quotes” is a SAM-given software component name. The legend marks components that will be replaced, and components that will be enhanced or added using PPDG and Grid tools.]
Client Applications: Web; Python codes, Java codes; Command line; D0 Framework C++ codes
Services and managers: Request Formulator and Planner; Request Manager; Cache Manager; Job Manager; Storage Manager; Job Services; Data Mover; SAM Resource Management; Batch Systems - LSF, FBS, PBS, Condor; Significant Event Logger; Naming Service; Database Manager; Catalog Manager
SAM component names: “Dataset Editor”, “File Storage Server”, “Project Master”, “Station Master”, “Station Master”, “Stager”, “Optimiser”
Connectivity and Resource: CORBA; UDP; file transfer protocols - ftp, bbftp, rcp, GridFTP; Mass Storage systems protocols, e.g. encp, hpss; Catalog protocols
Authentication and Security: GSI; SAM-specific user, group, node, station registration; bbftp ‘cookie’
Fabric: Tape Storage Elements; Disk Storage Elements; Compute Elements; LANs and WANs; Code Repository; Resource and Services Catalog; Replica Catalog; Meta-data Catalog
Enhancing SAM
The Job Manager is limited and can only submit to local resources.
The specification of user jobs, including their characteristics and input datasets, is a major component of the PPDG work.
The intention is to provide Grid job services components that replace the SAM job services components. This will support job submission (including composite and parallel jobs) to suitable SAM Station(s) and eventually any available Grid computing resource.
Enhancing SAM
Unix user names, physics groups, nodes, domains and stations are registered. Valid combinations of these must be provided to obtain services. Station servers at one station provide service on behalf of their local users and are ‘trusted’ by other Station servers or Database Servers. Use of Globus core Security Infrastructure services is a planned PPDG enhancement of the system.
Service registration and discovery is implemented using a CORBA naming service, with namespaces organised by station name. APIs to services in SAM are all defined using CORBA Interface Definition Language and have multiple language bindings (C++, Python, Java) and, in many cases, a shell interface.
Use of GridFTP and other standard protocols to access storage elements is a planned PPDG modification to the system.
Integration with grid monitoring tools and approaches is a PPDG area of research.
Registration of resources and services using a standardized Grid registration or enquiry protocol is a PPDG enhancement to the system.
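The requirement that a valid combination of registered names be presented before a service is granted can be sketched as a simple membership check. Everything below (class names, field choice and the example values) is hypothetical and is not taken from SAM's actual registration schema.

```python
from typing import NamedTuple


class Identity(NamedTuple):
    """One combination of registered names (fields chosen for illustration)."""
    user: str
    group: str
    node: str
    station: str


class Registry:
    """Toy registry: a request is served only for a registered combination."""

    def __init__(self):
        self._valid = set()

    def register(self, identity):
        self._valid.add(identity)

    def authorize(self, identity):
        return identity in self._valid


# Hypothetical example values.
registry = Registry()
registry.register(Identity("alice", "dzero", "node01", "imperial-test"))

known = registry.authorize(Identity("alice", "dzero", "node01", "imperial-test"))
wrong_station = registry.authorize(Identity("alice", "dzero", "node01", "lancs"))
```

Each name being registered on its own is not enough in this sketch: the combination as a whole has to be known.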
Enhancing SAM
Database Managers provide access to the Replica Catalog, Metadata Catalog, SAM Resource and Configuration Catalog and Transformation Catalog.
All catalogs are currently tables in a central Oracle database, a fact that is hidden from their clients. Replication of some catalogs in two or more locations worldwide is a planned enhancement to the system.
Database managers will need to be enhanced to adapt SAM-specific APIs and catalog protocols onto Grid catalog APIs using PPDG-supported Grid protocols, so that information may be published and retrieved in the wider Physics Data Grid that spans several virtual organizations.
A central Logging server receives significant events. This will be refined to receive only summary-level information, with more detailed monitoring information held at each site.
Work in the context of PPDG will examine how to use a Grid Monitoring Architecture and tools.
Enhancing SAM
Resource manager services are provided by an “Optimization” service. File transfer actions are prioritized and authorized prior to being executed. The current primitive functionality of re-ordering and grouping file requests, primarily to optimize access to tapes, will need to be greatly extended, redesigned and re-implemented to deal better with co-location of data with computing elements, and with fair-share and policy-driven use of all computing, storage and network resources. This is a major component of the SAM/PPDG work, to be carried out in collaboration with the Condor team.
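Re-ordering and grouping file requests to optimize tape access can be sketched as follows. This is only an illustration of the general idea, not the “Optimiser” itself: the function name, the tape labels and the notion of a numeric position on tape are assumptions.

```python
from collections import defaultdict


def plan_tape_reads(requests):
    """Group requests by tape and read each tape in position order.

    requests: iterable of (file_name, tape_label, position_on_tape).
    Returns a list of (tape_label, [file names in read order]); tapes keep
    the order in which they were first requested.
    """
    by_tape = defaultdict(list)
    for file_name, tape, position in requests:
        by_tape[tape].append((position, file_name))
    return [
        (tape, [name for _, name in sorted(entries)])
        for tape, entries in by_tape.items()
    ]


# Hypothetical requests arriving in an order that would thrash between tapes.
requests = [
    ("f1", "T01", 30),
    ("f2", "T02", 5),
    ("f3", "T01", 10),
    ("f4", "T02", 1),
]
plan = plan_tape_reads(requests)
```

Served in arrival order, these four requests would need a tape change for every file; grouped, each tape is mounted once and read in a single pass.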
Enhancing SAM
Other enhancements are also needed for scalability: e.g. SAM relies on a single Oracle database, which is a single point of failure and needs replication or caching.
Conclusions
SAM already does a lot and planned enhancements will give it far greater functionality.