20
1 DIRAC agents A.Tsaregorodtsev, CPPM, Marseille ARDA Workshop, 7 March 2005, CERN

DIRAC agents

Embed Size (px)

DESCRIPTION

DIRAC agents. A.Tsaregorodtsev, CPPM, Marseille. ARDA Workshop, 7 March 2005, CERN. Outline. DIRAC in a nutshell Architecture and components Agent based deployment Conclusion. http://dirac.cern.ch. DIRAC in a nutshell. - PowerPoint PPT Presentation

Citation preview

Page 1: DIRAC agents

1

DIRAC agents

A.Tsaregorodtsev, CPPM, Marseille

ARDA Workshop, 7 March 2005, CERN

Page 2: DIRAC agents

2

Outline

DIRAC in a nutshell

Architecture and components

Agent based deployment

Conclusion

http://dirac.cern.ch

Page 3: DIRAC agents

3

DIRAC in a nutshell

LHCb grid system for the Monte-Carlo simulation data production and analysis

Integrates computing resources available at LHCb production sites as well as on the LCG grid

Composed of a set of light-weight services and a network of distributed agents to deliver workload to computing resources

Runs autonomously once installed and configured on production sites

Implemented in Python, using XML-RPC service access protocol

DIRAC – Distributed Infrastructure with Remote Agent Control

Page 4: DIRAC agents

4

DIRAC scale of resource usage

Deployed on 20 “DIRAC”, and 40 “LCG” sites

Effectively saturated LCG and all available computing resources during the 2004 Data Challenge

Supported up to 4500 simultaneous jobs across 60 sites

Produced, transferred, and replicated 80 TB of data, plus meta-data

Consumed over 425 CPU years during last 4 months

Page 5: DIRAC agents

5

DIRAC design goals

Light implementation Must be easy to deploy on various platforms Non-intrusive

• No root privileges, no dedicated machines on sites Must be easy to configure, maintain and operate

Using standard components and third party developments as much as possible

High level of adaptability There will be always resources outside LCGn domain• Sites that can not afford LCG, desktops, …

We have to use them all in a consistent way Modular design at each level

Adding easily new functionality

Page 6: DIRAC agents

6

DIRAC Services and Resources

DIRAC JobManagement

Service

DIRAC JobManagement

Service

DIRAC CEDIRAC CEDIRAC CEDIRAC CE

DIRAC CEDIRAC CE

DIRAC SitesDIRAC Sites

AgentAgent

Productionmanager

Productionmanager GANGA UIGANGA UI User CLI User CLI

JobMonitorSvcJobMonitorSvc

JobAccountingSvcJobAccountingSvc

AccountingDB

Job monitorJob monitor

ConfigurationSvcConfigurationSvc

FileCatalogSvcFileCatalogSvcBookkeepingSvcBookkeepingSvc

BK query webpage BK query webpage

FileCatalogbrowser

FileCatalogbrowser

DIRACservices

DIRAC StorageDIRAC Storage

DiskFileDiskFile

gridftpgridftp

LCGLCGResourceBroker

ResourceBroker

CE 1CE 1 CE 2CE 2

CE 3CE 3AgentAgent

DIRACresources

FileCatalog

AgentAgentAgentAgent

Page 7: DIRAC agents

7

Agent ContainerAgent Container

JobAgentJobAgent

PendingJobAgentPendingJobAgent

BookkeepingAgentBookkeepingAgent

TransferAgentTransferAgent

MonitorAgentMonitorAgent

CustomAgentCustomAgent

DIRAC: Agent modular design

Agent is a container of pluggable modules Modules can be added dynamically

Several agents can run on the same site Equipped with different set of modules as defined in their configuration

Data management is based on using specialized agents running on the DIRAC sites

Page 8: DIRAC agents

8

DIRAC workload management

Realizes PULL scheduling paradigm

Agents are requesting jobs whenever the corresponding resource is free

Using Condor ClassAd and Matchmaker for finding jobs suitable to the resource profile

Agents are steering job execution on site

Jobs are reporting their state and environment to central Job Monitoring service

Task queueTask queue

Job DB

Task queueOptimizer 1

Optimizer 2Job Receiver

MatchMaker

Agent AgentAgent

ComputingElement ComputingElement

ComputingElement

Job Monitor Svc

WMS

Job

Us

er

Inte

rfa

ce Task queueTask queue

Task queueTask queue

Job DB

Task queueTask queueOptimizer 1

Optimizer 2Job Receiver

MatchMaker

Agent AgentAgent

ComputingElement ComputingElement

ComputingElement

Job Monitor Svc

WMS

Job

Us

er

Inte

rfa

ce

Page 9: DIRAC agents

9

Data management tools DIRAC Storage Element is a combination of a standard server and a description of its access in the Configuration Service Pluggable transport modules: gridftp,bbftp,sftp,ftp,http, …

DIRAC ReplicaManager interface (API and CLI) get(), put(), replicate(), register(), etc

Reliable file transfer

Request DB keeps outstanding transfer requests A dedicated agent takes file transfer request and retries it until it is successfully completed Using WMS for data transfer monitoring

Transfer Agent

Requests DB

JobData ManagerData Optimizer

Local SE Remote SE cache

Site Data Management

Transfer Agent

Requests DBRequests DB

JobData ManagerData Optimizer

Local SE Remote SE cache

Site Data Management

Page 10: DIRAC agents

10

Dynamically deployed agents

How to involve the resources where the DIRAC agents are not yet installed or can not be installed ?

Workload management with resource reservation Sending agent as a regular job Turning a WN into a virtual LHCb production site

This strategy was applied for DC04 production on LCG: Effectively using LCG services to deploy DIRAC infrastructure on the LCG resources

Efficiency: >90 % success rates for DIRAC jobs on LCG While 60% success rates of LCG jobs

• No harm for the DIRAC production system One person ran the LHCb DC04 production in LCG

Page 11: DIRAC agents

11

Dynamically deployed agents (2)

Agent installation pseudo-script:> wget http://…/dirac-install> dirac-install [–g]> dirac-agent etc/Transfer.ini> dirac-agent etc/Job.ini

Agents are written in python Guaranteed friendly environment for running

Necessary globus/lcg libraries and tools are shipped together with agent distribution Needed outside LCG Packed in a platform independent way (R.Santinelli)• http://grid-deployment.web.cern.ch/grid-deployment/eis/docs/lcg-util-client.tar.

gz

Page 12: DIRAC agents

12

Instant Messaging in DIRAC

Jabber/XMPP IM asynchronous, buffered, reliable messaging framework connection based

• Authenticate once • “tunnel” back to client – bi-directional connection with justoutbound connectivity (no firewall problems, works with NAT)

Used in DIRAC to send control instructions to components (services,

agents, jobs)• XML-RPC over Jabber• Broadcast or to specific destination

monitor the state of components interactivity channel with running jobs

Page 13: DIRAC agents

13

Agent based deployment

Most of the DIRAC subsystems are a combination of central services and distributed agents

Mobile agents as easily deployable software components can form the basis of the new paradigm for the grid middleware deployment The application software is deployed by a special kind of jobs / agent

The distributed grid middleware components (agents) can be deployed in the same way

Page 14: DIRAC agents

14

Deployment architecture

CE

Host CE

SRM WMS agentDMS agent

Mon agent

Site services

VO services

WMS

DMS

Catalogs Config

AccountingVOMS

Central services

Local services

WMS agent

Monitoring

VO disk space

Page 15: DIRAC agents

15

Host CE

Standard CE with specific configuration

Restricted access – for VO admins only

Out/Inbound connectivity

Visible both to external WAN and to the Intranet

Provides VO dedicated disk space, database access: VO software Requests database

No wall clock time limit for jobs

No root privileges for jobs

Page 16: DIRAC agents

16

Natural distribution of responsibilities

Site is responsible for providing resources Site administrators should be responsible only for deploying software providing access to resources• Computing resources – CE (Condor-C ?)• Storage resources – SE (SRM ?)

Managing the load on the grid is the responsibility of the VO VO administrators should be responsible for deploying VO services components - even those running on sites

Page 17: DIRAC agents

17

Services breakdown Inspired from the Dirac experience Deployment by RCs of basic services:

Fabric Basic CE for WM and SRM for DM (on top of MSS and/or DMP)

File access services (rootd / xrootd / rfio…)

Deployment by VOs of higher level services Using deployment facilities made available on site

Additional baseline services File catalogue(s): must be robust and performant

Other services: a la carte

Page 18: DIRAC agents

18

TapeDisk

MSS

SRM

WNs

LSF

WMS agent

Storagesystem

FRCCentralservices

Localservices

Monitoring&

accounting

Fabric

Deployedcentral

Services

WMS

LocalRC

HostService

CECE

Mon. agent

Transfer agent

rfiodrootd

FilePlacement

System

Batchsystem

Application

GFAL

xxftpd

Bas

elin

e se

rvic

esV

O

serv

ices

Page 19: DIRAC agents

19

File Catalogs

DIRAC incorporated 2 different File Catalogs Replica tables in the

LHCb Bookkeeping Database

File Catalog borrowed from the AliEn project

MySQLMySQL

AliEn FCAliEn FC AliEn UIAliEn UI

XML-RPCserver

XML-RPCserver

AliEn FCClient

AliEn FCClient

ORACLEORACLE

LHCbBK DBLHCbBK DB

XML-RPCserver

XML-RPCserver

BK FCClientBK FCClient

FC ClientFC ClientDIRAC

Application,Service

DIRACApplication,

Service

AliEn FileCatalog ServiceAliEn FileCatalog Service

BK FileCatalog Service BK FileCatalog Service

FileCatalog ClientFileCatalog Client

MySQLMySQL

AliEn FCAliEn FC AliEn UIAliEn UI

XML-RPCserver

XML-RPCserver

AliEn FCClient

AliEn FCClient

ORACLEORACLE

LHCbBK DBLHCbBK DB

XML-RPCserver

XML-RPCserver

BK FCClientBK FCClient

FC ClientFC ClientDIRAC

Application,Service

DIRACApplication,

Service

AliEn FileCatalog ServiceAliEn FileCatalog Service

BK FileCatalog Service BK FileCatalog Service

FileCatalog ClientFileCatalog Client

Both catalogs have identical client API’s Can be used interchangeably This was done for redundancy and for gaining experience

Other catalogs will be interfaced in the same way

LFC FiReMan

Page 20: DIRAC agents

20

Conclusions

DIRAC provides an efficient MC production system

Data reprocessing with DIRAC has started Using DIRAC for analysis is still to be demonstrated

DIRAC needs further developments and integration of third party components

Relations with other middleware developments are still to be defined following thorough assessments

We believe that DIRAC experience can be useful for the LCG and other grid middleware developments and deployment