ATLAS Production
Kaushik De
University of Texas at Arlington
LHC Computing Workshop, Ankara
May 2, 2008
Outline
- Computing Grids
- Tiers of ATLAS
- PanDA Production System
- MC Production Statistics
- Distributed Data Management
- Operations Shifts

Thanks to T. Maeno (BNL), R. Rocha (CERN), and J. Shank (BU) for some of the slides presented here.
EGEE Grid
+ NorduGrid: NDGF (Nordic countries)
OSG – US Grid
Tiers of ATLAS
- 10 Tier 1 centers: Canada, France, Germany, Italy, Netherlands, Nordic Countries, Spain, Taipei, UK, USA
- ~35 Tier 2 centers: Australia, Austria, Canada, China, Czech R., France, Germany, Israel, Italy, Japan, Poland, Portugal, Romania, Russian Fed., Slovenia, Spain, Switzerland, Taipei, UK, USA
- ? Tier 3 centers
Tiered Example – US Cloud
- BNL T1
- MW T2: UC, IU
- NE T2: BU, HU
- SW T2: UTA, OU
- GL T2: UM, MSU
- SLAC T2
- Tier 3s: IU OSG, UC Teraport, UTA DPCC, LTU, UTD, OU OSCER, SMU, Wisconsin
Data Flow in ATLAS
Storage Estimate
Production System Overview
[Diagram: tasks requested by the Physics Working Groups are approved and entered in ProdDB; Panda sends the jobs to the clouds (CA, DE, ES, FR, IT, NL, TW, UK, US; NDGF coming) and the output files are registered in DQ2.]
Panda
- PanDA = Production ANd Distributed Analysis system
  - Designed for analysis as well as production
  - Project started Aug 2005, prototype Sep 2005, in production Dec 2005
  - Works with both OSG and EGEE middleware
- A single task queue and pilots
  - Apache-based central server
  - Pilots retrieve jobs from the server as soon as CPU is available → low latency
- Highly automated, has an integrated monitoring system, and requires little operations manpower
- Integrated with the ATLAS Distributed Data Management (DDM) system
- Not exclusively ATLAS: has its first OSG user, CHARMM
Cloud
[Diagram: within a cloud, Panda sends jobs to the Tier 1 and the Tier 2s; input files are dispatched from Tier 1 storage to Tier 2 storage, and output files are aggregated back to Tier 1 storage.]
Panda/Bamboo System Overview
[Diagram: Bamboo pulls jobs from ProdDB and submits them to the Panda server; end-users also submit jobs directly over https. Autopilot submits pilots to sites A and B via condor-g; the pilots run on the worker nodes and pull jobs from the Panda server over https. The server sends logs to the logger over http and talks to LRC/LFC and DQ2.]
Panda server
[Diagram: clients and pilots talk https to the Panda server (Apache + gridsite) backed by PandaDB; the server also connects to the logger, LRC/LFC, and DQ2.]
- Central queue for all kinds of jobs
- Assign jobs to sites (brokerage, sketched below)
- Set up input/output datasets
  - Create them when jobs are submitted
  - Add files to output datasets when jobs are finished
- Dispatch jobs
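Brokerage can be pictured as a filter-and-score pass over the candidate sites. Below is a minimal sketch with invented `Site` fields; the real Panda brokerage also weighs data locality, pilot rates, and task priorities.

```python
from dataclasses import dataclass

@dataclass
class Site:
    name: str
    releases: set            # ATLAS software releases installed at the site
    queued: int = 0          # jobs assigned and waiting for pilots
    running: int = 0         # jobs currently running

def broker(job_release, sites):
    """Pick a site that has the release and the emptiest relative queue."""
    candidates = [s for s in sites if job_release in s.releases]
    if not candidates:
        raise LookupError("no site provides release " + job_release)
    # Fewer queued jobs per running job means a pilot should pick the
    # new job up sooner (the low latency of the earlier Panda slide).
    return min(candidates, key=lambda s: s.queued / max(s.running, 1))
```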
Bamboo
[Diagram: a cron triggers Bamboo, which reads prodDB via cx_Oracle and talks to the Panda server (Apache + gridsite) over https.]
- Get jobs from prodDB and submit them to Panda
- Update job status in prodDB
- Assign tasks to clouds dynamically
- Kill TOBEABORTED jobs
- A cron triggers the above procedures every 10 min (see the sketch below)
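As a rough picture of that cycle, a sketch with hypothetical `proddb`/`panda` handles standing in for the cx_Oracle and https calls:

```python
import time

def bamboo_cycle(proddb, panda):
    """One pass over the four Bamboo duties listed above."""
    for job in proddb.fetch_new_jobs():           # get jobs from prodDB...
        panda.submit(job)                         # ...and submit them to Panda
    for job in panda.jobs_with_new_status():
        proddb.update_status(job)                 # update job status in prodDB
    panda.assign_tasks_to_clouds()                # dynamic task-to-cloud assignment
    for job in proddb.jobs_in_state("TOBEABORTED"):
        panda.kill(job)                           # kill TOBEABORTED jobs

def run_forever(proddb, panda, period=600):
    """Stands in for the cron: trigger the procedures every 10 min."""
    while True:
        bamboo_cycle(proddb, panda)
        time.sleep(period)
```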
Client-Server Communication
- HTTP/S-based communication (curl + grid proxy + python)
- GSI authentication via mod_gridsite
- Most communications are asynchronous
  - The Panda server spawns python threads as soon as it receives HTTP requests and sends the responses back immediately; the threads do the heavy procedures (e.g., DB access) in the background → better throughput
- Several are synchronous
[Diagram: the client serializes a Python object with cPickle and sends it over HTTPS (x-www-form-urlencoded); on the server, the request passes through mod_deflate and mod_python to the UserIF, and the response is deserialized with cPickle back into a Python object.]
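A minimal sketch of that exchange, written in modern Python for brevity (the 2008 client used curl with a grid proxy): the request object is pickled, sent as a url-encoded form field over HTTPS, and the pickled response is decoded back into a Python object. The URL and field name are illustrative, not the real Panda interface.

```python
import pickle
import urllib.parse
import urllib.request

SERVER = "https://pandaserver.example.org:25443/server/panda"  # hypothetical URL

def call(method, obj):
    """Serialize -> HTTPS (x-www-form-urlencoded) -> deserialize."""
    # hex-encode the pickle so the form field stays ASCII-safe
    form = urllib.parse.urlencode({"obj": pickle.dumps(obj).hex()}).encode()
    # GSI authentication: in the real system the grid proxy supplies the
    # client certificate, which mod_gridsite checks on the server side.
    with urllib.request.urlopen(SERVER + "/" + method, data=form) as resp:
        return pickle.loads(bytes.fromhex(resp.read().decode()))
```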
Data Transfer
- Rely on ATLAS DDM
  - Panda sends requests to DDM; DDM moves the files and sends notifications back to Panda
  - Panda and DDM work asynchronously
- Dispatch input files to T2s and aggregate output files to the T1
- Jobs get 'activated' when all their input files have been copied, and pilots pick them up
  - Pilots don't have to wait for data arrival on the WNs
  - Data transfer and job execution can run in parallel
[Diagram: the submitter submits a job to Panda; Panda subscribes a T2 to the dispatch dataset in DQ2; the data transfer runs and DQ2 calls back to Panda; a pilot gets and runs the job; when the job finishes, Panda adds its files to the destination datasets and DQ2 transfers them onward.]
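The activation rule itself fits in a few lines; a sketch with invented helper names, not the real Panda/DQ2 interfaces:

```python
def on_dq2_callback(db, dispatch_dataset):
    """DQ2 calls back when a dispatch dataset has arrived at the T2."""
    for job in db.jobs_waiting_on(dispatch_dataset):
        if all(db.is_transferred(f) for f in job.input_files):
            job.status = "activated"          # pilots may now pick it up

def next_job_for_pilot(db, site):
    """A pilot's job request only ever sees activated jobs at its site."""
    return db.next_job(status="activated", site=site)
```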
Pilot and Autopilot (1/2)
- Autopilot is a scheduler that submits pilots to sites via condor-g/glidein (Pilot → Gatekeeper)
- Pilots are scheduled to the site batch system and pull jobs as soon as CPUs become available (Panda server → Job → Pilot)
- Pilot submission and job submission are different: the job is the payload for the pilot
Pilot and Autopilot (2/2)
- How the pilot works
  - Sends several parameters to the Panda server for job matching (HTTP request): CPU speed, available memory size on the WN, list of available ATLAS releases at the site
  - Retrieves an 'activated' job (HTTP response to the above request): activated → running
  - Runs the job immediately, because all input files should already be available at the site
  - Sends a heartbeat every 30 min
  - Copies the output files to the local SE and registers them in the Local Replica Catalogue
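Putting those steps together, a minimal pilot sketch; `server` and `node` are stand-ins for the HTTPS calls and the site-specific machinery, not the real pilot code:

```python
import threading

def pilot(server, node):
    # 1. Job matching: describe this worker node to the Panda server.
    job = server.get_job(cpu_speed=node.cpu_speed,
                         memory=node.memory,
                         releases=node.atlas_releases)
    if job is None:
        return                                # nothing activated for this site

    # 2. Heartbeat every 30 min so the server knows the job is alive.
    stop = threading.Event()
    def heartbeat():
        while not stop.wait(30 * 60):
            server.update_job(job, state="running")
    threading.Thread(target=heartbeat, daemon=True).start()

    # 3. Inputs were dispatched beforehand, so the job runs immediately.
    outputs = node.run(job)
    stop.set()

    # 4. Copy outputs to the local SE and register them in the LRC.
    node.copy_to_local_se(outputs)
    node.register_in_lrc(outputs)
    server.update_job(job, state="finished")
```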
Production vs Analysis
- Run on the same infrastructure
  - Same software, monitoring system, and facilities
  - No duplicated manpower for maintenance
- Separate computing resources
  - Different queues → different CPU clusters
  - Production and analysis don't have to compete with each other
- Different policies for data transfers
  - Analysis jobs don't trigger data transfer: jobs go to the sites which hold the input files
  - For production, input files are dispatched to T2s and output files are aggregated to the T1 via DDM asynchronously
  - Controlled traffic
Current PanDA production – Past Week
PanDA production – Past Month
MC Production 2006-07
ATLAS Data Management Software - Don Quijote
- The second generation of the ATLAS DDM system (DQ2)
  - DQ2 developers: M. Branco, D. Cameron, T. Maeno, P. Salgado, T. Wenaus, …
  - Initial idea and architecture were proposed by M. Branco and T. Wenaus
- DQ2 is built on top of Grid data transfer tools
  - Moved to a dataset-based approach
    - Datasets: an aggregation of files plus associated DDM metadata
    - The dataset is the unit of storage and replication
  - Automatic data transfer mechanisms using distributed site services
    - Subscription system
    - Notification system
- Current version 1.0
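The dataset notion is easy to make concrete; a sketch with illustrative fields, not the actual DQ2 schema:

```python
from dataclasses import dataclass, field

@dataclass
class Dataset:
    """An aggregation of files plus associated DDM metadata; the dataset,
    not the file, is the unit of storage and replication."""
    name: str
    files: set = field(default_factory=set)       # logical file names
    metadata: dict = field(default_factory=dict)  # DDM metadata
    version: int = 1
    frozen: bool = False                          # frozen: no further versions

    def add_files(self, lfns):
        if self.frozen:
            raise ValueError(self.name + " is frozen")
        self.files |= set(lfns)
        self.version += 1                         # every change is a new version
```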
DDM components
- DQ2 dataset catalog
- DQ2 "Queued Transfers"
- Local File Catalogs
- File Transfer Service
- DQ2 Subscription Agents
- DDM end-user tools: dq2_ls, dq2_get, dq2_cr (T. Maeno, BNL)
DDM Operations Mode
CERN
LYON
NG
BNL
FZK
RAL
CNAF
PIC
TRIUMF
SARA
ASGC
lapp
lpc
TokyoBeijing
Romania
grif T3
SWT2
GLT2NET2
WT2MWT2
T1T2 T3 VO box, dedicated computer
to run DDM services
US ATLAS DDM operations team :
BNL H.Ito, W.Deng,A.Klimentov,P.NevskiGLT2 S.McKee (MU)MWT2 C.Waldman (UC)NET2 S.Youssef (BU)SWT2 P.McGuigan (UTA)WT2 Y.Wei (SLAC)WISC X.Neng (WISC)
T1-T1 and T1-T2 associations according to GP
ATLAS Tiers associations.
All Tier-1s have predefined (software) channel with CERN and with each other.Tier-2s are associated with one Tier-1 and form the cloudTier-2s have predefined channel with the parent Tier-1 only.
LYON Cloud
BNL Cloud
TWT2
Melbourne
ASGC Cloud
wisc
Activities: Data Replication
- Centralized and automatic (according to the computing model)
- Simulated data
  - AOD/NTUP/TAG (current data volume ~1.5 TB/week)
  - BNL has a complete set of dataset replicas
  - US Tier-2s have defined what fraction of the data they will keep, from 30% to 100%
- Validation samples
  - Replicated to BNL for SW validation purposes
- Critical data replication
  - Database releases replicated to BNL from CERN and then from BNL to the US ATLAS T2s; the data volume is relatively small (~100 MB)
- Conditions data
  - Replicated to BNL from CERN
- Cosmic data
  - BNL requested 100% of cosmic data; data replicated from CERN to BNL and on to the US Tier-2s
- Data replication for individual groups, universities, and physicists
  - A dedicated web interface is set up
Data Replication to Tier 2s
You’ll never walk alone
[Plot: weekly throughput, 2.1 GB/s out of CERN. From Simone Campana.]
Subscriptions
- Subscription: a request for the full replication of a dataset (or dataset version) at a given site
- Requests are collected by the centralized subscription catalog
- They are then served by a set of agents: the site services
- Subscription on a dataset version: one-time-only replication
- Subscription on a dataset (see the sketch below)
  - Replication triggered on every new version detected
  - Subscription closed when the dataset is frozen
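The two flavours differ only in when the subscription closes; a sketch with invented names, not the DQ2 site-services code:

```python
def serve(sub, catalog, replicate):
    """One pass of the site services over a single subscription."""
    ds = catalog.lookup(sub.dataset_name)
    if sub.pinned_version is not None:
        # Subscription on a dataset *version*: one-time-only replication.
        replicate(ds, sub.pinned_version, sub.site)
        sub.close()
    else:
        # Subscription on a dataset: re-trigger on every new version...
        if ds.version != sub.last_seen_version:
            replicate(ds, ds.version, sub.site)
            sub.last_seen_version = ds.version
        # ...until the dataset is frozen.
        if ds.frozen:
            sub.close()
```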
Site Services
- Agent-based framework
- Goal: satisfy subscriptions
- Each agent serves a specific part of a request (as sketched after this list)
  - Fetcher: fetches new subscriptions from the subscription catalog
  - Subscription Resolver: checks if the subscription is still active, for new dataset versions, new files to transfer, …
  - Splitter: creates smaller chunks from the initial requests, identifies files requiring transfer
  - Replica Resolver: selects a valid replica to use as source
  - Partitioner: creates chunks of files to be submitted as a single request to the FTS
  - Submitter/PendingHandler: submit/manage the FTS requests
  - Verifier: checks the validity of files at the destination
  - Replica Register: registers the new replica in the local replica catalog
  - …
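That division of labour reads naturally as a pipeline; a sketch where each stage is one of the agents above, with invented request methods. The real agents run concurrently against shared state rather than in a single chain, so this shows only the logical order.

```python
PIPELINE = [
    lambda r: r.fetch_subscriptions(),        # Fetcher
    lambda r: r.resolve_new_files(),          # Subscription Resolver
    lambda r: r.split_into_chunks(),          # Splitter
    lambda r: r.pick_source_replicas(),       # Replica Resolver
    lambda r: r.partition_for_fts(),          # Partitioner
    lambda r: r.submit_to_fts(),              # Submitter/PendingHandler
    lambda r: r.verify_at_destination(),      # Verifier
    lambda r: r.register_local_replicas(),    # Replica Register
]

def run_site_services(request):
    # Each agent serves its specific part of the request in turn.
    for stage in PIPELINE:
        stage(request)
```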
Typical deployment
- Deployment at Tier0 similar to Tier1s
- LFC and FTS services at Tier1s
- SRM services at every site, including Tier2s
[Diagram: central catalogs, with site services deployed at each Tier0/Tier1.]
Interaction with the grid middleware
- File Transfer Service (FTS)
  - One deployed per Tier0 / Tier1 (matches the typical site services deployment)
  - Triggers the third-party transfer by contacting the SRMs; needs to be constantly monitored
- LCG File Catalog (LFC)
  - One deployed per Tier0 / Tier1 (matches the typical site services deployment)
  - Keeps track of the local file replicas at a site
  - Currently used as the main source of replica information by the site services
- Storage Resource Manager (SRM)
  - Once pre-staging comes into the picture
DDM - Current Issues and Plans
- Dataset deletion
  - Non-trivial, although critical
  - First implementation using a central request repository
  - Being integrated into the site services
- Dataset consistency
  - Between storage and the local replica catalogs
  - Between the local replica catalogs and the central catalogs
  - A lot of effort put into this recently: tracker, consistency service
- Pre-staging of data
  - Currently done just before file movement
  - Introduces high latency when a file is on tape
- Messaging
  - More asynchronous flow (less polling)
ADC Operations Shifts
- ATLAS Distributed Computing Operations Shifts (ADCoS)
  - World-wide shifts
  - To monitor all ATLAS distributed computing resources
  - To provide Quality of Service (QoS) for all data processing
  - Shifters receive official ATLAS service credit (OTSMoU)
- Additional information
  - http://indico.cern.ch/conferenceDisplay.py?confId=22132
  - http://indico.cern.ch/conferenceDisplay.py?confId=26338
Typical Shift Plan
- Browse recent shift history
- Check performance of all sites
  - File tickets for new issues
  - Continue interactions about old issues
- Check status of current tasks
  - Check all central processing tasks
  - Monitor analysis flow (not individual tasks)
  - Overall data movement
- File software (validation) bug reports
- Check Panda and DDM health
- Maintain an elog of shift activities
Shift Structure
- Shifter on call
  - Two consecutive days
  - Monitor, escalate, follow up
  - Basic manual interventions (site on/off)
- Expert on call
  - One week duration
  - Global monitoring
  - Advises the shifter on call
  - Major interventions (service on/off)
  - Interacts with other ADC operations teams
  - Provides feedback to the ADC development teams
- Tier 1 expert on call
  - Very important (e.g. Rod Walker, Graeme Stewart, Eric Lancon, …)
Shift Structure Schematic
by Xavier Espinal
ADC Inter-relations
[Diagram: ADCoS at the center, connected to:]
- Central Services: Birger Koblitz
- Operations Support: Pavel Nevski
- Tier 0: Armin Nairz
- DDM: Stephane Jezequel
- Distributed Analysis: Dietrich Liko
- Tier 1 / Tier 2: Simone Campana
- Production: Alex Read