Summary Distributed Data Analysis Track F. Rademakers, S. Dasu, V. Innocente CHEP06 TIFR, Mumbai


CHEP06, 17 Feb 2006, Fons Rademakers

Outline

• Introduction
• Distributed Analysis Systems
• Submission Systems
• Bookkeeping Systems
• Monitoring Systems
• Data Access Systems
• Miscellaneous
• Conveners’ impressions

We have only 20 min for the summary and therefore cannot do justice to all talks


Track Statistics

Lies, damn lies and statistics: there were 23 talks
• number of cancellations: 2
• number of no-shows: 1
• average attendance: 25
• minimum attendance: 12
• maximum attendance: 55
• average duration of talks: 23 min
• equipment failures: 1 (laser pointer)
• average outside temperature: 31 C
• average room temperature: 21 C


What Was This Track All About?

Analysis Systems
• DIAL
• Ganga
• PROOF
• CRAB

Submission Systems
• ProdSys
• BOSS
• DIRAC
• PANDA

Bookkeeping Systems
• JobMon
• BOSS
• BbK

Monitoring Systems
• DashBoard
• JobMon
• BOSS
• MonaLisa

Data Access Systems
• xrootd
• SAM

Miscellaneous
• Go4
• ARDA
• Grid Simulations
• AJAX Analysis


Data Analysis Systems

The different analysis systems presented, categorized by experiment:

ALICE: PROOF
ATLAS: DIAL, GANGA
CMS: CRAB, PROOF
LHCb: GANGA

All systems support, or plan to support, parallelism
Except for PROOF all systems achieve parallelism via job splitting and serial batch submission (job level parallelism)


Classical Parallel Data Analysis

“Static” use of resources
Jobs frozen, 1 job / CPU
“Manual” splitting, merging
Limited monitoring (end of single job)
Possible large tail effects

[Figure: batch farm workflow. Labels: catalog, query, files, data file splitting, jobs, myAna.C, submit, queues, manager, Batch farm, Storage, outputs, merging, final analysis]

From PROOF System by Ganis [98]


Interactive Parallel Data Analysis

Farm perceived as extension of local PC
More dynamic use of resources
Automated splitting and merging
Real time feedback
Much better control of tail effects

[Figure: interactive farm workflow. Labels: catalog, query, files, query: data file list, myAna.C, scheduler, MASTER, Interactive farm, Storage, feedbacks (merged), final outputs (merged)]

From PROOF System by Ganis [98]
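
The difference between the two models, a fixed split with one job per CPU versus a master handing out work on demand, can be illustrated with a toy simulation. This is a sketch only: the worker speeds, the packet size and all function names are made up for illustration, and it is not PROOF's actual packetizer.

```python
import heapq


def static_makespan(n_events, speeds):
    """Classical model: events are split evenly up front, one job per CPU.

    The run ends only when the slowest job finishes (the tail effect).
    """
    share = n_events / len(speeds)
    return max(share / s for s in speeds)


def dynamic_makespan(n_events, speeds, packet=1000):
    """Interactive model: a master hands out small packets on demand,
    so faster workers simply end up processing more packets."""
    # Heap of (time at which the worker becomes free, worker index).
    free_at = [(0.0, i) for i in range(len(speeds))]
    heapq.heapify(free_at)
    remaining = n_events
    finish = 0.0
    while remaining > 0:
        t, i = heapq.heappop(free_at)      # next worker to become idle
        n = min(packet, remaining)
        remaining -= n
        done = t + n / speeds[i]
        finish = max(finish, done)
        heapq.heappush(free_at, (done, i))
    return finish


# Hypothetical farm: three workers at 1000 events/s and one at 250 events/s.
speeds = [1000.0, 1000.0, 1000.0, 250.0]
static_t = static_makespan(800_000, speeds)
dynamic_t = dynamic_makespan(800_000, speeds)
print(f"static split : {static_t:.0f} s")
print(f"dynamic split: {dynamic_t:.0f} s")
```

With the static split the slow worker dictates the total time; with on-demand packets the total stays close to the combined throughput of the farm.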

From Prototype of a Parallel Analysis System for CMS using PROOF by I. González

CMSProof – Time & Speedup Measurements

Real analysis used to select top quark pair production events with a tau (needs to be reconstructed), a lepton and two b quarks in the final state

Processing ~800K events…
• In 1 CPU ~4 hours (only event loop)
• In 80 CPUs ~4 minutes (only event loop)

Initialisation time (~3 minutes):

Authentication is done on all slaves, even if unused
• Therefore not dependent on the number of slaves used

Remote environment setting

Code uploading and compilation
• Only done for newer code
• First time this takes quite some time

TChain initialisation
• Very long for very distributed chains

Run time scales close to the ideal 1/Ncpu
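
As a rough cross-check of the quoted figures, the speedup and parallel efficiency can be worked out directly. The inputs are the rounded "~" values from the slide, so the results are indicative only. Adding the initialisation time to the parallel run alone is an assumption made here for illustration.

```python
def speedup(t_serial_min, t_parallel_min):
    """Ratio of serial to parallel run time."""
    return t_serial_min / t_parallel_min


T1 = 4 * 60.0   # ~4 hours on 1 CPU, in minutes (event loop only)
T80 = 4.0       # ~4 minutes on 80 CPUs (event loop only)
INIT = 3.0      # ~3 minutes initialisation
NCPU = 80

ideal_t80 = T1 / NCPU                    # ideal 1/Ncpu scaling
loop_speedup = speedup(T1, T80)          # event loop only
loop_efficiency = loop_speedup / NCPU
# Assumption: initialisation is charged to the parallel run only.
wall_speedup = speedup(T1, T80 + INIT)

print(f"ideal time on {NCPU} CPUs: {ideal_t80:.1f} min")
print(f"event-loop speedup: {loop_speedup:.0f}x "
      f"(efficiency {loop_efficiency:.0%})")
print(f"speedup including initialisation: {wall_speedup:.1f}x")
```

Under these assumptions the fixed initialisation cost, not the event loop, dominates the loss once the loop itself takes only a few minutes.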


DIAL: Distributed Interactive Analysis of Large Datasets

A useful DIAL system has been deployed for ATLAS
• Common analysis transformations
• Access to current data
• For AOD to histograms and large samples, 15 times faster than a single process
• Easy to use ROOT interface
• Web-based monitoring
• Packaged datasets, applications and example tasks

Demonstrated viability of remote processing
• Via Condor-G or PANDA
• Need interactive queues at remote sites
  • With corresponding gatekeeper or DIAL service
  • Or improve PANDA responsiveness

From DIAL by Adams [39]

[Figure: DIAL 1.30 AOD processing time, 2/10/06 (from D. Adams, DIAL, CHEP06, February 13, 2006, ATLAS). Time (sec), 0 to 3600, versus thousands of events, 0 to 1200. Reference curves: single job, (single job)/10, 100 MB/s, 50 MB/s, 10k events. Runs: 8feb-lfast-nfs-100, 9feb-lshort-nfs-100, 9feb-cgfast-nfs-100, 9feb-panda-nfs-100, 10feb-lfast-nfs-100, 10feb-lfast-nfs-50, 10feb-lfast-nfs-20]


Ganga

Designed for data analysis on the Grid
• LHCb will do all its analysis on T1s
• T2s are mostly for simulation

System should not be general – we know all main use cases
• Use prior knowledge
• Identified use pattern

Aid user in bookkeeping aspects
• Keeping track of many individual jobs

Developed in cooperation between LHCb and ATLAS

From LHCb Experiences by Egede [317]





CRAB

Makes it easy to create a large number of user analysis jobs
• Assumes all jobs are the same except for some parameters (event number to be accessed, output file name…)

Allows efficient access to distributed data
• Hides WLCG middleware complications. All interactions are transparent for the end user

Manages job submission, tracking, monitoring and output harvesting
• The user doesn’t have to know how to interact with sometimes complicated grid commands
• Leaves time to get a coffee …

Uses BOSS as Grid independent submission engine

From CRAB by Corvo [273]
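
The idea that all jobs are identical except for a few parameters can be sketched as follows. This is an illustration only, not CRAB's configuration format or API: the `JobSpec` fields, the `split_jobs` helper and the output naming scheme are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobSpec:
    """The per-job parameters; everything else is shared by all jobs."""
    index: int
    first_event: int
    n_events: int
    output_file: str


def split_jobs(total_events, events_per_job, output_stem="out"):
    """Cut an event range into consecutive jobs of at most events_per_job."""
    if events_per_job <= 0:
        raise ValueError("events_per_job must be positive")
    jobs = []
    first = 0
    while first < total_events:
        n = min(events_per_job, total_events - first)
        index = len(jobs) + 1
        jobs.append(JobSpec(index, first, n, f"{output_stem}_{index}.root"))
        first += n
    return jobs


jobs = split_jobs(total_events=2500, events_per_job=1000)
for job in jobs:
    print(job)
```

The last job simply takes whatever events remain, so no event is skipped or processed twice.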


Some statistics (from Marco Corvo, CERN/CNAF, CHEP ’06 Mumbai)

[Figure: Most accessed sites since July 05]

[Figure: CRAB jobs so far]

See also D. Spiga: CRAB Usage and jobs-flow Monitoring (DDA-252)


Submission Systems

The different submission systems, categorized by experiment:

ALICE: AliEn (not presented)
ATLAS: ProdSys, PanDA
CMS: BOSS
LHCb: DIRAC

These systems are the DDA launch vehicles for the Grid based batch analysis solutions


ATLAS Strategy

ATLAS will use all three main Grids:
• LCG/EGEE
• OSG
• NorduGrid

ProdSys was developed to provide seamless access to all ATLAS grid resources

At this point emphasis on batch model to implement the ATLAS Computing model
• Interactive solutions are difficult to realize on top of the current middleware layer

We expect our users to send large batches of short jobs to optimize their turnaround
• Scalability
• Data Access

From ATLAS Strategy by Liko [263]


[Figure: ATLAS ProdSys. Labels: ProdDB, Dulcinea, Lexor, CondorG (CG), PANDA, RB, CE]




BOSS

Batch Object Submission System
• A tool for batch job submission, real time monitoring and bookkeeping
• Interfaced to many schedulers, both local and grid
• Utilizes a relational database for persistency
• Full logging and bookkeeping information stored
• Job commands: submit, kill, query and output retrieval
• Custom job types can be defined, allowing monitoring specific to the submitted application
• Significant new functionality identified and being actively integrated into BOSS

From Evolution of BOSS by Wakefield [240]


BOSS Workflow

[Figure: BOSS workflow. Labels: boss submit, boss query, boss kill, BOSS, BOSS DB, Scheduler, farm nodes, Wrapper]

User specifies job parameters, including:
• Executable name
• Executable type (turns on customized monitoring)
• Output files to retrieve (for sites without shared file system and grid)

User tells BOSS to submit jobs, specifying the scheduler, e.g. PBS, LSF, SGE, Condor, LCG, gLite etc.

Job consists of job wrapper, real time monitoring service and the user's executable.

From Evolution of BOSS by Wakefield [240]


DIRAC

From Stuart K. Paterson, CHEP 2006 (13th–17th February 2006), Mumbai, India

Introduction to DIRAC

The DIRAC Workload & Data Management System (WMS) is made up of Central Services and Distributed Agents

Realizes PULL scheduling paradigm
• Agents are requesting jobs whenever the corresponding resource is available
• Execution environment is checked before job is delivered to WN

Service Oriented Architecture masks underlying complexity
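
The pull paradigm described above can be sketched in a few lines. This is a toy model and not the DIRAC API: `CentralTaskQueue`, `Agent` and the capability sets are hypothetical names, and the execution environment check is reduced to a simple set inclusion test.

```python
from collections import deque


class CentralTaskQueue:
    """Central service: holds waiting jobs until an agent asks for one."""

    def __init__(self):
        self._waiting = deque()

    def submit(self, job_id, requirements):
        self._waiting.append((job_id, frozenset(requirements)))

    def request_job(self, capabilities):
        """Return the first waiting job this resource can run, or None.

        The environment is checked before the job is handed over, so a
        job is never delivered to a resource that cannot run it.
        """
        for position, (job_id, required) in enumerate(self._waiting):
            if required <= capabilities:
                del self._waiting[position]
                return job_id
        return None


class Agent:
    """Distributed agent: pulls work only when its resource is free."""

    def __init__(self, name, capabilities):
        self.name = name
        self.capabilities = frozenset(capabilities)
        self.ran = []

    def poll(self, queue):
        job_id = queue.request_job(self.capabilities)
        if job_id is not None:
            self.ran.append(job_id)
        return job_id


queue = CentralTaskQueue()
queue.submit("job-1", {"root"})
queue.submit("job-2", {"gaudi"})
queue.submit("job-3", {"root"})

site_a = Agent("site-A", {"root"})
site_b = Agent("site-B", {"root", "gaudi"})

first = site_a.poll(queue)    # site-A is free and asks for work
second = site_b.poll(queue)   # site-B can also run the gaudi job
third = site_a.poll(queue)
fourth = site_a.poll(queue)   # nothing left that site-A can run
print(first, second, third, fourth)
```

Nothing is pushed to a site: the central queue only answers requests, which is what lets the agents absorb differences between resources.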

[Figure: Comparison of 1 and 10 Users for Multi-Threaded Mode (from Stuart K. Paterson, CHEP 2006). Number of jobs, 0 to 400, versus start time (mins), 0 to 33, with curves for Multi-Threaded 1 User and Multi-Threaded 10 Users. Same number of jobs with fewer users. Two cases: 1 user submitting 1000 jobs; 10 users submitting 100 jobs each]


Conclusions

The DIRAC API provides a simple yet powerful tool for users

Access to LCG resources is provided in a simple and transparent way

DIRAC Multi-Threaded and Filling modes show significant reductions in job start times

Also reduce the load on LCG

Workload management on the level of the user is effective

Can be more powerful on the level of the VO

DIRAC infrastructure for distributed analysis is in place

Now have real users


Data Access Systems

The different data access systems that were presented:

SAM
• Used by CDF in its CAF environment

xrootd server
• Used by BaBar, ALICE, STAR
• All BaBar sites run xrootd, extensive deployment experience
• Winner of the SC05 throughput test
• Performs better than even the developers had expected or hoped for

xrootd client
• Many improvements in the xrootd client side code
• Reduce latencies using asynchronous read ahead, client side caching and asynchronous opens
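
The latency-hiding idea behind asynchronous read-ahead and client side caching can be sketched as below. This is a generic illustration, not the xrootd client code: `ReadAheadReader` and `fake_remote_fetch` are hypothetical, the cache is unbounded, and only one block is prefetched at a time.

```python
from concurrent.futures import ThreadPoolExecutor


class ReadAheadReader:
    """Block reader that caches blocks and prefetches the next one
    in a background thread while the caller works on the current one."""

    def __init__(self, fetch_block, n_blocks):
        self._fetch = fetch_block      # callable: block index -> bytes
        self._n_blocks = n_blocks
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._blocks = {}              # block index -> Future (the cache)

    def _schedule(self, index):
        # Start a fetch unless the block is cached, in flight or out of range.
        if 0 <= index < self._n_blocks and index not in self._blocks:
            self._blocks[index] = self._pool.submit(self._fetch, index)

    def read(self, index):
        if not 0 <= index < self._n_blocks:
            raise IndexError(index)
        self._schedule(index)          # no-op if cached or already in flight
        self._schedule(index + 1)      # asynchronous read-ahead
        return self._blocks[index].result()

    def close(self):
        self._pool.shutdown(wait=True)


calls = []


def fake_remote_fetch(index):
    """Stand-in for a remote read; records which blocks were requested."""
    calls.append(index)
    return bytes([index]) * 4


reader = ReadAheadReader(fake_remote_fetch, n_blocks=3)
data = [reader.read(i) for i in range(3)]
again = reader.read(1)                 # served from the client side cache
reader.close()
print(sorted(calls))
```

Each block is fetched from the "remote" side once only; a sequential reader finds the next block already requested by the time it asks for it.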

I/O Bandwidth (wide area network)

SC2005 BW Challenge
• Latency Bandwidth

8 xrootd Servers
• 4@SLAC & 4@Seattle
• Sun V20z w/ 10Gb NIC
• Dual 1.8/2.6GHz Opterons
• Linux 2.6.12

1,024 Parallel Clients
• 128 per server

35Gb/sec peak
• Higher speeds killed router
• 2 full duplex 10Gb/s links
• Provided 26.7% overall BW

BW averaged 106Gb/sec
• 17 Monitored links total

[Figure: BW Challenge throughput, SLAC to Seattle and Seattle to SLAC. Labels: ESnet routed, ESnet SDN layer 2 via USN]

http://xrootd.slac.stanford.edu

http://www-iepm.slac.stanford.edu/monitoring/bulk/sc2005/hiperf.html

xrootd Server Scaling

Linear scaling relative to load
Allows deterministic sizing of server
• Disk
• NIC
• CPU
• Memory

Performance tied directly to hardware cost
Underlying hardware & software are critical



Acknowledgments

A big thank you to the organizers
And to the speakers for the high quality talks
Especially those whose talks were not properly summarized

Hope to see you all at CHEP07 to see how the Distributed Data Analysis Systems have evolved


Conveners’ Observations

Distributed Data Analysis tools are of strategic importance
• GANGA, DIAL, CRAB, PROOF, …
• They can be a real differentiator
• There is a large development activity going on in this area
• However, none of these tools have yet been exposed to the expected large number of final analysis users

Development of a plethora of grid independent access layers
• DIRAC, BOSS, AliEn, PanDA, …
• Gap between the grid middleware capabilities and user needs, especially data location, placement and bookkeeping services, left room for this activity
• Although appropriate now, convergence to one or two tools is desired

CPU and data intensive portion of analysis is most suited for the grid
• Skimming and organized “rootTree making” is enabled by these DDA tools
• Advantage of adapting production style tools to analysis
• Can one adapt other stuff from production toolbox? Bookkeeping?
  • Avoid arcane work-group level bookkeeping that is common currently

Interactive analysis on grid with its large latencies
• PROOF is taking advantage of co-located CPUs for interactive analysis
  • In the era of multi-core CPUs this is only natural
  • Provides incremental data merging for prompt feedback to users
• Most DDA tools coupled to high-latency batch systems aren’t quite capable
• Block reservation of co-located nodes, a la Condor MPI Universe, may enable PROOF capabilities over the grid

High throughput AND low latency storage access critical for analysis
• Attention to performance boosting by deferred opens, caching and read-ahead by xrootd team is encouraging
