Nurcan Ozturk University of Texas at Arlington SCHOOL ON HEP@TR-GRID April 30 – May 2, 2008 Turkish Atomic Energy Authority (TAEA), Ankara, Turkey Distributed Analysis With pathena


Nurcan Ozturk

University of Texas at Arlington

SCHOOL ON HEP@TR-GRID

April 30 – May 2, 2008

Turkish Atomic Energy Authority (TAEA), Ankara, Turkey

Distributed Analysis With pathena


Outline

Part I – Information on pathena

Introduction

How pathena works

What type of jobs pathena can run

pathena usage

What happens when submitting jobs

pathena options

Monitoring pathena jobs

Bookkeeping & retry

User support

Part II – pathena Tutorial

Based on the “Distributed Analysis on Panda” TWiki page: https://twiki.cern.ch/twiki/bin/view/Atlas/DAonPanda


Part I – Information on pathena


Introduction

PanDA = Production ANd Distributed Analysis system

pathena: client tool for PanDA

• Submits user-defined jobs

• A consistent user interface for Athena users

• Works in the Athena runtime environment

• Runs at the sites in OSG and LCG

Requirements to run pathena jobs:

• Athena: any release version, kit or AFS

• GRID User Interface (UI): LCG UI, VDT, NG UI

• Join ATLAS VO: all ATLAS VO members


How pathena Works - Datasets

[Diagram: a job (production or analysis) applies a transformation to the files of an input dataset (official or user dataset) and writes the files of an output dataset (user dataset).]

• No essential difference between production jobs and analysis jobs.

• Analysis jobs run on the same infrastructure as production jobs. The infrastructure is always maintained by the production operations team.

• User datasets can be accessed via DDM (using DQ2 end-user tools).

• Analysis jobs and production jobs run on separate computing resources.

• Analysis queues: short queue (wall-time limit = 90 min), long queue (wall-time limit = 8 hours).

• Analysis jobs don’t have to compete with production jobs.
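The two wall-time limits quoted for the analysis queues suggest a simple rule of thumb for deciding where a job fits. The sketch below is only an illustration: the function name `pick_analysis_queue` and the idea of passing an estimated run time in minutes are assumptions made here, not part of pathena; only the 90-minute and 8-hour limits come from the slide.

```python
# Illustrative helper, not part of pathena. The limits are taken from the
# slide; the function name and return values are hypothetical conventions.
SHORT_LIMIT_MIN = 90       # short analysis queue wall-time limit
LONG_LIMIT_MIN = 8 * 60    # long analysis queue wall-time limit


def pick_analysis_queue(expected_minutes):
    """Return 'short' or 'long' for a job's expected wall time in minutes."""
    if expected_minutes <= 0:
        raise ValueError("expected wall time must be positive")
    if expected_minutes <= SHORT_LIMIT_MIN:
        return "short"
    if expected_minutes <= LONG_LIMIT_MIN:
        return "long"
    raise ValueError("job exceeds the long-queue limit; split it into smaller jobs")


if __name__ == "__main__":
    for minutes in (30, 90, 200):
        print(minutes, "->", pick_analysis_queue(minutes))
```

A job that fits neither queue would have to be split, for example with the per-job file or event options listed in the pathena synopsis.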


What Type of Jobs pathena Can Run

pathena can run all kinds of Athena jobs:

• All production steps: event generation, simulation, pileup, digitization, reconstruction, merge, analysis

• Arbitrary package configuration: add new packages, modify cmt/requirements in any package

• Customize source code and/or jobOption

• Multiple-input streams, for instance signal + minimum-bias

• TAG/AANT-based analysis

• Protection against corrupted/missing files


pathena Usage

When running athena:

$ athena MyJobOptions.py

All you need to do is:

$ pathena MyJobOptions.py --inDS inputDataset --outDS outputDataset

Nothing special is needed to submit your Athena job to the GRID using pathena: replace athena with pathena and add --inDS and --outDS.

The user doesn’t have to modify the jobOption file when submitting jobs.

Jobs go to the data, so you need to know where your data is. Analysis jobs don’t trigger data transfer across GRIDs.
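The athena-to-pathena substitution can be sketched as a small helper that builds the argument list. This is a minimal illustration, assuming Python 3.8 or later for `shlex.join`; the function name `build_pathena_command` is hypothetical, and only the `--inDS` and `--outDS` options shown on the slide are used.

```python
import shlex


def build_pathena_command(job_options, in_ds, out_ds):
    """Return the pathena argument list for a job that ran as `athena <jobO>`."""
    # Same jobOption file as for athena, plus the input and output datasets.
    return ["pathena", job_options, "--inDS", in_ds, "--outDS", out_ds]


if __name__ == "__main__":
    cmd = build_pathena_command("MyJobOptions.py", "inputDataset", "outputDataset")
    # shlex.join quotes each argument so the line can be pasted into a shell.
    print(shlex.join(cmd))
```

Keeping the command as a list rather than a single string avoids quoting problems if it is later handed to `subprocess.run`.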


pathena Options

pathena [--inDS InputDataset] [--outDS OutputDataset] [--minDS MinimumBiasDataset] [--cavDS CavernDataset] [--split N]

[--site SiteName] [--nfiles N] [--nFilesPerJob N] [--nEventsPerJob N]

[--nSkipFiles N] [--official] [--extFile files] [--libDS LibraryDataset] [--long] [--blong] [--nobuild] [--memory MemorySize] [--tmpDir tmpDirName]

[--shipInput] [--fileList files] [--addPoolFC files] [--skipScan]

[--removeFileList filename] [--inputFileList filename] [--inputType types]

[-p bootstrap] [-c command] <jobOption1.py> [<jobOption2.py> [...]]

See what each of these options does on the twiki page: https://twiki.cern.ch/twiki/bin/view/Atlas/DAonPanda#options


What Happens When Submitting Jobs

pathena performs these steps on submission:

• archive user's work directory

• send the archive to Panda

• extract job configuration from jobOs

• define jobs automatically

• submit jobs

Two kinds of jobs are then run:

• build job: builds the Athena environment at the remote site and produces a library dataset

• run job: runs Athena and produces the output files
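The first submission step, archiving the work directory, can be pictured with the standard `tarfile` module. This is a sketch under stated assumptions: the slide does not describe pathena's actual archive format or which files it selects, so the gzip tarball, the function name `archive_work_directory`, and the file names below are all hypothetical.

```python
import os
import tarfile
import tempfile


def archive_work_directory(work_dir, archive_path):
    """Pack work_dir into a gzip tarball and return the sorted member names."""
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname="." stores paths relative to the work directory itself.
        tar.add(work_dir, arcname=".")
    with tarfile.open(archive_path, "r:gz") as tar:
        return sorted(tar.getnames())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        work = os.path.join(tmp, "run")
        os.makedirs(work)
        with open(os.path.join(work, "MyJobOptions.py"), "w") as handle:
            handle.write("# placeholder job options\n")
        # The archive is written next to the work directory, not inside it.
        print(archive_work_directory(work, os.path.join(tmp, "sources.tar.gz")))
```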


Monitoring pathena Jobs

http://pandamon.usatlas.bnl.gov:25880/server/pandamon/query


Bookkeeping & Retry

pathena has utilities to show the status and details of submitted jobs and to retry the failed ones, for instance:

$ pathena_util

>>> show()

======================================

JobID : 8239

time : 2008-04-29 03:29:07

inDS : fdr08_run1.0003051.StreamEgamma.merge.AOD.o1_r6_t1

outDS : user.NurcanOzturk.HighPtView.StreamEgamma.AtlasInAnkara

libDS : user.NurcanOzturk.lxplus205_0.lib._008239

build : 10676339

run : 10676340

jobO : -c "Mode=['FullReco'];DetailLevel=['FullStandardAOD'];Branches= ['StacoTauRec']" MyJobOptions.py

site : ANALY_SWT2_CPB

>>> status(8239)

>>> retry(8239)

>>> kill(8239)

>>> help()

Press Ctrl-D to exit
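A record printed by `show()` is a list of `key : value` lines, so it is easy to turn into a dictionary for further scripting. The parser below is a minimal sketch that works on text copied from the screen; it is not a pathena_util API, and the name `parse_job_record` is hypothetical.

```python
SAMPLE_RECORD = """\
======================================
JobID : 8239
time : 2008-04-29 03:29:07
inDS : fdr08_run1.0003051.StreamEgamma.merge.AOD.o1_r6_t1
outDS : user.NurcanOzturk.HighPtView.StreamEgamma.AtlasInAnkara
libDS : user.NurcanOzturk.lxplus205_0.lib._008239
build : 10676339
run : 10676340
site : ANALY_SWT2_CPB
"""


def parse_job_record(text):
    """Parse 'key : value' lines into a dict, splitting on the first colon only."""
    record = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip separator lines such as '======'
        key, value = line.split(":", 1)  # the time value contains colons too
        record[key.strip()] = value.strip()
    return record


if __name__ == "__main__":
    job = parse_job_record(SAMPLE_RECORD)
    print(job["JobID"], job["site"], job["time"])
```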


User Support

PanDA Savannah page – report problems/errors:

https://savannah.cern.ch/projects/panda/

PanDA/pathena hypernews forum: discussions, sharing experience, helping each other:

https://hypernews.cern.ch/HyperNews/Atlas/get/pandaPathena.html

See the production shift elog (electronic logbook) for system-wide or site-level problems, maintenance, and outages. It is linked from the main PanDA monitor page:

http://atlas003.uta.edu:8080/ADCoS/?mode=summary


Part II - Pathena Tutorial

Available at: http://www.usatlas.bnl.gov/twiki/bin/view/AtlasSoftware/PathenaOnFDRData


Goal

Learn how to submit an analysis job:

Set up Athena

Check out PandaTools package (for pathena)

Use HighPtView package as an analysis package

Find the data (we will run on FDR data)

Find out which analysis queue will be used

Submit a pathena job

Monitor job’s status in PanDA monitor

Get the output of the job and make plots


Setup Athena and Work Area

Instructions are given to run on lxplus machines at CERN.

Create a directory (called AtlasInAnkara) and get the requirements file from the next page.

Make a sub-directory for 13.0.40 (called 13.0.40) under AtlasInAnkara.

Set up CMT:

$ source /afs/cern.ch/sw/contrib/CMT/v1r20p20070208/mgr/setup.sh

$ cmt config

Set up Athena for release 13.0.40 (32 is the compiler version, gcc323):

$ source setup.sh -tag=13.0.40,32,groupArea

Check out the Tools/Scripts package to set up your work area (an easy way of checking out and compiling multiple packages):

$ cd 13.0.40

$ cmt co -r Scripts-00-01-14 Tools/Scripts

Set up the work area and create the run area:

$ ./Tools/Scripts/share/setupWorkArea.py

$ cd WorkArea/cmt

$ cmt bro cmt config

$ cmt bro gmake

$ source setup.sh


Example File - requirements

#############################################################
set CMTSITE CERN
set SITEROOT /afs/cern.ch
macro ATLAS_DIST_AREA ${SITEROOT}/atlas/software/dist

macro ATLAS_GROUP_AREA "/afs/cern.ch/atlas/groups/PAT/Tutorial/EventViewGroupArea/EVTags-13.0.40.323"

apply_tag simpleTest
apply_tag oneTest

macro ATLAS_TEST_AREA "" \
   13.0.40 "${HOME}/public/AtlasInAnkara/13.0.40"

use AtlasLogin AtlasLogin-* $(ATLAS_DIST_AREA)
############################################################


Check Out Necessary Packages

Check out PandaTools for pathena (cd to the 13.0.40 directory first):

$ cmt co PhysicsAnalysis/DistributedAnalysis/PandaTools

Check out the HighPtView package:

$ cmt co -r HighPtView-00-01-10 PhysicsAnalysis/HighPtPhys/HighPtView

Check out EventViewConfiguration:

$ cmt co -r EventViewConfiguration-00-01-13 PhysicsAnalysis/EventViewBuilder/EventViewConfiguration


Compile and Make a jobOption File

Run this every time new packages are checked out:

$ ./Tools/Scripts/share/setupWorkArea.py

It prints:

################################################################################
WorkAreaMgr : INFO Creating a WorkArea CMT package under: [/afs/cern.ch/user/n/nozturk/public/AtlasInAnkara/13.0.40]
WorkAreaMgr : INFO Scanning [/afs/cern.ch/user/n/nozturk/public/AtlasInAnkara/13.0.40]
WorkAreaMgr : INFO Found 4 packages in WorkArea
WorkAreaMgr : INFO => 0 package(s) in suppression list
WorkAreaMgr : INFO Generation of WorkArea/cmt/requirements done [OK]
WorkAreaMgr : INFO ################################################################################

Compile the PandaTools package from WorkArea:

$ cd WorkArea/cmt

$ cmt bro cmt config

$ cmt bro gmake

$ source setup.sh

Go to the run area and get the jobOption file from the HighPtView package:

$ cd ../run

$ get_files HighPtViewNtuple_topOptions.py

Make a jobOption file for the details of the job, called MyJobOptions.py. See the next page for an example file.


Example File – MyJobOptions.py

import os

print os.environ["CMTPATH"]

InserterConfiguration={} # Always need this line
InserterConfiguration["Electron"]={} # Need such for every item you will modify
InserterConfiguration["Electron"]["FullReco"]=[{"Name":"ElMedium"}]

#DoTrigger=True
TriggerView=True
include("HighPtView/HighPtViewNtuple_topOptions.py")
include("AthenaPoolCnvSvc/ReadAthenaPool_jobOptions.py")
ServiceMgr.PoolSvc.SortReplicas=True
from DBReplicaSvc.DBReplicaSvcConf import DBReplicaSvc
ServiceMgr+=DBReplicaSvc()
ServiceMgr.DBReplicaSvc.UseCOOLSQLite=False
# fix for stream and DPDs by Attila
InserterConfiguration.update({ "CommonParameters": { "DoPreselection":False, "CheckOverlap":False } })


Setup Grid and DQ2, Find FDR Datasets

Set up Grid:

$ source /afs/cern.ch/project/gd/LCG-share/current/etc/profile.d/grid_env.sh

Set up DQ2:

$ source /afs/cern.ch/atlas/offline/external/GRID/ddm/endusers/setup.sh.CERN

Look at the available FDR datasets at Tier2’s from the Panda monitor: http://gridui02.usatlas.bnl.gov:25880/server/pandamon/query?mode=listFDR

Pick one dataset:

fdr08_run1.0003051.StreamEgamma.merge.AOD.o1_r6_t1

One can also list the replicas for a given dataset:

$ source /afs/usatlas.bnl.gov/Grid/Don-Quijote/DQ2_0_3_client/dq2.sh

$ dq2-list-dataset-replicas fdr08_run1.0003051.StreamEgamma.merge.AOD.o1_r6_t1

INCOMPLETE: DESY-ZN

COMPLETE: BNLXRDHDD1,SARA-MATRIX_DATADISK,RAL-LCG2_DATADISK,IN2P3-CC_DATADISK,RALPP,SLACXRD,LIP-LISBON,TAIWAN-LCG2_DATADISK,NDGF-T1_DATADISK,IFICDISK,WISC,TOKYO-LCG2_DATADISK,MWT2_IU,LIV,ICL,PIC_DATADISK,BU_DDM,TIER0TAPE,INFN-T1_DATADISK,DESY-HH,JINR,CYF,IJST2,TRIUMF-LCG2_DATADISK,FZK-LCG2_DATADISK,TORON,PNPI,AGLT2_SRM,BNL-OSG2_DATADISK,SWT2_CPB,LNF,TW-FTT,OU,MWT2_UC
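Since jobs go to the data, the replica listing is what decides where a job can run. The sketch below splits such a listing into complete and incomplete sites. It works on text copied from the terminal and assumes one `INCOMPLETE:` and one `COMPLETE:` line with comma-separated site names, as printed above; the function name `parse_replicas` is hypothetical and the sample is a shortened excerpt of the real list.

```python
SAMPLE_LISTING = """\
INCOMPLETE: DESY-ZN
COMPLETE: BNLXRDHDD1,SLACXRD,WISC,BU_DDM,AGLT2_SRM,SWT2_CPB,OU,MWT2_UC
"""


def parse_replicas(output):
    """Return {'INCOMPLETE': [...], 'COMPLETE': [...]} from the listing text."""
    replicas = {"INCOMPLETE": [], "COMPLETE": []}
    for line in output.splitlines():
        label, sep, sites = line.partition(":")
        label = label.strip()
        if sep and label in replicas:
            # Drop empty entries that a trailing comma would produce.
            replicas[label] = [s.strip() for s in sites.split(",") if s.strip()]
    return replicas


if __name__ == "__main__":
    found = parse_replicas(SAMPLE_LISTING)
    print("complete replicas at:", ", ".join(found["COMPLETE"]))
```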


Name Association Between DDM and Analysis Queue Names

Analysis queues available in the US (for more queues see the next page):

DDM Name      Analysis Queue Name
SWT2_CPB      ANALY_SWT2_CPB
OU            ANALY_OU_OCHEP_SWT2
AGLT2_SRM     ANALY_AGLT2
MWT2_UC       ANALY_MWT2
SLACXRD       ANALY_SLAC
BU_DDM        ANALY_NET2
WISC          ANALY_GLOW-ATLAS
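The name association amounts to a lookup table: given the DDM sites that hold a complete replica, it yields the analysis queue names that can be passed to `--site`. The mapping below copies the seven US entries from the slide; the function name `queues_for_replicas` is hypothetical, and sites without an entry are simply skipped.

```python
# DDM site name -> analysis queue name, copied from the table on the slide.
DDM_TO_QUEUE = {
    "SWT2_CPB": "ANALY_SWT2_CPB",
    "OU": "ANALY_OU_OCHEP_SWT2",
    "AGLT2_SRM": "ANALY_AGLT2",
    "MWT2_UC": "ANALY_MWT2",
    "SLACXRD": "ANALY_SLAC",
    "BU_DDM": "ANALY_NET2",
    "WISC": "ANALY_GLOW-ATLAS",
}


def queues_for_replicas(ddm_sites):
    """Map DDM site names to analysis queues, skipping sites not in the table."""
    return [DDM_TO_QUEUE[site] for site in ddm_sites if site in DDM_TO_QUEUE]


if __name__ == "__main__":
    # BNLXRDHDD1 holds a replica but has no entry in this table.
    print(queues_for_replicas(["BNLXRDHDD1", "SWT2_CPB", "OU"]))
```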


Analysis Queues from Panda Monitor


Run pathena (1)

Run pathena with a one-line command:

$ pathena -c "Mode=['FullReco'];DetailLevel=['FullStandardAOD']; Branches= ['StacoTauRec']" MyJobOptions.py --inDS fdr08_run1.0003051.StreamEgamma.merge.AOD.o1_r6_t1 --outDS user.NurcanOzturk.HighPtView.StreamEgamma.AtlasInAnkara --nfiles 1 --site ANALY_SWT2_CPB

HighPtView options (passed with -c): "Mode=['FullReco'];DetailLevel=['FullStandardAOD'];Branches=['StacoTauRec']"

pathena options:

• Specify the input dataset by --inDS

• Specify the output dataset by --outDS

• Specify the number of files to be run on by --nfiles 1

• Specify the analysis queue name by --site siteName

More pathena options are available at: https://twiki.cern.ch/twiki/bin/view/Atlas/DAonPanda#synopsis
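The string given to `-c` is a sequence of Python assignments separated by semicolons, which makes it easy to mistype the quotes. A minimal sketch of generating it from a dictionary follows; the function name `highptview_command_string` is hypothetical, and the option names and values are the ones used in the command above.

```python
def highptview_command_string(options):
    """Render {'Mode': ['FullReco'], ...} as "Mode=['FullReco'];..." for -c."""
    # repr() of a list of strings yields the ['...'] form used on the slide.
    return ";".join(f"{name}={value!r}" for name, value in options.items())


if __name__ == "__main__":
    options = {
        "Mode": ["FullReco"],
        "DetailLevel": ["FullStandardAOD"],
        "Branches": ["StacoTauRec"],
    }
    argv = [
        "pathena", "-c", highptview_command_string(options), "MyJobOptions.py",
        "--inDS", "fdr08_run1.0003051.StreamEgamma.merge.AOD.o1_r6_t1",
        "--outDS", "user.NurcanOzturk.HighPtView.StreamEgamma.AtlasInAnkara",
        "--nfiles", "1", "--site", "ANALY_SWT2_CPB",
    ]
    print(argv)
```

Passing `argv` as a list to `subprocess.run` would avoid a second layer of shell quoting around the `-c` string.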


Run pathena (2)

The following will be printed on the screen:

Your identity: /DC=org/DC=doegrids/OU=People/CN=Nurcan Ozturk 155817
Enter GRID pass phrase for this identity:
Creating proxy ........................................... Done
Your proxy is valid until: Tue Apr 29 15:24:55 2008
extracting run configuration
ConfigExtractor > No Input
ConfigExtractor > Output=AANT EVAANtupleDump0Stream AANT0
archive sources
archive InstallArea
post sources/jobO
query files in dataset:fdr08_run1.0003051.StreamEgamma.merge.AOD.o1_r6_t1
submit
===================
 JobID : 8239
 Status : 0
  > build
    PandaID=10676339
  > run
    PandaID=10676340

The build job builds the Athena environment at the remote site and produces a library dataset.

The run job runs Athena and produces the output files.
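The JobID and the two PandaIDs at the end of the submission output are the values needed later for monitoring and for `pathena_util`. A regular-expression sketch for pulling them out of captured text is shown below. The exact line layout of the sample is an assumption, because the slide text was flattened; the patterns tolerate the IDs being on the same line or on the following one. The function name `parse_submission` is hypothetical.

```python
import re

SAMPLE_OUTPUT = """\
submit
===================
 JobID : 8239
 Status : 0
  > build
    PandaID=10676339
  > run
    PandaID=10676340
"""


def parse_submission(output):
    """Extract the JobID and the build/run PandaIDs from the printed text."""
    job_id = re.search(r"JobID\s*:\s*(\d+)", output)
    # \s+ also matches a line break between '> build' and 'PandaID=...'.
    panda_ids = re.findall(r">\s*(build|run)\s+PandaID=(\d+)", output)
    result = {"JobID": int(job_id.group(1)) if job_id else None}
    result.update({kind: int(panda_id) for kind, panda_id in panda_ids})
    return result


if __name__ == "__main__":
    print(parse_submission(SAMPLE_OUTPUT))
```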


Monitor Job’s Status in PanDA Monitor (1)


Monitor Job’s Status in PanDA Monitor (2)

Go to the “List users” link at the top right corner of the PanDA monitor: http://gridui02.usatlas.bnl.gov:25880/server/pandamon/query?ui=users&sort=latest


Monitor Job’s Status in PanDA Monitor (3)


Monitor Job’s Status in PanDA Monitor (4)


Examine Log Files In Case Of Problems


Retrieve Results and Make Plots

Use dq2 client tools to retrieve the output dataset:

$ dq2_get -rv user.NurcanOzturk.HighPtView.StreamEgamma.AtlasInAnkara

This copies the output files:

user.NurcanOzturk.HighPtView.StreamEgamma.AtlasInAnkara.AANT0._00001.root

user.NurcanOzturk.HighPtView.StreamEgamma.AtlasInAnkara._10676340.log.tgz

Open the file in ROOT and make some plots:

$ root user.NurcanOzturk.HighPtView.StreamEgamma.AtlasInAnkara.AANT0._00001.root

root [1] FullRec0->GetListOfLeaves()->Print();

root [2] FullRec0->Draw("El_N", "El_N>0");

root [3] FullRec0->Draw("El_p_T", "El_N>0");

root [4] FullRec0->Draw("Jet_C4_N", "Jet_C4_N>0");

root [5] FullRec0->Draw("Jet_C4_p_T", "Jet_C4_N>0");


Some Plots – Number of Electrons and Transverse Momentum of Electrons


HighPtView DPD’s Made From FDR-1 Data

Alden and Amir at UTA made DPD’s using HighPtView package on all FDR data for SWT2 physics analysis groups.

If you are interested in looking at them, you can list them with dq2_ls and retrieve them with dq2_get:

$ dq2_ls user.AldenStradling.fdr08*HPTV_NOR (overlap removal off)

$ dq2_ls user.AldenStradling.fdr08*HPTV_OR (overlap removal on)


Future Developments with pathena

Automatic redirection of analysis jobs within a cloud. That is, there will be no need to specify a site: pathena will choose the best site based on data availability and available CPUs.


References

FDR datasets available at Tier2’s: http://gridui02.usatlas.bnl.gov:25880/server/pandamon/query?mode=listFDR

pathena wiki page “Distributed Analysis on Panda”: https://twiki.cern.ch/twiki/bin/view/Atlas/DAonPanda

How to submit the same pathena job on multiple datasets: https://twiki.cern.ch/twiki/bin/view/Atlas/DAonPanda#example_6_re_submit_the_same_ana

HighPtView wiki page: https://twiki.cern.ch/twiki/bin/view/Atlas/HighPtView

Wiki pages by Akira Shibata on FDR Analysis: https://twiki.cern.ch/twiki/bin/view/Atlas/TopFDR

https://twiki.cern.ch/twiki/bin/view/Atlas/TopFdrPanda