Statistics of CAF usage, Interaction with the GRID
Marco MEONI, CERN – Offline Week – 11.07.2008


Page 1: Statistics of CAF usage,  Interaction with the GRID

Statistics of CAF usage, Interaction with the GRID

Marco MEONI, CERN – Offline Week – 11.07.2008

Page 2: Statistics of CAF usage,  Interaction with the GRID

Outline

CAF Usage and Users’ grouping

Disk monitoring

Datasets

CPU Fairshare monitoring

User query

Conclusions & Outlook

Page 3: Statistics of CAF usage,  Interaction with the GRID

CERN Analysis Facility

Cluster of 40 machines, in operation for two years: 80 CPUs, 8 TB of disk pool

35 machines as PRO partition, 5 as DEV

Head node is xrootd redirector and PROOF master

Other nodes are xrootd data servers and PROOF slaves

Page 4: Statistics of CAF usage,  Interaction with the GRID

CAF Usage

Available resources on the CAF must be used fairly

Highest attention to how disks and CPUs are used

Users are grouped: at present, by sub-detector and physics working group (PWG)

Users can belong to several groups (PWG has precedence over sub-detector; a sketch of this rule follows below)

Each group:
– has a disk space quota, used to stage datasets from AliEn
– has a CPU fairshare target (priority) to regulate concurrent queries
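To make the grouping rule above concrete, here is a minimal Python sketch of the PWG-over-sub-detector precedence; the USER_GROUPS mapping, the user names, and the resolve_group() helper are illustrative, not the actual CAF configuration.

    # Users may belong to several groups; a PWG takes precedence over a
    # sub-detector group when accounting a query. Mapping is hypothetical.
    USER_GROUPS = {
        "alice_user": ["TPC", "PWG2"],  # hypothetical user in two groups
        "other_user": ["ITS"],
    }

    def resolve_group(user):
        """Return the accounting group: a PWG if any, else the first group."""
        groups = USER_GROUPS.get(user, [])
        for g in groups:
            if g.startswith("PWG"):  # PWG has precedence over sub-detector
                return g
        return groups[0] if groups else "default"

    print(resolve_group("alice_user"))  # -> PWG2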

Page 5: Statistics of CAF usage,  Interaction with the GRID

CAF Groups

Group       #Users   Disk quota (GB)   CPU quota (%)
PWG0             5              1000              10
PWG1             1              1000              10
PWG2            21              1000              10
PWG3             8              1000              10
PWG4            17              1000              10
EMCAL            1                 –              10
HMPID            1                 –              10
ITS              3                 –              10
T0               1                 –              10
MUON             3                 –              10
PHOS             1                 –              10
TPC              2                 –              10
TOF              1                 –              10
ZDC              1                 –              10
proofteam        5               100              10
testusers       40                 –              10
marco            1               200              10
COMMON           1              2000              10
(– = no disk quota listed)

These are not absolute quotas.

18 registered groups, ~60 registered users, yet 165 users have used the CAF: please register to a group!

Page 6: Statistics of CAF usage,  Interaction with the GRID

Resource Monitoring

• ML ApMon running on each node
– Sends monitoring information every minute (see the sender sketch below)
– Default monitoring (load, CPU, memory, swap, disk I/O, network)
– Additional information:
• PROOF and disk server status (xrootd/olbd)
• Number of PROOF sessions (proofd master)
• Number of queued staging requests and hosted files (DS manager)
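As an illustration of the per-node sender, a minimal sketch assuming the MonALISA ApMon Python client is available; the destination address, cluster/node names, and metric values are assumptions, and a real sender would query proofd/xrootd instead of returning constants.

    import time
    import apmon  # MonALISA ApMon Python client (assumed installed)

    apm = apmon.ApMon(["monalisa.cern.ch:8884"])  # hypothetical destination

    def collect_metrics():
        # Placeholder values; the real sender reads proofd/xrootd state.
        return {"proof_sessions": 4, "staging_queue": 12, "hosted_files": 309}

    while True:
        apm.sendParameters("PROOF::CAF", "lxb6047.cern.ch", collect_metrics())
        time.sleep(60)  # monitoring information is sent every minute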

Page 7: Statistics of CAF usage,  Interaction with the GRID

Status Table

Page 8: Statistics of CAF usage,  Interaction with the GRID

Hosted files and Disk Usage

[Per-node tables of hosted file counts and disk pool usage (KB) on lxb6047–lxb6080, for raw and sim data]

#Raw files: 11k (total 10992), #Sim files: 54k (total 53665)
Raw on disk: 154 GB, Sim on disk: 4.5 TB

ESDs from the RAW data production are ready to be staged

Page 9: Statistics of CAF usage,  Interaction with the GRID

Interaction with the GRID

• Datasets (DS) are used to stage files from AliEn
• A DS is a list of files (usually ESDs or archives) registered by users for processing with PROOF
• DSs may share the same physical files
• The staging script issues new staging requests and touches files every 5 mins (see the sketch below)
• Files are uniformly distributed by the xrootd data manager
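A minimal sketch of such a staging loop, with hypothetical stage() and touch() helpers standing in for the actual xrootd staging and touch operations; the file paths are illustrative.

    import time

    def stage(path):
        # Hypothetical: issue a staging request to bring the file to disk.
        print("staging request:", path)

    def touch(path):
        # Hypothetical: refresh the file's access time so it stays cached.
        print("touch:", path)

    def staging_pass(dataset_files, on_disk):
        for f in dataset_files:
            (touch if f in on_disk else stage)(f)

    dataset_files = ["/alien/sim/run82XX/001/AliESDs.root",  # illustrative
                     "/alien/sim/run82XX/002/AliESDs.root"]
    on_disk = {dataset_files[0]}

    while True:
        staging_pass(dataset_files, on_disk)
        time.sleep(300)  # the script runs every 5 minutes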

Page 10: Statistics of CAF usage,  Interaction with the GRID

Dataset Manager

• The DS manager takes care of the quotas at file level (a quota-check sketch follows below)

• The physical location of files is regulated by xrootd

• The DS manager daemon sends:
• The overall number of files
• The number of new, touched, disappeared, and corrupted files
• Staging requests
• Disk utilization for each user and for each group
• The number of files on each node and the total size
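A minimal sketch of a file-level quota check in this spirit; the File record, the quota excerpt, and can_stage() are illustrative, not the DS manager's actual implementation.

    from dataclasses import dataclass

    @dataclass
    class File:
        path: str
        size_gb: float
        group: str

    # Excerpt of the group disk quotas from the table above (GB).
    GROUP_QUOTA_GB = {"PWG0": 1000, "PWG3": 1000, "COMMON": 2000}

    def group_usage_gb(files, group):
        return sum(f.size_gb for f in files if f.group == group)

    def can_stage(files, new_file):
        """Accept a staging request only if the group stays within quota."""
        quota = GROUP_QUOTA_GB.get(new_file.group, 0)
        return group_usage_gb(files, new_file.group) + new_file.size_gb <= quota

    staged = [File("/pool/run82XX/001/AliESDs.root", 0.3, "PWG0")]
    print(can_stage(staged, File("/pool/run82XX/002/AliESDs.root", 0.3, "PWG0")))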

Page 11: Statistics of CAF usage,  Interaction with the GRID

Dataset Monitoring
– PWG1 is using 0% of 1 TB
– PWG3 is using 5% of 1 TB

Page 12: Statistics of CAF usage,  Interaction with the GRID

Datasets List (name | #files | default tree | #events | size | staged %)
• /COMMON/COMMON/ESD5000_part | 1000 | /esdTree | 100000 | 50 GB | 100 %
• /COMMON/COMMON/ESD5000_small | 100 | /esdTree | 10000 | 4 GB | 100 %
• /COMMON/COMMON/run15034_PbPb | 967 | /esdTree | 939 | 500 GB | 97 %
• /COMMON/COMMON/run15035_PbPb | 962 | /esdTree | 952 | 505 GB | 98 %
• /COMMON/COMMON/run15036_PbPb | 961 | /esdTree | 957 | 505 GB | 99 %
• /COMMON/COMMON/run82XX_part1 | 10000 | /esdTree | 999500 | 289 GB | 99 %
• /COMMON/COMMON/run82XX_part2 | 10000 | /esdTree | 922600 | 289 GB | 92 %
• /COMMON/COMMON/run82XX_part3 | 10000 | /esdTree | 943100 | 288 GB | 94 %
• /COMMON/COMMON/sim_160000_esd | 95 | /esdTree | 9400 | 267 MB | 98 %
• /PWG0/COMMON/run30000X_10TeV_0.5T | 2167 | /esdTree | 216700 | 90 GB | 100 %
• /PWG0/COMMON/run31000X_0.9TeV_0.5T | 2162 | /esdTree | 216200 | 57 GB | 100 %
• /PWG0/COMMON/run32000X_10TeV_0.5T_Phojet | 2191 | /esdTree | 219100 | 83 GB | 100 %
• /PWG0/COMMON/run33000X_10TeV_0T | 2191 | /esdTree | 219100 | 108 GB | 100 %
• /PWG0/COMMON/run34000X_0.9TeV_0T | 2175 | /esdTree | 217500 | 65 GB | 100 %
• /PWG0/COMMON/run35000X_10TeV_0T_Phojet | 2190 | /esdTree | 219000 | 98 GB | 100 %
• /PWG0/phristov/kPhojet_k5kG_10000 | 100 | /esdTree | 1100 | 4 GB | 11 %
• /PWG0/phristov/kPhojet_k5kG_900 | 97 | /esdTree | 2000 | 4 GB | 20 %
• /PWG0/phristov/kPythia6_k5kG_10000 | 99 | /esdTree | 1600 | 4 GB | 16 %
• /PWG0/phristov/kPythia6_k5kG_900 | 99 | /esdTree | 1100 | 4 GB | 11 %
• /PWG2/COMMON/run82XX_test4 | 10 | /esdTree | 1000 | 297 MB | 100 %
• /PWG2/COMMON/run82XX_test5 | 10 | /esdTree | 1000 | 297 MB | 100 %
• /PWG2/akisiel/LHC500C0005 | 100 | /esdTree | 97 | 663 MB | 100 %
• /PWG2/akisiel/LHC500C2030 | 996 | /esdTree | 995 | 4 GB | 99 %
• /PWG2/belikov/40825 | 1355 | /HLTesdTree | 1052963 | 143 GB | 99 %
• /PWG2/hricaud/LHC07f_160033DataSet | 915 | /esdTree | 91400 | 2 GB | 99 %
• /PWG2/hricaud/LHC07f_160038_root_archiveDataSet | 862 | /esdTree | 86200 | 449 GB | 100 %
• /PWG2/jgrosseo/sim_1600XX_esd | 33568 | /esdTree | 3293900 | 103 GB | 98 %
• /PWG2/mvala/PDC07_pp_0_9_82xx_1 | 99 | /rsnMVTree | 990000 | 1 GB | 100 %
• /PWG2/mvala/RSNMV_PDC06_14TeV | 677 | /rsnMVTree | 6442101 | 24 GB | 100 %
• /PWG2/mvala/RSNMV_PDC07_09_part1 | 326 | /rsnMVTree | 2959173 | 5 GB | 100 %
• /PWG2/mvala/RSNMV_PDC07_09_part1_new | 326 | /rsnMVTree | 2959173 | 5 GB | 100 %
• /PWG2/pganoti/FirstPhys900Field_310000 | 1088 | /esdTree | 108800 | 28 GB | 100 %
• /PWG3/arnaldi/PDC07_LHC07g_200314 | 615 | /HLTesdTree | 45000 | 787 MB | 94 %
• /PWG3/arnaldi/PDC07_LHC07g_200315 | 594 | /HLTesdTree | 42600 | 744 MB | 95 %
• /PWG3/arnaldi/PDC07_LHC07g_200316 | 366 | /HLTesdTree | 30700 | 513 MB | 99 %
• /PWG3/arnaldi/PDC07_LHC07g_200317 | 251 | /HLTesdTree | 20100 | 333 MB | 100 %
• /PWG3/arnaldi/PDC08_170167_001 | 1 | N/A | 33 MB | 0 %
• /PWG3/arnaldi/PDC08_LHC08t_170165 | 976 | /HLTesdTree | 487000 | 4 GB | 99 %
• /PWG3/arnaldi/PDC08_LHC08t_170166 | 990 | /HLTesdTree | 495000 | 4 GB | 100 %
• /PWG3/arnaldi/PDC08_LHC08t_170167 | 975 | /HLTesdTree | 424500 | 8 GB | 87 %
• /PWG3/arnaldi/myDataSet | 975 | /HLTesdTree | 424500 | 8 GB | 87 %
• /PWG4/anju/myDataSet | 946 | /esdTree | 94500 | 27 GB | 99 %
• /PWG4/arian/jetjet15-50 | 9817 | /esdTree | 973300 | 630 GB | 99 %
• /PWG4/arian/jetjetAbove_50 | 94 | /esdTree | 8000 | 7 GB | 85 %
• /PWG4/arian/jetjetAbove_50_real | 958 | /esdTree | 90500 | 73 GB | 94 %
• /PWG4/elopez/jetjet15-50_28000x | 7732 | /esdTree | 739800 | 60 GB | 95 %
• /PWG4/elopez/jetjet50_r27000x | 8411 | /esdTree | 793100 | 92 GB | 94 %

• Jury produced Pt spectrum plots staging his own DS (run #40825, TPC+ITS, field on)

• Start staging common DSs of reconstructed runs?

~4.7 GB used out of 6 GB (34 × 200 MB, minus 10%)

Page 13: Statistics of CAF usage,  Interaction with the GRID

CPU Fairshare

• Usages retrieved every 5 mins, averaged every 6 hours
• New priorities are computed by applying a correction formula over the band [α·quota .. β·quota], with α = 0.5, β = 2 (implemented in the sketch below)

f(x) = q + q·e^(k·x), with k = (1/q)·ln(1/4)

[Plot: new priority f(x) versus CPU usage x, for quota q = 20%: the curve is bounded by priorityMax = 40% (= β·q) at 0% usage and priorityMin = 10% (= α·q); y-axis runs up to 100%]
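A direct transcription of this formula into Python, as a sketch: usage x and quota q are taken as CPU fractions (e.g. 0.10 for 10%), and the result is clamped to the [α·quota, β·quota] band; the clamping step is an assumption about how the bounds are applied.

    import math

    ALPHA, BETA = 0.5, 2.0  # from the slide

    def new_priority(x, q):
        k = math.log(1.0 / 4.0) / q       # k = (1/q) * ln(1/4)
        f = q + q * math.exp(k * x)       # f(x) = q + q * e^(k*x)
        return min(max(f, ALPHA * q), BETA * q)

    # At zero usage a group gets the maximum boost (beta * quota); as usage
    # grows, the priority decays toward the quota.
    print(new_priority(0.00, 0.10))  # 0.20
    print(new_priority(0.10, 0.10))  # 0.125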

Page 14: Statistics of CAF usage,  Interaction with the GRID

Priority Monitoring

• Priorities are used for CPU fairshare and converge to the quotas
• Usages are averaged to gracefully converge to the quotas
• If there is no competition, users get the maximum number of CPUs
• Only relative priorities are modified!

Page 15: Statistics of CAF usage,  Interaction with the GRID

CPU quotas in practice

– Only PWGs and the default group are shown
– The default group usually has the highest usage

Page 16: Statistics of CAF usage,  Interaction with the GRID

Query Monitoring

– When a user query completes, the PROOF master sends statistics:
• Read bytes

• Consumed CPU time (base for CPU fairshare)

• Number of processed events

• User waiting time

– Values are aggregated per user and per group (see the aggregation sketch below)
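A minimal sketch of this per-user and per-group aggregation; the QueryStats record and its field names are illustrative, not the PROOF master's actual message format.

    from collections import defaultdict
    from dataclasses import dataclass

    @dataclass
    class QueryStats:
        user: str
        group: str
        read_bytes: int
        cpu_time_s: float  # base for CPU fairshare
        events: int
        wait_time_s: float

    def aggregate(queries, key):
        """Sum the four statistics per user ('user') or per group ('group')."""
        totals = defaultdict(lambda: [0, 0.0, 0, 0.0])
        for s in queries:
            t = totals[getattr(s, key)]
            t[0] += s.read_bytes
            t[1] += s.cpu_time_s
            t[2] += s.events
            t[3] += s.wait_time_s
        return dict(totals)

    queries = [QueryStats("marco", "proofteam", 10**9, 120.0, 50000, 30.0)]
    print(aggregate(queries, "group"))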

Page 17: Statistics of CAF usage,  Interaction with the GRID

Query Monitoring

[Charts: query statistics shown both accumulated and per interval]

Page 18: Statistics of CAF usage,  Interaction with the GRID

Outlook

• User session monitoring
– On average 4 to 7 sessions in parallel (daytime, EU hours), with peaks of 15 to 20 users during tutorial sessions: running history still missing
– Need to monitor the number of workers per user once load-based scheduling is introduced

• Additional monitoring per single query (disk used and files/sec not implemented yet)

• Network
– Traffic correlation among nodes
– Xrootd activity with the new bulk staging requests

• Debug
– Tool to monitor and kill a hanging session when Reset doesn't work (currently requires a cluster restart)

• Hardware
– New ALICE Mac cluster "ready" (16 workers)
– New IT 8-core machines coming

• Training
– PROOF/CAF is the key setup for interactive user analysis (and more)
– The number of people attending the monthly tutorial is increasing (20 people last week!)