
Page 1

ATLAS computing status in IHEP

Erming Pei, CC-IHEP. Yangzhou, May 15th, 2009

Page 2

Agenda

• Farm
• Grid
• Issues
• File System

Page 3

Farm

Page 4

Resource

• Old farm
  – SLC3: atlas02 + 8 cores
  – SLC4: autilas + 16 cores
  – Will be integrated into the new farm

• New farm
  – atlasui02 + 128 cores
  – New server, new release (in testing)

Page 5

Storage

File system                         Size   Used   Avail  Use%  Mounted on

• HOME
  AFS                               8.6G   0      8.6G   0%    /afs
  202.122.33.48:/home/atlas         932G   88G    845G   10%   /ihepbatch/home-atlas

• Software
  bjlcg2.ihep.ac.cn:/data/exp_soft  1.1T   327G   791G   30%   /ihepbatch/exp_soft
  autilas.ihep.ac.cn:/opt/atlassw   29G    16G    13G    57%   /opt/atlassw

• Data
  192.168.50.30:/atlas/data0        2.8T   543M   2.8T   1%    /ihepbatch/atlasdata0
  192.168.50.30:/atlas/data1        3.7T   1.4T   2.3T   38%   /ihepbatch/atlasdata1
  192.168.50.30:/atlas/data2        2.8T   1.3T   1.5T   47%   /ihepbatch/atlasdata2
  192.168.50.30:/atlas/data3        3.1T   512K   3.1T   1%    /ihepbatch/atlasdata3
  192.168.50.30:/atlas/data4        3.1T   512K   3.1T   1%    /ihepbatch/atlasdata4
  192.168.50.30:/atlas/data5        3.1T   512K   3.1T   1%    /ihepbatch/atlasdata5
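These figures are a df snapshot of the mounts above; current usage can be rechecked the same way, for example:

    df -h /ihepbatch/atlasdata1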

Page 6

Storage

[Diagram: storage layout. The login nodes atlasui02 and autilas (batch: Torque/Maui) see HOME on AFS (/home/atlas), ATLAS software from the Grid software repository, local data on the ATLAS disk server, and Grid data via the SE (DPM).]

Page 7

Software

• DQ2 end-user tools: /opt/atlassw/DQ2/endusers
• Ganga: 5.1.10 (updated by Lianyou)
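A minimal sketch of pulling a dataset with the DQ2 end-user tools from the path above; the setup script name and the dataset names are assumptions:

    # Set up the DQ2 end-user environment (script name is an assumption)
    source /opt/atlassw/DQ2/endusers/setup.sh
    # List datasets matching a pattern, then fetch one (names are illustrative)
    dq2-ls "user09.*"
    dq2-get user09.SomeUser.mydataset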

Page 8

Job management

• Server: Torque
• Scheduler: Maui
• Both are optimized

[Diagram: atlasui02 and autilas submit jobs to the Torque server, which is scheduled by Maui.]
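For day-to-day use, the usual client commands on a plain Torque/Maui setup look like this; the queue and script names are illustrative:

    qsub -q atlasq myjob.sh   # Torque: submit a job to the batch queue
    qstat -an                 # Torque: list jobs with their node assignments
    showq                     # Maui: show the scheduler's view of the queue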

Page 9

Job monitor

Page 10

Local DPM Access

• T3 → T2 DPM access failed for "rfio:/…" paths

• Reason:
  – Both CASTOR and DPM provide the rf* tools
  – Both use the same library name: libshift.so

• Solution:
  – Link the DPM library (libdpm.so) to the CASTOR library name (libshift.so), as sketched below
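A minimal sketch of the fix, assuming the Grid client libraries live under /opt/lcg/lib; the exact paths depend on the installation:

    # Make the rf* tools resolve libshift.so to the DPM client library
    # instead of the CASTOR one (paths are assumptions)
    ln -sf /opt/lcg/lib/libdpm.so /opt/lcg/lib/libshift.so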

Page 11

Tests with Athena 14.2.23

• Jobs:
  – Simulation jobs
  – Reconstruction jobs

• Tests:
  – Old farm
  – New farm
  – Front end
  – Back end
  – Interactive (directly on computing nodes)

Page 12

Grid

Page 13

GangaRobot

Page 14

Stress tests (GangaRobot)

Page 15

Panda Jobs

Page 16

Grid (Tier-2)

Page 17

Disk Usage

Page 18

Issues

• Many job failures in testing; only a few succeeded

• Conclusions:
  – I/O issue
    • Standardize job-submission operations
    • Move data from HOME space to the data disks
  – Most probably a problem with the new batch system (the latest version, Torque 2.4.1)
    • Will switch to other versions and test again
  – Next step
    • Separate the local software environment from the Grid one

Page 19

Issues

[Diagram: same storage layout as on Page 6, but with the local software area separated from the Grid software repository and exported over NFS.]

Page 20

Comments

• Standardize your operations

  – Put your input data under /atlas/datax1, or read it from DPM
  – Submit jobs from /home/atlas/xxx

    • AFS space is currently not supported for batch jobs

  – Put your output data under /atlas/datax2
  – Please don't mix HOME and data space
  – Add some debug statements to your script

    • e.g., add 'hostname' to your job script so you can tell which node your job ran on (see the sketch after this list)

• Insert intervals when submitting bulk jobs (see the sketch after this list)
• Data space

  – Public/private
  – Public datasets are classified by dataset name rather than by user name
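A minimal job-script sketch with the debug statements suggested above; the queue name and payload are hypothetical:

    #!/bin/bash
    #PBS -q atlasq                       # queue name is an assumption
    echo "Running on node: $(hostname)"  # the suggested debug statement
    echo "Start time: $(date)"
    cd /ihepbatch/home-atlas/$USER/work  # illustrative working directory
    ./run_athena.sh                      # hypothetical payload
    echo "End time: $(date)"

And a sketch of inserting intervals between bulk submissions; the job count and the 10-second pause are illustrative:

    for i in $(seq 1 100); do
        qsub myjob.sh   # hypothetical job script
        sleep 10        # pause between submissions to ease the load on the server
    done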

Page 21

File System

Page 22

NFS

Page 23

[Diagram: Lustre layout with the MDS server and the disk servers.]

Page 24

Lustre stress test (1)

• 600 BES analysis jobs ran for 8 hours with no problems; read performance was stable at 800 MB/s

Page 25

Lustre stress test (2)

• 256 concurrent dd write jobs ran for one day with no problems; write performance was stable at 350 MB/s
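One such dd write job might look like the following; the target path, block size, and file size are assumptions:

    # Write a 10 GB file of zeros; $$ keeps per-job file names unique
    dd if=/dev/zero of=/lustre/test/ddfile.$$ bs=1M count=10240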

Page 26

Real-application tests

Page 27

Test method

• Two dedicated test queues, btq1 and btq2, were set up on the cluster, each with 300 CPUs; each queue includes 2-CPU, 4-CPU, and 8-CPU compute nodes

• 300, 250, 200, 150, 100, and 50 analysis jobs were submitted to each of the two queues in turn

• The analysis jobs in the two queues processed data files on the Lustre and GPFS file systems respectively (mostly reads with a small amount of writes)

• While the jobs were running, we monitored the compute nodes' efficiency and network traffic, as well as the file servers' network traffic

• Compute-node efficiency was taken from the CPU user utilization
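A sketch of how one such round could be driven, assuming plain qsub against the two test queues; the job script name is hypothetical:

    # Submit N analysis jobs to each dedicated test queue
    N=150
    for q in btq1 btq2; do
        for i in $(seq 1 $N); do
            qsub -q $q analysis_job.sh   # hypothetical job script
        done
    done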

Page 28

Test results: CPU utilization

[Chart: CPU utilization (%) of 8-CPU compute nodes vs. number of concurrently running jobs (300, 250, 200, 150, 100, 50), comparing lustre-8cpu and gpfs-8cpu.]

Page 29

Test results: network traffic

[Chart: network traffic (MB/s) of 8-CPU compute nodes vs. number of concurrently running jobs (300, 250, 200, 150, 100, 50), comparing lustre-8cpu and gpfs-8cpu.]

Page 30

Conclusions

• Under the current conditions, 150 concurrent analysis jobs run well: CPU utilization reaches over 60%

• Extrapolation: to run 1500 analysis jobs concurrently with high efficiency, a parallel file system backed by about 30 file servers would be needed

Page 31

Questions?