




BESIII Production with Distributed Computing
Xiaomei Zhang, Tian Yan, Xianghu Zhao

Institute of High Energy Physics, Chinese Academy of Sciences, Beijing 100049, P. R. China

Introduction

• BESIII is an experiment studying tau-charm physics at an electron-positron collider. About 3 PB of data have been generated in the last 5 years.

• It has about 400 collaboration members from 52 countries.

• The distributed computing system for BESIII was established in 2012, based on DIRAC. It now collects about 3,000 CPU cores and 400 TB of storage from 10 sites.

• Three large-scale MC production tasks have been completed, with more than 150,000 jobs finished successfully.

Computing Model

• Raw data taking is at IHEP.
• IHEP is the central site for raw data processing, bulk reconstruction and analysis.
• Remote sites are used for MC production and analysis.
• Random trigger data is transferred to remote sites by the data transfer system between SEs (see the sketch below).
• Simulation data produced at remote sites is written directly to the IHEP SE by the jobs.
• DST data is transferred to remote sites for particular analyses.
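The SE-to-SE replication step can be pictured with the DIRAC data-management client. The following is a minimal sketch only, assuming a configured DIRAC client for the BES VO; the LFN, the destination SE name and the client class (ReplicaManager in older DIRAC releases, DataManager in newer ones) are illustrative assumptions, not the actual BESIII transfer system.

```python
# Minimal sketch (not the production transfer system): replicate one random
# trigger file from the central SE at IHEP to a remote site's SE and register
# the new replica in the DIRAC File Catalog.  LFN and SE names are placeholders.
from DIRAC.Core.Base import Script
Script.parseCommandLine(ignoreErrors=True)       # initialise the DIRAC environment

from DIRAC.DataManagementSystem.Client.DataManager import DataManager

dm = DataManager()
lfn = '/bes/randomtrg/round05/run_0030000.raw'   # hypothetical logical file name
result = dm.replicateAndRegister(lfn, 'JINR-USER')   # hypothetical destination SE

if result['OK']:
    print('Replica created: %s' % result['Value'])
else:
    print('Replication failed: %s' % result['Message'])
```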

Problems and Plans

• Problems:
  • Small sites find it difficult to maintain robust SEs, so jobs have to download random trigger data from the centre, which causes a high load on the central SE and inefficient jobs.
  • Frequent access to random trigger data in split-by-event jobs causes a heavy load on the file systems at remote sites.
  • Manpower at sites is limited, so monitoring is important for both cloud and normal sites.
• Plans:
  • Study cloud storage to provide high-performance and robust central storage for common data access among sites.
  • Develop a user-friendly monitoring system to ease usage and administration.

MC Production with the System

Batch  Time     Type             Events   Jobs
1      2012.11  Psi(3770)        200 M    ~12,000
2      2013.09  J/psi inclusive  800 M    ~40,000
3      2013.12  Psi(3770)        1350 M   ~100,000

• Three large-scale MC productions have been carried out.
• More than 150k jobs are done.
• 10 sites joined the production.
• Max running jobs reached 1,400.
• About 15 TB of output data were generated and transferred back to IHEP.

System I: Jobs & Data

Job Management System
• DIRAC as middleware
• BESDIRAC as the VO-specific extension
• Jobs are pulled to remote sites by pilot agents (see the sketch below)
• The user front end is Ganga with a BOSS extension
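For illustration, a job can also be described directly with the DIRAC Python API rather than through the Ganga/BOSS front end. This is a minimal sketch only: the executable, options file, sandbox contents and CPU-time value are assumptions, and the submission method is named submit in older DIRAC releases.

```python
# Minimal sketch of describing and submitting a simulation-style job through
# the DIRAC API.  Executable, job options and output patterns are placeholders;
# production BESIII jobs are submitted via Ganga with the BOSS extension.
from DIRAC.Core.Base import Script
Script.parseCommandLine(ignoreErrors=True)

from DIRAC.Interfaces.API.Dirac import Dirac
from DIRAC.Interfaces.API.Job import Job

job = Job()
job.setName('bes_mc_sim_example')
job.setExecutable('boss.exe', arguments='jobOptions_sim.txt')  # hypothetical run
job.setInputSandbox(['jobOptions_sim.txt'])
job.setOutputSandbox(['*.log'])
job.setCPUTime(86400)

dirac = Dirac()
result = dirac.submitJob(job)   # pilots at the sites later pull the job to run it
print(result)
```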

Data Management System
• The DFC is used as the file and metadata catalog, with dataset functionality built on top of it (see the sketch below)
• A dataset-based transfer system has been developed; transfers reach 10 TB/day when deploying random trigger data to remote sites' SEs
• StoRM + Lustre is chosen as the central storage solution for integrating grid and local data
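The dataset functionality rests on metadata queries against the DFC. The sketch below only shows the kind of findFilesByMetadata lookup involved; the metadata fields, values and the /bes catalog path are hypothetical, not the actual BESDIRAC dataset implementation.

```python
# Minimal sketch of a metadata query against the DIRAC File Catalog (DFC),
# the kind of lookup underlying dataset definitions.  Field names and values
# are illustrative placeholders.
from DIRAC.Core.Base import Script
Script.parseCommandLine(ignoreErrors=True)

from DIRAC.Resources.Catalog.FileCatalog import FileCatalog

fc = FileCatalog()
meta = {'dataType': 'mc', 'resonance': 'psi3770', 'bossVer': '6.6.4'}  # hypothetical fields
result = fc.findFilesByMetadata(meta, path='/bes')

if result['OK']:
    print('%d files match the query' % len(result['Value']))
else:
    print('Query failed: %s' % result['Message'])
```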

Site            CPU Cores   SE Capacity
IHEP.cn         264         214 TB
UCAS.cn         152         -
USTC.cn         200~1280    24 TB
PKU.cn          100         -
WHU.cn          100~300     39 TB
SDU.cn          150         -
UMN.us          768         50 TB
JINR.ru         100~200     30 TB
INFN-Torino.it  250         30 TB
SJTU.cn         100         -

System II: Cloud

• Cloud resources are integrated into the system using VMDIRAC.
• The virtual machines are scheduled by VMDIRAC, and each virtual machine builds a DIRAC job execution environment automatically through contextualization.
• Virtual machines are created when there are job requests and destroyed when no more jobs are running (see the sketch below).
• Everything is transparent to the end user, and user jobs run properly on cloud sites.
• 6 clouds have joined the system, including 3 OpenStack and 3 OpenNebula clouds.
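The create-on-demand, destroy-when-idle behaviour can be summarised as a small scheduling loop. The sketch below is not VMDIRAC source code; all class, function and site names are hypothetical and only stand for the policy described above.

```python
# Illustrative sketch (not VMDIRAC code) of elastic VM scheduling: start VMs
# while jobs are waiting, remove them once the queue is empty.

class CloudSite:
    """Toy model of one cloud endpoint with a VM quota."""
    def __init__(self, name, max_vms):
        self.name = name
        self.max_vms = max_vms
        self.vms = 0

    def start_vm(self):
        # In VMDIRAC the new VM is contextualised so that it boots straight
        # into a DIRAC job agent; here we only count it.
        self.vms += 1

    def stop_idle_vms(self):
        self.vms = 0


def scheduling_cycle(waiting_jobs, cloud):
    """One pass of a simple elastic scheduler."""
    if waiting_jobs == 0:
        cloud.stop_idle_vms()          # scale down when the queue is empty
        return
    while waiting_jobs > cloud.vms and cloud.vms < cloud.max_vms:
        cloud.start_vm()               # scale up, bounded by the cloud quota


if __name__ == '__main__':
    site = CloudSite('CLOUD.Example.cn', max_vms=50)   # hypothetical cloud site
    scheduling_cycle(waiting_jobs=120, cloud=site)
    print('%s is running %d VMs' % (site.name, site.vms))
```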

[Architecture diagram: user tasks enter DIRAC; the Site Director submits pilots to grid and cluster resources, while the VM Scheduler starts pilot VMs on clouds.]