BESIII Production with Distributed Computing
Xiaomei Zhang, Tian Yan, Xianghu Zhao
Institute of High Energy Physics, Chinese Academy of Sciences, Beijing 100049, P. R. China
Sections: Introduction · Computing Model · System I: Jobs & Data · System II: Cloud · MC Production with the System · Problems and Plans
• BESIII is an experiment studying tau-charm physics at an electron-positron collider; about 3 PB of data has been generated over the last 5 years.
• It has about 400 collaboration members from 52 institutions.
• The distributed computing system for BESIII was established in 2012, based on DIRAC. It now aggregates about 3,000 CPU cores and 400 TB of storage from 10 sites.
• Three large-scale MC production tasks have been completed, with more than 150,000 jobs finished successfully.
• Raw data is taken at IHEP.
• IHEP is the central site for raw data processing, bulk reconstruction and analysis.
• Remote sites are used for MC production and analysis.
• Random trigger data is transferred to remote sites by the data transfer system between SEs.
• Simulation data produced at remote sites is written directly to the IHEP SE by the jobs.
• DST data is transferred to remote sites for particular analyses.
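The data-placement rules above can be summarized as a routing table. The sketch below is illustrative only; the names (`route_data`, the destination labels) are hypothetical and not part of BESDIRAC:

```python
# Sketch of the data-placement policy in the computing model above
# (illustrative; route_data and the destination labels are hypothetical).
PLACEMENT = {
    "raw": ["IHEP"],                  # raw data is taken and kept at IHEP
    "random_trigger": ["REMOTE_SE"],  # deployed to remote sites' SEs for MC jobs
    "simulation": ["IHEP"],           # remote MC output is written back to the IHEP SE
    "dst": ["IHEP", "REMOTE_SE"],     # DST copied to remote sites for particular analyses
}

def route_data(data_type: str) -> list[str]:
    """Return the storage destinations for a given data type."""
    try:
        return PLACEMENT[data_type]
    except KeyError:
        raise ValueError(f"unknown data type: {data_type}")

print(route_data("simulation"))  # ['IHEP']
```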
• Problems:
  • Small sites find it difficult to maintain robust SEs, so their jobs have to download random trigger data from the center, which causes a high load on the central SE and inefficient jobs.
  • Frequent access to random trigger data by split-by-event jobs causes a heavy load on the file systems at remote sites.
  • Sites lack manpower, so monitoring is important for both cloud and normal sites.
• Plans:
  • Study cloud storage to provide high-performance, robust central storage for common data access among sites.
  • Develop a user-friendly monitoring system to ease usage and administration.
Batch | Time    | Type            | Events | Jobs
1     | 2012.11 | Psi(3770)       | 200 M  | ~12,000
2     | 2013.09 | J/psi inclusive | 800 M  | ~40,000
3     | 2013.12 | Psi(3770)       | 1350 M | ~100,000
• Three large-scale MC productions have been completed.
• More than 150k jobs were done.
• 10 sites joined the production.
• The maximum number of concurrently running jobs reached 1,400.
• About 15 TB of output data was generated and transferred back to IHEP.
Job Management System
• DIRAC is used as the middleware, with BESDIRAC as the VO-specific extension.
• Jobs are pulled to remote sites by pilot agents.
• The user frontend is Ganga with a BOSS extension.
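The pilot pull model mentioned above can be sketched in a few lines: pilot agents started at a site fetch matching jobs from a central task queue, rather than jobs being pushed to sites. This is a minimal illustration, not the real DIRAC/BESDIRAC code; all class and field names here are hypothetical:

```python
# Minimal sketch of the pilot pull model (illustrative; not DIRAC code).
from collections import deque

class TaskQueue:
    """Central queue holding user jobs until a pilot pulls them."""
    def __init__(self):
        self.jobs = deque()

    def submit(self, job):
        self.jobs.append(job)

    def match(self, site_capabilities):
        """Hand out the first job whose requirements the pilot can satisfy."""
        for job in list(self.jobs):
            if job["requires"] <= site_capabilities:
                self.jobs.remove(job)
                return job
        return None  # nothing matches: the pilot exits

class Pilot:
    """Agent running on a worker node at a remote site."""
    def __init__(self, site, capabilities):
        self.site = site
        self.capabilities = capabilities

    def run(self, queue):
        executed = []
        while (job := queue.match(self.capabilities)) is not None:
            executed.append(job["name"])  # a real pilot would set up BOSS and run the payload
        return executed

tq = TaskQueue()
tq.submit({"name": "bes_sim_001", "requires": {"BOSS"}})
tq.submit({"name": "bes_sim_002", "requires": {"BOSS", "CVMFS"}})
pilot = Pilot("JINR.ru", {"BOSS"})
print(pilot.run(tq))  # ['bes_sim_001']
```

The pull model is what makes heterogeneous sites manageable: only a pilot that has already validated its environment asks for work, so a broken worker node wastes a pilot, not a user job.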
Data Management System
• DFC is used as the file and metadata catalog, with dataset functionality built on top of it.
• A dataset-based transfer system has been developed; transfers reached 10 TB/day when deploying random trigger data to remote sites' SEs.
• StoRM + Lustre is chosen as the central storage solution for integrating grid and local data.
Site           | CPU Cores | SE Capacity
IHEP.cn        | 264       | 214 TB
UCAS.cn        | 152       | –
USTC.cn        | 200~1280  | 24 TB
PKU.cn         | 100       | –
WHU.cn         | 100~300   | 39 TB
SDU.cn         | 150       | –
SJTU.cn        | 100       | –
UMN.us         | 768       | 50 TB
JINR.ru        | 100~200   | 30 TB
INFN-Torino.it | 250       | 30 TB
• Cloud resources are integrated into the system using VMDIRAC.
• Virtual machines are scheduled by VMDIRAC; each VM builds a DIRAC job execution environment automatically through contextualization.
• Virtual machines are created when there are job requests and destroyed when no more jobs are running.
• Everything is transparent to the end user, and user jobs run properly on cloud sites.
• 6 clouds have joined the system: 3 OpenStack and 3 OpenNebula.
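The elastic create-on-demand, destroy-when-idle policy described above can be sketched as a single scheduling step. This is an illustrative simplification, not the real VMDIRAC scheduler; the function name and parameters are assumptions:

```python
# Sketch of the elastic VM scheduling policy (illustrative; not VMDIRAC code).
def schedule(waiting_jobs, running_vms, idle_vms, max_vms):
    """Return (vms_to_start, vms_to_stop) for one scheduling cycle."""
    to_start = 0
    if waiting_jobs > 0:
        # boot up to one VM per waiting job, capped by the cloud quota
        to_start = max(min(waiting_jobs, max_vms - running_vms), 0)
    # VMs with no more jobs to run are destroyed to free cloud resources
    to_stop = idle_vms if waiting_jobs == 0 else 0
    return to_start, to_stop

print(schedule(waiting_jobs=5, running_vms=2, idle_vms=0, max_vms=4))  # (2, 0)
print(schedule(waiting_jobs=0, running_vms=3, idle_vms=3, max_vms=4))  # (0, 3)
```

The same contextualization step that builds the DIRAC environment inside each new VM is what keeps this policy transparent to users: a freshly booted VM looks like any other pilot-hosting worker node.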
[Architecture diagram: users submit tasks to DIRAC; the Site Director sends pilots to grid and cluster sites, while the VM Scheduler launches pilot-carrying VMs on clouds.]