Scalability / Data / Tasks
Meeting Scalability Requirements with Large Data and Complex Tasks: Adapting Existing Technologies and Best Practices in Slovenia
Jan Jona Javoršek
Jožef Stefan Institute
[email protected] – Slovenian Initiative for National Grid
Jožef Stefan Institute
http://www.ijs.si/ http://www.sling.si/
3/29
Historical
CDC Cyber 74
CONVEX C3860
Zuse Z 23
4/29
SLING – Connected Centres
● Arctur* – 1024°
● Arnes – 4400°
● Atos* – 3000
● CIPKeBiP – 990
● SiGNET – 4200
● UNG – 120
● R4* – 1800°
● NSC – 1800°
8 sites
> 18,000 cores
(> 11,000 ARC-active)
> 1 PB disk
> 4 million jobs / year
HPC, GPGPU, chroot
> 80% SLO capacity
Candidates:
● Meteo – 2200°
● CI – 2000°
● ME – 1050°
5/29
SLING users
● Arnes NREN users
● Cluster owners*
● Projects*
● Individual researchers
● University professors
● Student groups
*not always ARC
6/29
Use Cases
● Particle Physics:
– ATLAS
– Pierre Auger
● Theoretical Physics
● Meteo/Geo Modelling
● Fluid Dynamics
● Reactor Physics Simulations
Pierre Auger Observatory
7/29
Use Cases
● Life Sciences, mostly computational (bio-)chemistry and genomics
– IJS users (biology, chemistry, knowledge technologies)
– Collaboration with EMBL
– Diagnostic genomics
– ELIXIR
8/29
Use Cases
● Knowledge technologies
– Modelling for different fields
– Genetic algorithms
– Big/Web data analysis
– Advanced computational linguistic models
– CLARIN.si
9/29
Steam explosion moment
10/29
Power distribution for the Krško NPP reactor
Parallel Monte Carlo simulation of neutron transport, F-8 department
11/29
Innovation?
● batch system
● virtualisation
● network?
12/29
ARC and LRMS (batch system)
13/29
ARC Computing Element
14/29
ARC user accounts
15/29
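The three slides above walk from user accounts through the ARC Computing Element down to the LRMS. As a toy illustration of what a user actually hands to ARC, a minimal xRSL job description can be assembled like this; the helper function is invented for the sketch, and only the xRSL attribute names are real:

```python
# Minimal sketch of building an xRSL job description for an ARC CE.
# The make_xrsl() helper is hypothetical; xRSL itself is ARC's job
# description language.

def make_xrsl(executable, arguments, job_name="demo", cpu_time="10 minutes"):
    """Build a minimal xRSL job description string."""
    return (
        f'&(executable="{executable}")'
        f'(arguments="{arguments}")'
        f'(jobName="{job_name}")'
        f'(stdout="out.txt")'
        f'(cpuTime="{cpu_time}")'
    )

xrsl = make_xrsl("/bin/echo", "hello grid")
# The description is then written to a file and handed to the ARC
# client, roughly:  arcsub -c <cluster> job.xrsl
```

The CE translates such a description into whatever the local batch system (SLURM, Torque, ...) expects, which is what keeps the user interface uniform across the heterogeneous sites listed earlier.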
Mix'n'match...
CERN Agile model, CVMFS, gLite, NorduGrid ARC, SLURM, Torque, OpenStack, KeyStone, VOMS, dCache, Puppet, OpenMP, Globus, science portals, oVirt, OpenNebula, PKI, VRC, Cinder, gFTP, Glance, Salt, Ceph, OpenCL, CUDA
16/29
Software Deployment and Virtualization
● Admin install
● Compile job
● Install job
● Shared disk
● Shared image
● Environment Modules
● Run Time Environments
● CHROOTs
● Containers
● Docker
● Shifter
17/29
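A Run Time Environment is essentially a named promise that a cluster resolves to one of the deployment methods above. A toy sketch of that resolution, where every RTE name, module, and path is an invented example rather than real SLING configuration:

```python
# Hypothetical mapping from RTE names a job may request to the setup
# each deployment method implies: an admin install exposed via
# environment modules, a shared CVMFS tree, or a container image.
RTE_SETUP = {
    "APPS/CHEM/GROMACS-2021": ["module load gromacs/2021"],
    "ENV/ATLAS": ["source /cvmfs/atlas.example.org/setup.sh"],
    "ENV/CONTAINER-R": ["singularity exec r-base.sif"],
}

def setup_commands(requested_rtes):
    """Collect the shell setup lines for every RTE a job asks for,
    failing early if the cluster does not provide one of them."""
    lines = []
    for rte in requested_rtes:
        if rte not in RTE_SETUP:
            raise KeyError(f"RTE not provided on this cluster: {rte}")
        lines.extend(RTE_SETUP[rte])
    return lines
```

Jobs that request an unknown RTE can then be rejected at submission time instead of failing mid-run on the worker node.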
Storage
● Basic support
● Short-term / local storage
● Medium-term storage
● Long-term storage
18/29
User-Facing Issues
● Batch / ARC interface / PKI / VOMS
● Software installations and use
● Submission delays, error reporting and debugging
● MPI scalability difficulties
● Understanding of job and cluster topology
● GPGPU use
19/29
Groups and Projects
● Job and task management scalability
● Data management → task managers
● Storage and throughput → hardware and cluster setup
● Opportunistic resource use
● Resource optimization → innovative job models
20/29
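The "data management → task managers" point above usually comes down to splitting a dataset into bounded per-job work units so that job count and job size stay manageable. A minimal sketch, with invented names:

```python
def make_tasks(input_files, files_per_job):
    """Split a dataset's file list into per-job task descriptions,
    so a task manager can submit one grid job per chunk."""
    tasks = []
    for start in range(0, len(input_files), files_per_job):
        chunk = input_files[start:start + files_per_job]
        tasks.append({"task_id": len(tasks), "inputs": chunk})
    return tasks
```

Tuning `files_per_job` is one of the "innovative job models" levers: larger chunks amortise submission delays, smaller ones improve scheduling and retry granularity.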
ATLAS as an example
● ~100 distributed sites
● 250k cores used all the time
● 200 PB of storage space
● 1M jobs/day
● 2 PB of data transferred per day between computing sites
● Sites include: WLCG GRID sites, HPCs, Clouds, Volunteer computing
21/29
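For a sense of scale, the numbers above imply a substantial sustained transfer rate. A quick back-of-the-envelope check (decimal units assumed):

```python
PB = 10**15                     # decimal petabyte, in bytes
SECONDS_PER_DAY = 24 * 60 * 60

# 2 PB moved between sites per day ...
avg_rate_gb_s = 2 * PB / SECONDS_PER_DAY / 10**9
# ... is roughly 23 GB/s sustained, around the clock.

# 1M jobs/day on ~250k busy cores means each core turns over
# about 4 job slots per day.
jobs_per_core_per_day = 1_000_000 / 250_000
```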
aCT: ARC Control Tower
Components:
● Submitter
● Status checker
● Fetcher
● (app verification)
● Cleaner
[Diagram: aCT architecture — an app engine and app table feed the ARC engine and ARC table, driven by ARC and app configuration; jobs arrive from an external job provider, state is kept in a DB (Oracle/MySQL), and work is dispatched to sites 1–3, each with an ARC CE in front of a cluster.]
22/29
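The component list above can be caricatured as a polling cycle. This toy sketch mirrors only the slide's component names (Submitter, Status checker, Fetcher, Cleaner) and keeps state in plain Python attributes instead of the Oracle/MySQL tables the real aCT uses:

```python
class ControlTowerSketch:
    """Toy aCT-style cycle over a set of ARC CE endpoints."""

    def __init__(self, sites):
        self.sites = sites      # ARC CE endpoints
        self.waiting = []       # jobs pulled in from the app table
        self.active = {}        # job -> site it was submitted to
        self.finished = []      # jobs whose outputs were fetched

    def submitter(self):
        # Round-robin waiting jobs over the configured sites.
        while self.waiting:
            job = self.waiting.pop(0)
            self.active[job] = self.sites[len(self.active) % len(self.sites)]

    def status_checker(self, done_jobs):
        # The real checker polls each CE; here the caller reports results.
        return [j for j in done_jobs if j in self.active]

    def fetcher_and_cleaner(self, done_jobs):
        for job in self.status_checker(done_jobs):
            self.active.pop(job)        # outputs fetched,
            self.finished.append(job)   # job record cleaned up
```

The pull model is the key design point: sites never need inbound connectivity, since the control tower pushes jobs to the CEs and polls them for status, which is what lets HPC centres and opportunistic resources join without firewall changes.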
Opportunistic Resource Use
● Grid clusters
● HPC clusters
● Private computers
● Public (commercial) cloud
● Microjobs
23/29
ATLAS scaling
2010
● Planned data distribution
● Jobs go to data
● Multi-hop data flows
● Poor T2 networking across regions
● ~20 AOD copies distributed worldwide
24/29
ATLAS scaling
2010
● Planned data distribution
● Jobs go to data
● Multi-hop data flows
● Poor T2 networking across regions
● ~20 AOD copies distributed worldwide
2013
● Planned & dynamic data distribution
● Jobs go to data & data to free sites
● Direct data flows for most T2s
● Many T2s connected to a 10 Gb/s link
● 4 AOD copies distributed worldwide
25/29
Social Component
● Accessibility beyond large projects
● Long-term funding
● Perception of public clouds
● Not-invented-here syndrome
● Users with no Unix experience
● Sustainability pressure
26/29
People Involved
Andrej Filipčič, JSI
Barbara Krašovec, Arnes, JSI
Dejan Lesjak, JSI
Janez Srakar, JSI
Jan Jona Javoršek, JSI
+ 4 site administrators
National Initiative:http://www.sling.si/
27/29
Thanks!
Questions?
28/29
New Computing Centre
● 200 m², slightly off-site
● New network installation
● Water cooling
● Not enough power on-site yet
● Housing Pikolit, NSC, parts of others
● Interesting issues on cost sharing ...
29/29
New Cluster
● Grid + HPC
● GPGPU: 16 x K80
● NorduGrid ARC + SLURM
● Considering EGI
● Users:
– IJS departments
– related research
– supported EU infrastructures
NSC Cluster in Numbers
● ~1800 cores
● ~35 TB scratch
● ~35 TB storage
● ~8 TB RAM