HPC Cineca Infrastructure:
State of the art and towards the exascale
Elda Rossi, [email protected]
Maurizio Cremonesi, [email protected]
Cineca in a nutshell
Cineca is an interuniversity consortium composed of 70 Italian universities, several research institutions and the Ministry of Research.
Cineca is the largest Italian supercomputing facility.
Cineca is headquartered in Bologna (Casalecchio di Reno) and has offices in Rome and Milan.
HPC department at Cineca
being the Italian HPC reference and staying competitive worldwide
• #14 in the Top500
• 3500 active users
• 3500 core-h distributed
• 1833 projects active in 2017
• 1276 new projects started in 2017
Directly involved in:
• 20 EU projects
• 40 agreements with Italian research Institutions
• 12 applied research projects with industrial partners
Cineca-HPC mission
The Cineca ecosystem
Cineca acts as a hub for innovation and research, contributing to many scientific and R&D projects on an Italian and European basis.
In particular, Cineca is a PRACE hosting member and a member of EUDAT.
HPC INFRASTRUCTURE: MARCONI
• Marconi is the new Tier-0 LENOVO system that
replaced the FERMI BG/Q in Jul 2016
• Marconi was planned in three technological stages over a one-and-a-half-year programme. It reached 20 Pflop/s by the end of 2017
• Marconi is a Lenovo NextScale system equipped with
Intel chips connected with an Intel OmniPath
network:
1) BDW, 2) many-core KNL and 3) SKL processors.
• The 3 different partitions of Marconi are named A1,
A2 and A3.
• The A3 partition (5 of its 7 Pflop/s) and part of the A2 partition (1 of its 11 Pflop/s) are dedicated to EUROfusion
MARCONI A1 : Intel Broadwell
• In production since July 2016
• 792 compute nodes
• 2 x 18-core Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30 GHz: 36 cores/node
• 128GB RAM / node
• OS: Linux CentOS 7.2
• SLURM 17.11.3-2 batch scheduler
• TPP: 1 PFlop/s
MARCONI A2: Intel KNL
• In production since Jan 2017
• 3600 KnightsLanding compute nodes
• Intel Xeon Phi 7250 (68 cores) @1.40
GHz a.k.a. KNL
• 112GB RAM per node
• Configuration: Cache/Quadrant
• TPP: 11 PFlop/s
• 1/11 of this machine is dedicated to
EUROfusion as the «accelerated» partition
MARCONI A3: Intel Skylake
• In production since Aug 2017
• Racks: 21 + 10
• Nodes: 1512 + 792
• Processors: 2 x 24-core Intel Xeon 8160 CPU (Skylake) at 2.10 GHz → 48 cores/node
• 72,576 + 38,016 cores in total
• RAM: 192 GB/node of DDR4
• TPP: 5 +2 PFlop/s
• 5/7 of this machine is dedicated to EUROfusion
MARCONI’s outlook
• Since the end of 2017 MARCONI has been in its final configuration: 11 racks were upgraded
from BDW to SKL, bringing the system to 20 Pflop/s peak.
[Infrastructure diagram: newCLOUD and newGALILEO (both marked NEW!), upgrades in 2018]
D.A.V.I.D.E.
Development of an Added-Value Infrastructure Designed in Europe
• PCP (Pre-Commercial Procurement) by PRACE
• OpenPOWER-based HPC cluster (45 nodes)
• 2xPower8 processors with NVLink bus + 4xNvidia Tesla
P100 SXM2
• Designed, integrated and tested by E4. Installation in
CINECA’s data center
• Available for research projects starting early 2018
Storage
Each system has its own “local” storage for high-throughput (HT) I/O:
• Home
• Cineca_scratch
• work
A long-term storage archive (connected with TAPEs) is available across systems (DRES)
Storage
Home: for source codes, executables, small data files.
Local on each HPC system; each user has an entry, pointed to by the $HOME env variable.
Scratch: it is intended for the output of batch jobs.
Local on each HPC system; each user has an entry, pointed to by the $CINECA_SCRATCH env variable.
Work: it is intended for the output of batch jobs and for secure sharing within the project team.
Local on each HPC system; each project has an entry, pointed to by the $WORK env variable.
Storage
DRES: it is intended as a medium/long-term repository and as a shared area within the project team and across HPC platforms.
Shared area on the login nodes of the HPC systems; one entry per project, pointed to by the $DRES env variable.
Tape: it is intended as a personal long-term archive area.
Shared area; each user has an entry, pointed to by the $TAPE env variable (via LTFS).
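A minimal sketch of how these areas typically work together (the directory and file names below are illustrative placeholders, not from the original slides): pack the job output in the scratch area, keep a copy in the project work area, and archive it to DRES.

# illustrative workflow on a login node
cd $CINECA_SCRATCH/my_run_dir      # batch jobs write their output here
tar cvf results.tar out_*.dat      # pack the output files (placeholder names)
cp results.tar $WORK/              # copy for sharing within the project team
cp results.tar $DRES/              # copy to the long-term, cross-platform repository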
Access to the system
1. Interactive access: ssh client
Linux: builtin command: ssh
Windows: get a client (PuTTY, …)
2. Access via interface:
Web-based via webcompute.cineca.it
RCM: remote connection manager
3. Data Transfer:
sftp client
Globus GridFTP
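An illustrative session (the username and file names are placeholders):

# interactive access with the built-in Linux/Mac client
ssh <username>@login.marconi.cineca.it

# data transfer with an sftp client, using the same credentials
sftp <username>@login.marconi.cineca.it
sftp> put input.dat      # upload a local file
sftp> get results.tar    # download a remote file
sftp> quit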
Access to the system
ssh login.marconi.cineca.it
Last login: Wed Mar 14 09:21:32 2018 from pdl-mi-0-48.nat.cineca.it
*******************************************************************************
* Welcome to MARCONI / *
* MARCONI-fusion @ CINECA - NeXtScale cluster - CentOS 7.2! *
*
* For a guide on Marconi: *
* wiki.u-gov.it/confluence/display/SCAIUS/UG3.1%3A+MARCONI+UserGuide *
* For support: [email protected] *
*******************************************************************************
* This system is in its complete configuration and is in full-production *
===============================================================================
[mcremone@r000u06l01 ~]$
Batch mode: what does it mean
The computing servers are used by many users all the time, but:
• each user would like to be the only user of the system
• or at least that other users do not interfere with his/her jobs
A way to achieve this automatically is to use a batch job management system.
The batch manager:
• looks at each job’s resource needs
• controls the available resources
• assigns resources to each job
• or, if none are free, puts the request in a waiting queue
Batch mode: what does it mean
The batch system needs the following information for each job:
• which resources (nodes, cores, accelerators, memory, …, licences)
• for how much time
But the system administrator also needs to know who is paying for the job.
So the user must bundle his/her job with all this information.
SLURM Workload Manager
SLURM stands for "Simple Linux Utility for Resource
Management"
It is an open-source, highly scalable job scheduling system that takes care of:
• allocating access to resources
• starting, executing and monitoring jobs
• managing the queue of pending jobs
SLURM Workload Manager
cd $CINECA_SCRATCH/exec_dir
Write your script using an available
editor:
vi script
The script file has 2 sections:
• commands for the scheduler (resources + Account_no)
• commands for the system (unix commands)
Submit the script to the scheduler
sbatch script
Wait … and check
squeue -l -u <username>
squeue -l -j <job_id>
The job completes: you can get final
results
ls -ltr
SLURM Workload Manager
#!/bin/bash
#SBATCH --nodes=1 # nodes
#SBATCH --ntasks-per-node=4 # MPI tasks/node
#SBATCH --cpus-per-task=4 # OpenMP threads/task
#SBATCH --time=1:00:00 # max 24:00:00
#SBATCH --mem=118GB # max memory/node
#SBATCH --account=<your_project> # account name
#SBATCH --partition=XXX_usr_prod # partition name
#SBATCH --qos=<qos_name> # qos name
#SBATCH --job-name=myJob # job name
#SBATCH --error=errjobfile-%J.err # stderr file
#SBATCH --output=outjobfile-%J.out # stdout file
cd $SLURM_SUBMIT_DIR
module load …
… execution commands

See our guide for more examples:
https://wiki.u-gov.it/confluence/display/SCAIUS/UG2.5.1%3A+Batch+Scheduler+SLURM
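For instance, the execution section of a hybrid MPI/OpenMP job matching the directives above might look like this (module names and the executable are illustrative placeholders, not from the original slides):

cd $SLURM_SUBMIT_DIR
module load intel intelmpi                    # illustrative module names
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK   # 4 OpenMP threads per MPI task
srun ./my_app                                 # placeholder executable: 4 MPI tasks on 1 node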
SLURM Workload Manager
> sbatch myjob
> squeue -l -u mcremone
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REA
470455 knl_usr_p mm_mpiom mcremone R 18:47 1 r093c11s04
> scancel 470455
More info on our website:
www.hpc.cineca.it → For User → Documentation → User Guide
www.hpc.cineca.it → For User → Documentation → Other Documents → SLURM