HPC Cineca Infrastructure:
State of the art and towards the exascale
Elda Rossi, [email protected]
Maurizio Cremonesi, [email protected]
Cineca in a nutshell
Cineca is an interuniversity consortium composed of 70 Italian universities, several research institutions and the Ministry of Research.
Cineca is the largest Italian supercomputing facility.
Cineca is headquartered in Bologna (Casalecchio di Reno) and has offices in Rome and Milan.
HPC department at Cineca
being the Italian HPC reference and staying competitive worldwide
• #14 in the Top500
• 3500 active users
• 3500 core-h distributed
• 1833 projects active in 2017
• 1276 new projects started in 2017
Directly involved in:
• 20 EU projects
• 40 agreements with Italian research Institutions
• 12 applied research projects with industrial partners
Cineca-HPC mission
The Cineca ecosystem
Cineca acts as a hub for innovation and research, contributing to many scientific and R&D projects on an Italian and European basis.
In particular, Cineca is a PRACE hosting member and a member of EUDAT.
HPC INFRASTRUCTURE: MARCONI
• Marconi is the new Tier-0 LENOVO system that
replaced the FERMI BG/Q in Jul 2016
• Marconi was planned in three technological stages over a one-and-a-half-year programme. It reached 20 Pflop/s by the end of 2017
• Marconi is a Lenovo NextScale system equipped with
Intel chips connected with an Intel OmniPath
network:
1) BDW, 2) many-core KNL and 3) SKL processors.
• The 3 different partitions of Marconi are named A1,
A2 and A3.
• The A3 partition (5 of its 7 Pflop/s) and part of the A2 partition (1 of its 11 Pflop/s) are dedicated to EUROfusion
MARCONI A1 : Intel Broadwell
• In production since July 2016
• 792 compute nodes
• 2 x 18-core Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30 GHz: 36 cores/node
• 128GB RAM / node
• OS: Linux CentOS 7.2
• SLURM 17.11.3-2 batch scheduler
• TPP: 1 PFlop/s
MARCONI A2: Intel KNL
• In production since Jan 2017
• 3600 KnightsLanding compute nodes
• Intel Xeon Phi 7250 (68 cores) @1.40
GHz a.k.a. KNL
• 112GB RAM per node
• Configuration: Cache/Quadrant
• TPP: 11 PFlop/s
• 1/11 of this machine is dedicated to
EUROfusion as the «accelerated» partition
MARCONI A3: Intel Skylake
• In production since Aug 2017
• Racks: 21 + 10
• Nodes: 1512 + 792
• Processors: 2 x 24-core Intel Xeon 8160 CPU (Skylake) at 2.10 GHz → 48 cores/node
• 72,576 + 38,016 cores in total
• RAM: 192 GB/node of DDR4
• TPP: 5 +2 PFlop/s
• 5/7 of this machine is dedicated to EUROfusion
MARCONI’s outlook
• Since the end of 2017 MARCONI has been in its final configuration: 11 racks were upgraded
from BDW to SKL, bringing the system to 20 Pflop/s peak.
[Infrastructure diagram: newCLOUD and newGALILEO (both marked NEW!), upgrades in 2018]
D.A.V.I.D.E.
Development of an Added-Value Infrastructure Designed in Europe
• PCP (Pre-Commercial Procurement) by PRACE
• OpenPOWER-based HPC cluster (45 nodes)
• 2xPower8 processors with NVLink bus + 4xNvidia Tesla
P100 SXM2
• Designed, integrated and tested by E4. Installation in
CINECA’s data center
• Available for research projects starting early 2018
Storage
Each system has its own “local” storage for high-throughput (HT) I/O:
• Home
• Cineca_scratch
• work
A long-term storage archive (connected with TAPEs) is available across systems (DRES)
Storage
Home: for source codes, executables, small data files.
Local on each HPC system; each user has an entry, pointed to by the $HOME env variable.
Scratch: it is intended for the output of batch jobs.
Local on each HPC system; each user has an entry, pointed to by the $CINECA_SCRATCH env variable.
Work: it is intended for the output of batch jobs and for secure sharing within the project team.
Local on each HPC system; each project has an entry, pointed to by the $WORK env variable.
Storage
DRES: it is intended as a medium/long-term repository and as a shared area within the project team and across HPC platforms.
Shared area on the login nodes of the HPC systems; one entry per project, pointed to by the $DRES env variable.
Tape: it is intended as a personal long-term archive area.
Shared area; each user has an entry, pointed to by the $TAPE env variable (via LTFS).
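A minimal sketch of how these areas typically work together (the directory and file names below are illustrative placeholders, not from the original slides): pack the job output in the scratch area, keep a copy in the project work area, and archive it to DRES.

# illustrative workflow on a login node
cd $CINECA_SCRATCH/my_run_dir      # batch jobs write their output here
tar cvf results.tar out_*.dat      # pack the output files (placeholder names)
cp results.tar $WORK/              # copy for sharing within the project team
cp results.tar $DRES/              # copy to the long-term, cross-platform repository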
Access to the system
1. Interactive access: ssh client
Linux: builtin command: ssh
Windows: get a client (PuTTY, …)
2. Access via interface:
Web-based via webcompute.cineca.it
RCM: remote connection manager
3. Data Transfer:
sftp client
Globus GridFTP
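An illustrative session (the username and file names are placeholders):

# interactive access with the built-in Linux/Mac client
ssh <username>@login.marconi.cineca.it

# data transfer with an sftp client, using the same credentials
sftp <username>@login.marconi.cineca.it
sftp> put input.dat      # upload a local file
sftp> get results.tar    # download a remote file
sftp> quit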
Access to the system
ssh login.marconi.cineca.it
Last login: Wed Mar 14 09:21:32 2018 from pdl-mi-0-48.nat.cineca.it
*******************************************************************************
* Welcome to MARCONI / *
* MARCONI-fusion @ CINECA - NeXtScale cluster - CentOS 7.2! *
*
* For a guide on Marconi: *
* wiki.u-gov.it/confluence/display/SCAIUS/UG3.1%3A+MARCONI+UserGuide *
* For support: [email protected] *
*******************************************************************************
* This system is in its complete configuration and is in full-production *
===============================================================================
[mcremone@r000u06l01 ~]$
Batch mode: what does it mean
The computing servers are used by many users all the time, but:
• each user would like to be the only user of the system
• or at least that other users do not interfere with his/her jobs
A way to achieve this automatically is to use a batch job management system.
The batch manager:
• looks at each job’s resource needs
• controls the available resources
• assigns resources to each job
• or, if none are free, puts the request in a waiting queue
Batch mode: what does it mean
The batch system needs the following information for each job:
• which resources (nodes, cores, accelerators, memory, …, licences)
• for how much time
But the system administrator also needs to know who is paying for the job.
So the user must bundle his/her job with all this information.
SLURM Workload Manager
SLURM stands for "Simple Linux Utility for Resource
Management"
It is an open-source, highly scalable job scheduling system that takes care of:
• allocating access to resources
• starting, executing and monitoring jobs
• managing the queue of pending jobs
SLURM Workload Manager
cd $CINECA_SCRATCH/exec_dir
Write your script using an available
editor:
vi script
The script file has 2 sections:
• commands for the scheduler (resources + Account_no)
• commands for the system (unix commands)
Submit the script to the scheduler
sbatch script
Wait … and check
squeue -l -u <username>
squeue -l -j <job_id>
The job completes: you can get final
results
ls -ltr
SLURM Workload Manager
#!/bin/bash
#SBATCH --nodes=1 # nodes
#SBATCH --ntasks-per-node=4 # MPI tasks/node
#SBATCH --cpus-per-task=4 # OpenMP threads/task
#SBATCH --time=1:00:00 # max 24:00:00
#SBATCH --mem=118GB # max memory/node
#SBATCH --account=<your_project> # account name
#SBATCH --partition=XXX_usr_prod # partition name
#SBATCH --qos=<qos_name> # qos name
#SBATCH --job-name=myJob # job name
#SBATCH --error=errjobfile-%J.err # stderr file
#SBATCH --output=outjobfile-%J.out # stdout file
cd $SLURM_SUBMIT_DIR
module load …
… execution commands

See our guide for more examples:
https://wiki.u-gov.it/confluence/display/SCAIUS/UG2.5.1%3A+Batch+Scheduler+SLURM
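For instance, the execution section of a hybrid MPI/OpenMP job matching the directives above might look like this (module names and the executable are illustrative placeholders, not from the original slides):

cd $SLURM_SUBMIT_DIR
module load intel intelmpi                    # illustrative module names
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK   # 4 OpenMP threads per MPI task
srun ./my_app                                 # placeholder executable: 4 MPI tasks on 1 node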
SLURM Workload Manager
> sbatch myjob
> squeue -l -u mcremone
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REA
470455 knl_usr_p mm_mpiom mcremone R 18:47 1 r093c11s04
> scancel 470455
More info on our website:
www.hpc.cineca.it → For User → Documentation → User Guide
www.hpc.cineca.it → For User → Documentation → Other Documents → SLURM