
Page 1: Choosing Resources Wisely - HARVARD UNIVERSITY

FAS Research Computing

Choosing Resources Wisely

Plamen Krastev
Office: 38 Oxford, Room 117
Email: [email protected]

Page 2: Choosing Resources Wisely - HARVARD UNIVERSITY

Objectives

• Inform you of available computational resources
• Help you choose appropriate computational resources for your research
• Provide guidance for scaling up your applications and performing computations more efficiently
• More efficient use = more resources available to do research
• Enable you to “Work smarter, better, faster”

Page 3: Choosing Resources Wisely - HARVARD UNIVERSITY

Outline

• Choosing computational resources
• Overview of available RC resources
• Partition / Queue
• Time
• Number of nodes and cores
• Memory
• Storage
• Examples

Page 4: Choosing Resources Wisely - HARVARD UNIVERSITY

What resources do I need?

• Is my code serial or parallel?
• How many cores and/or nodes does it need?
• How much memory does it require?
• How long does my code take to run?
• How big is the input / output data for each run?
• How is the input data read by the code (e.g., hardcoded, keyboard, parameter/data file(s), external database/website, etc.)?

Page 5: Choosing Resources Wisely - HARVARD UNIVERSITY

What resources do I need?

• How is the output data written by the code (standard output/screen, data file(s), etc.)?
• How many tasks/jobs/runs do I need to complete?
• What is my timeframe / deadline for the project (e.g., paper, conference, thesis, etc.)?
• What computational resources are available at Research Computing?

Page 6: Choosing Resources Wisely - HARVARD UNIVERSITY

RC resources: Odyssey

Odyssey is a large-scale heterogeneous HPC cluster.

Compute:
• 60,000+ compute cores (and increasing)
• Cores per node: 8 to 64
• Memory per node: 12 GB to 512 GB (4 GB/core)
• 1,000,000+ NVIDIA GPU cores

Storage:
• Over 35 PB of storage
• Home directories: 100 GB
• Lab space: initial 4 TB at $0, with expansion available for purchase at $45/TB/year
• Local scratch: 270 GB/node
• Global scratch: high-performance shared scratch, 1 PB total, Lustre file system

https://rc.fas.harvard.edu/resources/odyssey-storage

Page 7: Choosing Resources Wisely - HARVARD UNIVERSITY

RC resources: Odyssey

Odyssey is a large-scale heterogeneous HPC cluster.

Software:
• CentOS
• SLURM job manager
• 1,000+ scientific tools and programs: https://portal.rc.fas.harvard.edu/apps/modules

Interconnect:
• 2 underlying networks connecting 3 data centers
• TCP/IP network
• Low-latency 56 Gb/s InfiniBand network: inter-node parallel computing, fast access to Lustre-mounted storage

Hosted Machines:
• 300+ virtual machines
• Lab instrument workstations

Page 8: Choosing Resources Wisely - HARVARD UNIVERSITY

Available Storage

• Home Directories: size limit 100 GB; available on all cluster nodes + desktop/laptop; backup: hourly snapshot + daily offsite; retention: indefinite; performance: moderate, not suitable for high I/O; cost: free.
• Lab Storage: size limit 4 TB+; available on all cluster nodes + desktop/laptop; backup: daily offsite; retention: indefinite; performance: moderate, not suitable for high I/O; cost: 4 TB free, expansion at $45/TB/yr.
• Local Scratch: size limit 270 GB/node; available on the local compute node only; backup: none; retention: job duration; performance: suited for small-file I/O-intensive jobs; cost: free.
• Global Scratch: size limit 1.2 PB total; available on all cluster nodes; backup: none; retention: 90 days; performance: appropriate for large-file I/O-intensive jobs; cost: free.
• Persistent Research Data: size limit 3 PB; available on IB-connected cluster nodes only; backup: none (external repos); retention: 3-9 months; performance: appropriate for large I/O-intensive jobs; cost: free.

Page 9: Choosing Resources Wisely - HARVARD UNIVERSITY

Partition / Queue

• general: time limit 7 days; 177 nodes; 64 cores/node; 256 GB/node
• serial_requeue: time limit 7 days; 1,071 nodes; 8-64 cores/node; 12-512 GB/node
• interact: time limit 3 days; 8 nodes; 64 cores/node; 256 GB/node
• bigmem: no time limit; 7 nodes; 64 cores/node; 512 GB/node
• unrestricted: no time limit; 8 nodes; 64 cores/node; 256 GB/node
• Lab queues: no time limit; 1,154 nodes; 8-64 cores/node; 12-512 GB/node

Batch jobs:
#SBATCH -p general # Partition name

Interactive or test jobs:
srun -p interact OTHER_OPTIONS

https://rc.fas.harvard.edu/resources/running-jobs/#SLURM_partitions
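The partition limits above can also be queried directly on the cluster with SLURM's sinfo command; a minimal sketch (the output format string is just one reasonable choice, not from the slides):

# Partition name, time limit, node count, CPUs per node, memory per node (MB)
sinfo -o "%P %l %D %c %m"
# Restrict to a single partition, e.g. general
sinfo -p general -o "%P %l %D %c %m"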

Page 10: Choosing Resources Wisely - HARVARD UNIVERSITY

Time

How long does my code take to run?

Batch jobs:
#SBATCH -p serial_requeue
#SBATCH -t 0-02:00 # Time in D-HH:MM

Interactive or test jobs:
srun -t 0-02:00 -p interact OTHER_JOB_OPTIONS
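One way to calibrate the time request is to look at the wall time of a comparable finished job; a minimal sketch using the sacct accounting command covered later in this deck (the job ID is a placeholder):

# Compare actual run time against the requested limit
sacct -j 12345678 -o JobID,Elapsed,Timelimit,State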

Page 11: Choosing Resources Wisely - HARVARD UNIVERSITY

Number of nodes and cores

Is my code serial or parallel?

Serial (single-core) jobs

Batch jobs:
#SBATCH -p serial_requeue
#SBATCH -c 1 # Number of cores

Interactive or test jobs:
srun -c 1 -p interact OTHER_JOB_OPTIONS

Core / Thread / Process / CPU
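The #SBATCH lines go into a job script that is then handed to the scheduler; a minimal sketch of the submit-and-check cycle (the script name is a placeholder):

sbatch serial_job.sh   # submit the batch script; SLURM prints the job ID
squeue -u $USER        # check its state (PD = pending, R = running)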

Page 12: Choosing Resources Wisely - HARVARD UNIVERSITY

Number of nodes and cores

Parallel shared memory (single-node) jobs

Examples:
• OpenMP (Fortran, C/C++)
• MATLAB Parallel Computing Toolbox (PCT)
• Python (e.g., threading, multiprocessing)
• R (e.g., multicore)

Batch jobs:
#SBATCH -p general # Partition
#SBATCH -N 1 # Number of nodes
#SBATCH -c 4 # Number of cores (per task)
srun -c 4 PROGRAM PROGRAM_OPTIONS

Interactive or test jobs:
srun -p interact -N 1 -c 4 OTHER_OPTIONS
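For OpenMP codes, the thread count is normally tied to the -c request so the program uses exactly the cores it was allocated; a minimal sketch (this mirrors the fuller OpenMP example later in the deck; the executable name is a placeholder):

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK   # match threads to requested cores
srun -c $SLURM_CPUS_PER_TASK ./my_openmp_code.x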

Page 13: Choosing Resources Wisely - HARVARD UNIVERSITY

Number of nodes and cores

Parallel distributed memory (multi-node) jobs

Examples:
• MPI (openmpi, impi, mvapich) with Fortran or C/C++ code
• MATLAB Distributed Computing Server (DCS)
• Python (e.g., mpi4py)
• R (e.g., Rmpi, snow)

Batch jobs:
#SBATCH -p general # Partition
#SBATCH -n 4 # Number of tasks
srun -n 4 PROGRAM PROGRAM_OPTIONS

Interactive or test jobs:
srun -p interact -n 4 OTHER_OPTIONS
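With -n alone, SLURM is free to place the tasks on one or several nodes. If the layout matters, the node count and tasks per node can be pinned explicitly; a hedged sketch using standard SLURM options not shown on the slide:

#SBATCH -N 2                  # number of nodes
#SBATCH --ntasks-per-node=2   # MPI tasks per node (2 nodes x 2 tasks = 4 tasks)
srun -n 4 PROGRAM PROGRAM_OPTIONS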

Page 14: Choosing Resources Wisely - HARVARD UNIVERSITY

Memory

Serial and parallel shared memory (single-node) jobs

Batch jobs:
#SBATCH -p serial_requeue # Partition
#SBATCH --mem=4000 # Memory / node in MB

Interactive or test jobs:
srun --mem=4000 -p interact OTHER_OPTIONS

Parallel distributed memory (multi-node) jobs

Batch jobs:
#SBATCH -p general # Partition
#SBATCH -n 4 # Number of tasks
#SBATCH --mem-per-cpu=4000 # Memory / core in MB

Interactive or test jobs:
srun --mem-per-cpu=4000 -n 4 -p interact OTHER_OPTIONS
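The following slides cover checking memory with top (while a job runs interactively) and sacct (after it completes). As an additional option not covered on the slides, SLURM's sstat command can report the memory high-water mark of a batch job that is still running; a sketch (the job ID is a placeholder; the ".batch" step suffix may be needed):

sstat -j 12345678.batch --format=JobID,MaxRSS,MaxVMSize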

Page 15: Choosing Resources Wisely - HARVARD UNIVERSITY

Memory

How much memory does my code require?

• Understand your code and how the algorithms scale analytically
• Run an interactive job and monitor memory usage (with the “top” Unix command)
• Run a test batch job and check memory usage after the job has completed (with the “sacct” SLURM command)

Page 16: Choosing Resources Wisely - HARVARD UNIVERSITY

Memory

Know your code

Example: A real*8 (Fortran), or double (C/C++), matrix of dimension 100,000 x 100,000 requires ~80 GB of RAM.

Data type (Fortran / C) and size in bytes:
• integer*4 / int: 4
• integer*8 / long: 8
• real*4 / float: 4
• real*8 / double: 8
• complex*8 / float complex: 8
• complex*16 / double complex: 16
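To make the ~80 GB figure concrete, the estimate is just the number of elements times bytes per element; a minimal back-of-the-envelope sketch in the shell (decimal units, as on the slides):

N=100000
echo "$(( N * N * 8 / 1000000000 )) GB"   # 100,000 x 100,000 real*8 -> 80 GB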

Page 17: Choosing Resources Wisely - HARVARD UNIVERSITY

Memory

Run an interactive job and monitor memory usage (with the “top” Unix command).

Example: Check the memory usage of a matrix diagonalization code.

Request an interactive bash shell session:
srun -p interact -n 1 -t 0-02:00 --pty --mem=4000 bash

Run the code, e.g.,
./matrix_diag.x

Open a new shell terminal and ssh to the compute node where the interactive job was dispatched, e.g.,
ssh holy2a18307

In the new shell terminal run top, e.g.,
top -u pkrastev
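If you do not remember which node the interactive job landed on, squeue will report it; a minimal sketch (the username is the slide's example):

squeue -u pkrastev -o "%i %P %j %N"   # job ID, partition, job name, node list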

Page 18: Choosing Resources Wisely - HARVARD UNIVERSITY

Memory

Run 1:
Matrix dimension = 3,000 x 3,000 (real*8)
Needs 3,000 x 3,000 x 8 / 1,000,000 = ~72 MB of RAM

Page 19: Choosing Resources Wisely - HARVARD UNIVERSITY

Memory

Run 2: Input size changed
Double the matrix dimension, quadruple the required memory.
Matrix dimension = 6,000 x 6,000 (real*8)
Needs 6,000 x 6,000 x 8 / 1,000,000 = ~288 MB of RAM

Page 20: Choosing Resources Wisely - HARVARD UNIVERSITY

sacct overview

• sacct queries the SLURM accounting database.
  Every 30 seconds the node collects the CPU and memory usage of all process IDs belonging to a given job; after the job ends, this data is sent to slurmdb.
• Common flags:
  -j jobid or --name=jobname
  -S YYYY-MM-DD and -E YYYY-MM-DD
  -o output_options

Example output options:
JobID,JobName,NCPUS,Nnodes,Submit,Start,End,CPUTime,TotalCPU,ReqMem,MaxRSS,MaxVMSize,State,Exit,Node

http://slurm.schedmd.com/sacct.html
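A sketch putting the flags and output fields together (the job ID, job name, and dates are placeholders; "Exit" and "Node" on the slide correspond to sacct's ExitCode and NodeList field names):

# Query one finished job by ID and print the fields listed above
sacct -j 12345678 -o JobID,JobName,NCPUS,NNodes,Submit,Start,End,CPUTime,TotalCPU,ReqMem,MaxRSS,MaxVMSize,State,ExitCode,NodeList
# Or select by job name and date range instead
sacct --name=lapack_test -S 2016-01-01 -E 2016-12-31 -o JobID,Elapsed,State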

Page 21: Choosing Resources Wisely - HARVARD UNIVERSITY

Memory

Run a test batch job and check memory usage after the job has completed (with the “sacct” SLURM command).

Example:

[pkrastev@sa01 Resources]$ sacct -o ReqMem,MaxRSS -j 70446364
    ReqMem     MaxRSS
---------- ----------
     320Mn    286648K

MaxRSS = 286648 KB = 286.648 MB
ReqMem = 320 MB per node ("Mn"), roughly 10% above MaxRSS

https://rc.fas.harvard.edu/resources/faq/how-to-know-what-memory-limit-to-put-on-my-job
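The implied rule of thumb is to request a bit more than the measured MaxRSS; a minimal sketch of that arithmetic (values from the slide; the ~10% margin is just the slide's example):

MAXRSS_MB=287                               # measured MaxRSS, rounded up, in MB
echo "--mem=$(( MAXRSS_MB * 110 / 100 ))"   # ~10% head room -> --mem=315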

Page 22: Choosing Resources Wisely - HARVARD UNIVERSITY

Storage

Home directories (/n/home*) and lab storage are not appropriate for I/O-intensive jobs or large numbers of jobs. Typical use is job scripts, in-house analysis codes, and self-installed software.

For jobs that create a high volume of small files (< 10 MB), use local scratch. You need to copy your input data to /scratch and move output data to a different location after the job completes.

For I/O-intensive jobs, with large data files (> 100 MB) and/or a large number of data files (hundreds of 10-100 MB files), use the global scratch file system /n/regal.

https://rc.fas.harvard.edu/policy-scratch
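A minimal sketch of the copy-in / copy-out pattern described above, inside a batch job; the directory layout under /scratch and the program and file names are illustrative assumptions, not site policy:

SCRATCH_DIR=/scratch/$USER/$SLURM_JOB_ID      # illustrative per-job directory
mkdir -p "$SCRATCH_DIR"
cp "$SLURM_SUBMIT_DIR"/input.dat "$SCRATCH_DIR"/
cd "$SCRATCH_DIR"
"$SLURM_SUBMIT_DIR"/my_code.x input.dat > output.dat   # placeholder program
cp output.dat "$SLURM_SUBMIT_DIR"/
cd "$SLURM_SUBMIT_DIR" && rm -rf "$SCRATCH_DIR"        # clean up local scratch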

Page 23: Choosing Resources Wisely - HARVARD UNIVERSITY

Storage

60 Oxford St:
• Initial lab shares (4 TB)
• Legacy equipment

1 Summer Street:
• Personal home directories
• Purchased lab shares
• Older lab-owned compute nodes

Holyoke, MA:
• Global scratch high-performance file system
• Compute nodes > 2012 (33K+ cores)

Topology may affect the efficiency of your work! For best performance, storage needs to be close to the compute nodes.

Page 24: Choosing Resources Wisely - HARVARD UNIVERSITY

Storage Utilization

Use the “du” Unix command to check disk usage, e.g.,

du -h $HOME
...
37G /n/home06/pkrastev

https://en.wikipedia.org/wiki/Du_(Unix)
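To see which subdirectories account for most of the usage, a common variant (GNU du and sort options; a sketch, not from the slides):

du -h --max-depth=1 $HOME | sort -h   # per-subdirectory totals, smallest to largest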

Page 25: Choosing Resources Wisely - HARVARD UNIVERSITY

Examples

Serial application

#!/bin/bash
#SBATCH -J lapack_test       # Job name
#SBATCH -o lapack_test.out   # Standard output file
#SBATCH -e lapack_test.err   # Standard error file
#SBATCH -p serial_requeue    # Partition
#SBATCH -t 0-00:30           # Time limit (D-HH:MM)
#SBATCH -N 1                 # Number of nodes
#SBATCH -c 1                 # Number of cores
#SBATCH --mem=4000           # Memory per node in MB

# Load required modules
source new-modules.sh

# Run program
./lapack_test.x

Page 26: Choosing Resources Wisely - HARVARD UNIVERSITY

Examples

Parallel OpenMP (single-node) application

#!/bin/bash
#SBATCH -J omp_dot           # Job name
#SBATCH -o omp_dot.out       # Standard output file
#SBATCH -e omp_dot.err       # Standard error file
#SBATCH -p general           # Partition
#SBATCH -t 0-02:00           # Time limit (D-HH:MM)
#SBATCH -N 1                 # Number of nodes
#SBATCH -c 4                 # Number of cores
#SBATCH --mem=16000          # Memory per node in MB

# Set up environment
source new-modules.sh
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

# Run program
srun -c $SLURM_CPUS_PER_TASK ./omp_dot.x

Page 27: Choosing Resources Wisely - HARVARD UNIVERSITY

Examples

MATLAB Parallel Computing Toolbox (single-node) application

#!/bin/bash
#SBATCH -J parallel_monte_carlo       # Job name
#SBATCH -o parallel_monte_carlo.out   # Standard output file
#SBATCH -e parallel_monte_carlo.err   # Standard error file
#SBATCH -N 1                          # Number of nodes
#SBATCH -c 8                          # Number of cores
#SBATCH -t 0-03:30                    # Time limit (D-HH:MM)
#SBATCH -p general                    # Partition
#SBATCH --mem=32000                   # Memory per node in MB

# Load required software modules
source new-modules.sh
module load matlab/R2016a-fasrc01

# Run program
srun -n 1 -c 8 matlab-default -nosplash -nodesktop -r "parallel_monte_carlo;exit"

Page 28: Choosing Resources Wisely - HARVARD UNIVERSITY

Examples

Parallel MPI (multi-node) application

#!/bin/bash
#SBATCH -J planczos              # Job name
#SBATCH -o planczos.out          # Standard output file
#SBATCH -e planczos.err          # Standard error file
#SBATCH -p general               # Partition
#SBATCH -t 30                    # Time limit in minutes
#SBATCH -n 8                     # Number of tasks
#SBATCH --mem-per-cpu=4000       # Memory per core in MB

# Load required modules
source new-modules.sh
module load intel/15.0.0-fasrc01
module load openmpi/1.8.3-fasrc02

# Run program
srun -n 8 --mpi=pmi2 ./planczos.x

https://github.com/fasrc/User_Codes

Page 29: Choosing Resources Wisely - HARVARD UNIVERSITY

Test first

Before diving right into submitting 100s or 1000s of research jobs, ALWAYS test a few first:
• ensure the job will run to completion without errors
• ensure you understand the resource needs and how they scale with different data sizes and input options
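One quick sanity check before submitting at scale (a standard SLURM feature, not mentioned on the slides): sbatch --test-only validates the script and resource request without actually queueing anything.

sbatch --test-only my_job.sh   # prints an estimated start time; no job is submitted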

Page 30: Choosing Resources Wisely - HARVARD UNIVERSITY

Contact Information

Harvard Research Computing website:
http://rc.fas.harvard.edu

Email:
[email protected]
[email protected]

Office Hours:
Wednesdays, noon-3pm
38 Oxford Street, 2nd Floor Conference Room