
Page 1:

Introduction to Chimera
Minerva Scientific Computing Environment

https://hpc.mssm.edu

Patricia Kovatch
Eugene Fluder, PhD
Hyung Min Cho, PhD
Lili Gai, PhD
Francesca Tartaglione, MS
Dansha Jiang, PhD

Sep 18, 2019

Page 2:

Outline

● Chimera Computing Environments
  ○ Chimera account and logging in
  ○ Compute and storage resources
  ○ User software environment
  ○ Services for data transfer and archiving

● LSF (Load Sharing Facility): job scheduler
  ○ LSF introduction and basic LSF commands
  ○ Queue structure
  ○ Dependent jobs
  ○ Self-scheduler
  ○ Parallel jobs

2

Page 3:

Chimera cluster @ Mount Sinai

Chimera compute:
● 4x login nodes - Intel Skylake 8168 24C, 2.7 GHz - 384 GB memory
● 280 compute nodes - Intel 8168 24C, 2.7 GHz
  ○ 13,440 cores (48 per node, 2 sockets/node, 192 GB/node)
● 4x high-memory nodes - Intel 8168 24C, 2.7 GHz - 1.5 TB memory
● 48 V100 GPUs in 12 nodes - Intel 6142 16C, 2.6 GHz - 384 GB memory - 4x V100-16 GB GPUs per node

Chimera storage:
Total storage space: ~14 PB, including /sc/orga and /sc/hydra
● /sc/hydra: new file system on Chimera, used as primary storage
  ○ Has the same folders as /sc/orga (work, projects, scratch)
  ○ Use the system path environment variable in scripts
● /sc/orga is still mounted on Chimera
  ○ orga directories will be migrated to hydra

3

Page 4:

Minerva/Chimera Architecture

4

Page 5:

Logging in - General

Chimera is a Linux machine running CentOS 7.6.
● Linux is command-line based, not GUI
● Linux was developed using TTY devices. Commands are short and often cryptic, but there is usually a good reason.

Logging in requires an SSH client installed on your machine, a username, a memorized password, and a one-time code obtained from a physical/software token.
● SSH client: Terminal (Mac), PuTTY or MobaXterm (Windows)
● Apply for an account at https://hpc.mssm.edu/acctreq/
● Register your token at the Self Service Portal (https://register4vip.mssm.edu/vipssp/)
● Login info at https://hpc.mssm.edu/access/loggingin

5

Page 6:

Logging in - Linux / Mac

imac:~ gail01$ ssh -X gail01@chimera.hpc.mssm.edu
Last login: Wed Sep 4 10:45:21 2019 from 10.91.17.183

=== Send ticket to hpchelp@hpc.mssm.edu ===
2019/09/18 2pm-3:30pm Training class will be offered at Icahn building L3-82

gail01@li03c04: ~ $ pwd
/hpc/users/gail01

6

Connect to Chimera via ssh:
● ssh userid@chimera.hpc.mssm.edu
● To display graphics remotely on your screen, pass the "-X" or "-Y" flag:
  ○ ssh -X userid@chimera.hpc.mssm.edu
  ○ Mac: install XQuartz on your Mac first
● Open a terminal window on your workstation (Linux/Mac)
  ○ You land on one of the login nodes, in your home directory
    ■ Never run jobs on login nodes
    ■ They are for file management, coding, compilation, etc. only

Page 7:

Logging in - Windows

● Install MobaXterm from https://mobaxterm.mobatek.net/

○ Enhanced terminal for Windows with X11 server, tabbed SSH client, network tools and much more

OR

● Install PuTTY from www.putty.org
  ○ Google it. It will be the first hit.
  ○ https://www.youtube.com/watch?v=ma6Ln30iP08
● If you are going to be using GUIs:
  ○ In PuTTY: Connection > SSH > X11
    ■ Ensure "Enable X11 forwarding" is selected
  ○ On the Windows box, install Xming
    ■ Google; download; follow the bouncing ball
  ○ Test by logging into Minerva and running the command xclock
    ■ You should see a clock

7

Page 8:

Chimera Login summary

8

Users            Login method        Login servers                 Password Components
Sinai users      user1               @chimera.hpc.mssm.edu         Sinai Password + 6-digit
                 user1+vkrb          @minerva13.hpc.mssm.edu       Symantec VIP token code
                                     @minerva14.hpc.mssm.edu
External users   user1+yipa          (same servers as above)       HPC Password + YubiKey
                                                                   button push

Note: Round-robin load balancing is configured for chimera.hpc.mssm.edu; it distributes client connections across a group of login nodes.

4 new login nodes: minerva[11-14], which point to the login nodes li03c[01-04].

● minerva[11-12] (or li03c[01-02]) are external login nodes
  - will be available when the DMZ network is set up
● minerva[13-14] (or li03c[03-04]) are internal login nodes
  - currently available via Minerva and the MSSM campus
  ○ All users: hop directly from Minerva internal login nodes (passwordless):
    [jiangd03@minerva4 ~]$ ssh li03c03
    [jiangd03@minerva4 ~]$ ssh li03c04
  ○ Sinai users: direct login with two-factor authentication using any of the combinations above
    ■ Sinai network; VPN needed while off campus

Page 9:

Chimera Storage
● Storage is organized in folders and subfolders. In Linux, subfolders are separated by "/".
● There are roughly 4 kinds of folders you can have (possibly multiple project folders); a quick way to check your usage is shown after the list.

9

Home     /hpc/users/<userid>
  • 20 GB quota
  • Slow. Use for "config" files, executables... NOT DATA
  • NOT purged and is backed up

Work     /sc/hydra/work/<userid>
  • 100 GB quota
  • Fast; keep your personal data here
  • NOT purged but is NOT backed up

Scratch  /sc/hydra/scratch/<userid>
  • Free-for-all, shared wild west
  • Current size is about 100 TB
  • Purged every 14 days; limit per user is 15 TB

Project  /sc/hydra/projects/<projectid>
  • Check usage with: $ df -h /sc/hydra/projects/<projectid>
  • PIs can request project storage: submit an allocation request and get approval from the allocation committee (https://hpc.mssm.edu/hpc-admin/forms/allocation-request)
  • Not backed up
  • Incurs charges of $107/TiB/yr
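For example, to check how much space you are using (a generic sketch with standard Linux tools; substitute your own userid):

$ df -h /sc/hydra/work/<userid>        # file-system view of used and available space
$ du -sh /sc/hydra/work/<userid>/*     # per-subfolder usage inside your work area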

Page 10:

User Software Environment

OS: CentOS 7.6 with glibc-2.17 (GNU C library). Some key packages:

GCC: system default /usr/bin/gcc is gcc 4.8.5
  $ module load gcc           (default is 8.3.0)
Python: default version 3.7.3
  $ module load python        (loads python and all available python packages)
R: default version 3.5.3
  $ module load R             (loads R and all available R packages)
Perl: default system version 5.16.3
  $ module load CPAN
Anaconda3: default version 2018-12
  $ module load anaconda3
Schrodinger: 2019-1
  $ module load schrodinger
Matlab access:
  $ module load matlab
  - The license costs $100.00 per activation; request form at https://mountsinai.formstack.com/forms/2017mathworksacademiclicenserenewal

10

Page 11:

User Software Environment

Anaconda3:
● Supports minimal conda environments (such as tensorflow, pytorch, qiime)
  e.g., tensorflow (both CPU and GPU):
    $ module load anaconda3      (or anaconda2)
    $ module load cuda
    $ source activate tfGPU
● Users should install their own envs locally (see the activation sketch below), either:
  ➔ using the option -p PATH, --prefix PATH (full path to the environment location, i.e. the prefix):
    $ conda create python=3.x -p /sc/orga/work/gail01/conda/envs/myenv
  ➔ or setting envs_dirs and pkgs_dirs in the .condarc file to specify the directories in which environments and packages are located:
    $ conda create -n myenv python=3.x

11

$ cat ~/.condarc
envs_dirs:
  - /sc/orga/work/gail01/conda/envs
pkgs_dirs:
  - /sc/orga/work/gail01/conda/pkgs
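Once created, the environment can be activated by name or by full prefix path (a minimal sketch; myenv and the prefix are the illustrative values from above):

$ module load anaconda3
$ source activate myenv                                            # works when envs_dirs is set in ~/.condarc
$ source activate /sc/orga/work/gail01/conda/envs/myenv            # or activate by full prefix path
$ conda install -p /sc/orga/work/gail01/conda/envs/myenv numpy     # install packages into that environment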

Page 12:

User Software Environment: Lmod

Lmod module system implemented:
● Written in Lua, but it reads the TCL module files, and the module commands all work
● Search for all possible modules: $ module avail or $ module spider

Check all available R versions:
$ ml spider R
.......
R/3.3.1, R/3.4.0-beta, R/3.4.0, R/3.4.1, R/3.4.3_p, R/3.4.3, R/3.5.0, R/3.5.1_p, R/3.5.1, R/3.5.2, R/3.5.3

● Load/unload a module - ml
● Autocompletion with tab
● module save: Lmod provides a simple way to store the currently loaded modules and restore them later through named collections (see the example below)
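A short example of these commands (standard Lmod usage; the collection name "mymods" is arbitrary):

$ ml R gcc                 # ml is shorthand for "module load"
$ ml                       # with no arguments, lists the currently loaded modules
$ ml -gcc                  # a leading "-" unloads a module
$ module save mymods       # save the current set of modules as a named collection
$ module restore mymods    # restore that collection in a later session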

Page 13:

User Software Environment - some config

● You can load modules in your .bashrc script to load them on startup:

if [ $HOSTNAME != data2 ]; then
    module load projectPI python
    module load gcc
    module load allocations
fi

● You can create your own modules and modify MODULEPATH so they can be found (a minimal modulefile sketch follows below):

export MODULEPATH=/hpc/users/fludee01/mymodules:$MODULEPATH

or

module use /hpc/users/fludee01/mymodules
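For illustration, a minimal personal modulefile in the TCL format that Lmod also reads (a sketch only; the tool name and paths are made up):

$ mkdir -p /hpc/users/fludee01/mymodules/mytool
$ cat > /hpc/users/fludee01/mymodules/mytool/1.0 << 'EOF'
#%Module1.0
prepend-path PATH /hpc/users/fludee01/mytool/1.0/bin
EOF
$ module use /hpc/users/fludee01/mymodules
$ ml mytool/1.0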

13

Page 14:

Run applications in containers: Singularity

The Singularity tool is supported instead of Docker (security concern):
● Docker grants superuser privileges, so it is better suited to applications on VMs or cloud infrastructure

To use Singularity:
$ module load singularity

To pull a Singularity image:
$ singularity pull --name hello.simg shub://vsoch/hello-world

To pull a Docker image:
$ singularity pull docker://ubuntu:latest

To run a Singularity image (more usage examples below):
$ singularity run hello.simg      # or: $ ./hello.simg

Note: /tmp and your home directory are automatically mounted into the Singularity image. If you would like a hydra directory available inside the image as well, bind-mount it with -B:
$ singularity run -B /sc/hydra/project/xxx hello.simg

To build a new image from recipe files, use Singularity Hub or your local workstation:
● singularity build is not fully supported on Chimera because it requires sudo privileges
● After registering an account on Singularity Hub, you can upload your recipe, trigger the Singularity build, and download the image once it is built
● Convert Docker recipe files to Singularity recipe files:
$ ml python
$ spython recipe Dockerfile Singularity
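You can also open an interactive shell or run a single command inside the image (standard Singularity commands; the project path is a placeholder):

$ singularity exec hello.simg cat /etc/os-release                  # run one command inside the container
$ singularity shell -B /sc/hydra/projects/<projectid> hello.simg   # interactive shell with a project directory bound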

Page 15:

File Transfer

● On Chimera: use login nodes (33h) or interactive nodes (12h)
● On Minerva: use data4.hpc.mssm.edu for copying files from the internet
● Globus Online (preferred, when available):
  ○ Minerva endpoint: mssm#minerva
  ○ More information at http://www.globusonline.org
  ○ Use Globus Connect Personal to make your laptop an endpoint
● SCP, SFTP, rsync (examples below):
  ○ Good for relatively small files, not hundreds of TBs
  ○ Some scp apps for Windows/Mac use a cached password; this feature must be turned off
● Physical transport:
  ○ We do support copying physical hard drives on behalf of users
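For example, copying data from your workstation to your work directory with scp or rsync (generic commands; replace userid and paths with your own):

$ scp mydata.tar.gz userid@chimera.hpc.mssm.edu:/sc/hydra/work/userid/
$ rsync -avP mydir/ userid@chimera.hpc.mssm.edu:/sc/hydra/work/userid/mydir/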

15

Page 16:

Archiving Data: TSM Overview
● Data is kept for 6 years with two copies
● Can be accessed via either a GUI (dsmj) or the command line (dsmc); see the commands below
● Works only on internal login/data nodes, i.e., login1, login2, minerva4, minerva14, and data4; NOT on the external login node (i.e., minerva2)
● Large transfers can take a while. Use a screen session and disconnect to prevent time-outs
● Full details at https://hpc.mssm.edu/docs/archiving
● Collaboration account:
  ○ If your group needs a collaboration account for group-related tasks like archiving a project directory or managing a group website, contact us at hpchelp@hpc.mssm.edu
  ○ For more info see https://hpc.mssm.edu/access/collaboration

16

$ module load java
$ dsmj -se=userid      # GUI
$ dsmc -se=userid      # command line
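As an illustration, archiving and later retrieving a directory from the command line (standard TSM dsmc usage; the project path and destination are placeholders):

$ dsmc archive "/sc/hydra/projects/myproject/" -subdir=yes -se=userid
$ dsmc query archive "/sc/hydra/projects/myproject/" -subdir=yes -se=userid
$ dsmc retrieve "/sc/hydra/projects/myproject/" /sc/hydra/scratch/userid/restore/ -subdir=yes -se=userid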

Page 17:

Load Sharing Facility (LSF)

17

A Distributed Resource Management System

Page 18:

Distributed Resource Management System (DRMS)

● The goal is to achieve the best utilization of resources and maximize throughput for high-performance computing systems
● Controls usage of hard resources
  ○ CPU cycles; memory
● Can be decomposed into subsystems:
  ○ Job management; physical resource management; scheduling and queuing
● Widely deployed DRMSs:
  ○ Platform Load Sharing Facility (LSF)
  ○ Portable Batch System (PBS)
  ○ Simple Linux Utility for Resource Management (Slurm)
  ○ Others such as IBM LoadLeveler and Condor

18

Page 19:

LSF Job Lifecycle

1. submit a job

2. schedule the job

3. dispatch the job

19

https://www.ibm.com/

4. run the job

5. return output

6. send email to client (disabled on Chimera)

Page 20:

LSF Useful Commands

bhosts: displays hosts and their static and dynamic resources

20

Page 21:

bqueues: displays information about queues

21

$ bqueues -l interactive

QUEUE: interactive
  -- For interactive jobs

PARAMETERS/STATISTICS
PRIO NICE STATUS       MAX  JL/U JL/P JL/H NJOBS PEND RUN SSUSP USUSP RSV
100  0    Open:Active  -    -    -    -    4     0    4   0     0     0

Interval for a host to accept two jobs is 0 seconds

DEFAULT LIMITS:
  RUNLIMIT 120.0 min
MAXIMUM LIMITS:
  RUNLIMIT 720.0 min
.......

Page 22:

22

LSF: Queue structure in Chimera (bqueues)

Queue               Priority/APS   Wall time limit   Available resources
interactive         -              12 hours          4 nodes + 1 GPU node
normal (default)    100/APS        3 days            77 nodes
premium             200/APS        6 days            200 nodes + 2 high-mem
express             -              12 hours          280 nodes
long                100/APS        2 weeks           2 dedicated high-mem nodes (96 cores)
gpu                 100/APS        6 days            48 V100 GPUs
private             -              unlimited         private nodes

*default memory: 3000 MB per core

Page 23:

bsub - submit a job to LSF (interactive and batch)

Interactive jobs:
● Set up an interactive environment on compute nodes for users
● Useful for testing and debugging jobs
● Interactive GPU nodes are available for job testing

23

bsub -XF -P acc_hpcstaff -q interactive -n 1 -W 2:00 -R rusage[mem=3000] -Is /bin/bash
● -q : the queue from which to get the nodes
● -Is : interactive terminal/shell
● -n : the total number of compute cores (job slots) needed
● -R : resource request for a compute node
● -XF : X11 forwarding
● /bin/bash : the shell to use

gail01@li03c03: ~ $ bsub -XF -P acc_hpcstaff -q interactive -n 1 -W 2:00 -R rusage[mem=3000] -Is /bin/bash
Job <2916837> is submitted to queue <interactive>.
<<ssh X11 forwarding job>>
<<Waiting for dispatch ...>>
<<Starting on lc02a29>>

Page 24:

bsub - batch job

24

Create a job script containing the job info and commands, then submit it to LSF:
  bsub [options] < my_batch_job   or   bsub [options] my_batch_job
  ○ With "<", bsub will interpret the #BSUB cookies in the script
  ○ Options on the command line override what is in the script

gail01@li03c03: ~ $ cat myfirst.lsf
#!/bin/bash
#BSUB -J myfirstjob          # Job name
#BSUB -P acc_hpcstaff        # allocation account
#BSUB -q premium             # queue
#BSUB -n 1                   # number of compute cores
#BSUB -W 6:00                # walltime in HH:MM
#BSUB -R rusage[mem=4000]    # 4 GB of memory requested
#BSUB -o %J.stdout           # output log (%J : JobID)
#BSUB -eo %J.stderr          # error log
#BSUB -L /bin/bash           # Initialize the execution environment

module load gcc
which gcc
echo "Hello Chimera"

gail01@li03c03: ~ $ bsub < myfirst.lsf
Job <2937044> is submitted to queue <premium>.

Submit the script with the bsub command:

bsub < myfirst.lsf

Page 25:

bjobs - status of jobs

● Check your own jobs: $ bjobs

● Check all jobs: $bjobs -u all

● Long format with option -l (example below)

25

JOBID   USER     JOB_NAME     STAT QUEUE   FROM_HOST EXEC_HOST SUBMIT_TIME  START_TIME   TIME_LEFT
2845103 beckmn01 *>junkK.432  RUN  premium regen2    lc02e24   Sep  9 21:19 Sep 10 14:25 23:57 L
2845113 beckmn01 *>junkK.442  RUN  premium regen2    lc02e24   Sep  9 21:19 Sep 10 14:26 23:58 L
2845088 beckmn01 *>junkK.417  RUN  premium regen2    lc04a10   Sep  9 21:18 Sep 10 14:23 23:55 L
2845089 beckmn01 *>junkK.418  RUN  premium regen2    lc04a10   Sep  9 21:18 Sep 10 14:23 23:55 L
2845090 beckmn01 *>junkK.419  RUN  premium regen2    lc04a10   Sep  9 21:18 Sep 10 14:23 23:55 L
2845091 beckmn01 *>junkK.420  RUN  premium regen2    lc04a10   Sep  9 21:18 Sep 10 14:23 23:55 L
2845092 beckmn01 *>junkK.421  RUN  premium regen2    lc04a10   Sep  9 21:18 Sep 10 14:23 23:55 L
2845093 beckmn01 *>junkK.422  RUN  premium regen2    lc04a10   Sep  9 21:18 Sep 10 14:23 23:55 L
........

gail01@li03c03: ~ $ bjobs
JOBID   USER   JOB_NAME   STAT QUEUE   FROM_HOST EXEC_HOST SUBMIT_TIME  START_TIME TIME_LEFT
2937044 gail01 myfirstjob PEND premium li03c03   -         Sep 10 14:38 -          -
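To see the full record for a single job, use the long format mentioned above (standard bjobs usage; the job ID is the one submitted earlier):

$ bjobs -l 2937044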

Page 26:

bmod - modify submission options of pending jobs

bmod takes similar options to bsub:
● bmod -R rusage[mem=20000] <jobID>
● bmod -q express <jobID>

26

bpeek <jobID>

bpeek - display output of the job produced so far

gail01@li03c03: ~ $ bpeek 2937044
<< output from stdout >>
"Hello Chimera"

<< output from stderr >>

gail01@li03c03: ~ $ bmod -q express 2937044
Parameters of job <2937044> are being changed

Page 27:

bkill - kill jobs in the queue

Lots of ways to get away with murder

bkill <job ID>

Kill by job id:
  bkill 765814

Kill by job name:
  bkill -J myjob_1

Kill a bunch of jobs:
  bkill -J myjob_*

Kill all your jobs:
  bkill 0

27

Page 28:

bhist - historical information

28

Page 29:

Dependent Job

Any job can be made dependent on other LSF jobs. Syntax:
  bsub -w 'dependency_expression'
The dependency expression is usually based on the job states of preceding jobs (more examples below).

bsub -J myJ < myjob.lsf

bsub -w 'done(myJ)' < dependent.lsf
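Other standard dependency conditions can be combined as well (standard LSF dependency syntax; job and script names are illustrative):

bsub -J jobA < stepA.lsf
bsub -J jobB < stepB.lsf
bsub -w 'done(jobA) && done(jobB)' < merge.lsf     # run only after both jobs complete successfully
bsub -w 'ended(jobA)' < cleanup.lsf                # run after jobA ends, regardless of exit status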

29

Page 30:

Self-scheduler
● Submit large numbers of independent serial jobs as a single batch
  ○ It is mandatory for short batch jobs of less than ca. 10 minutes
  ○ Submitted individually, these jobs put a heavy load on the LSF server and will be killed

30

#!/bin/bash

#BSUB -q express

#BSUB -W 00:20

#BSUB -n 12

#BSUB -J selfsched

#BSUB -o test01

module load selfsched # load the selfsched module

mpirun -np 12 selfsched < test.inp # 12 cores, with one master process

$ PrepINP < templ.txt > test.inp      (InputForSelfScheduler)
$ cat templ.txt
1 10000 2 F                           ← start, end, stride, fixed field length?
/my/bin/path/Exec_# < my_input_parameters_# > output_#.log

$ cat test.inp                        (a series of job commands)
/my/bin/path/Exec_1 < my_input_parameters_1 > output_1.log
/my/bin/path/Exec_3 < my_input_parameters_3 > output_3.log
.
/my/bin/path/Exec_9999 < my_input_parameters_9999 > output_9999.log

Page 31:

Job submission script example: selfsched.lsf

#!/bin/bash
#BSUB -J myMPIjob            # Job name
#BSUB -P acc_bsr3101         # allocation account
#BSUB -q express             # queue
#BSUB -n 64                  # number of compute cores
#BSUB -R span[ptile=4]       # 4 cores per node
#BSUB -R rusage[mem=4000]    # 256 GB of memory (4 GB per core)
#BSUB -W 00:20               # walltime (20 min.)
#BSUB -o %J.stdout           # output log (%J : JobID)
#BSUB -eo %J.stderr          # error log
#BSUB -L /bin/bash           # Initialize the execution environment

echo "Job ID : $LSB_JOBID"
echo "Job Execution Host : $LSB_HOSTS"
echo "Job Sub. Directory : $LS_SUBCWD"

module load python
module load selfsched
mpirun -np 64 selfsched < JobMixPrep.inp > JobMixPrep.out 2>&1

31

Page 32:

Parallel Job

● Array job: parallel analysis of multiple instances of the same program
  ○ Executes on multiple data files simultaneously
  ○ Each instance runs independently
● Distributed memory program: message passing between processes (e.g. MPI)
  ○ Processes execute across multiple CPU cores or nodes
● Shared memory program (SMP): multi-threaded execution (e.g. OpenMP)
  ○ Runs across multiple CPU cores on the same node
● GPU programs: offloading to the device via CUDA

32

Page 33:

Array Job
● Groups of jobs with the same executable and resource requirements, but different input files
  ○ -J "Jobname[index | start-end:increment]"
  ○ The job index range is 1 ~ 10,000
  ○ A sketch of using the index to pick per-task input files follows the script below

33

gail01@li03c03 $ bsub < myarrayjob.sh
Job <2946012> is submitted to queue <express>.
gail01@li03c03: ~ $ bjobs
JOBID   USER   JOB_NAME      STAT QUEUE   FROM_HOST EXEC_HOST SUBMIT_TIME  START_TIME TIME_LEFT
2946012 gail01 *rraytest[1]  PEND express li03c03   -         Sep 10 14:50 -          -
2946012 gail01 *rraytest[2]  PEND express li03c03   -         Sep 10 14:50 -          -
2946012 gail01 *rraytest[3]  PEND express li03c03   -         Sep 10 14:50 -          -
2946012 gail01 *rraytest[4]  PEND express li03c03   -         Sep 10 14:50 -          -
2946012 gail01 *rraytest[5]  PEND express li03c03   -         Sep 10 14:50 -          -
2946012 gail01 *rraytest[6]  PEND express li03c03   -         Sep 10 14:50 -          -
2946012 gail01 *rraytest[7]  PEND express li03c03   -         Sep 10 14:50 -          -
2946012 gail01 *rraytest[8]  PEND express li03c03   -         Sep 10 14:50 -          -
2946012 gail01 *rraytest[9]  PEND express li03c03   -         Sep 10 14:50 -          -
2946012 gail01 *raytest[10]  PEND express li03c03   -         Sep 10 14:50 -          -

#!/bin/bash
#BSUB -P acc_hpcstaff
#BSUB -n 1
#BSUB -W 02:00
#BSUB -q express
#BSUB -J "jobarraytest[1-10]"
#BSUB -o logs/out.%J.%I
#BSUB -e logs/err.%J.%I

echo "Working on file.$LSB_JOBINDEX"
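A common pattern is to use the array index to select a per-task input file (an illustrative sketch; the input naming scheme is made up):

INPUT=/sc/hydra/work/<userid>/inputs/sample_${LSB_JOBINDEX}.fastq   # task-specific input chosen by the index
echo "Working on $INPUT"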

Page 34:

MPI Jobs

● This example requests 48 cores for 2 hours in the "normal" queue.

34

#!/bin/bash
#BSUB -J myjobMPI
#BSUB -P YourAllocationAccount
#BSUB -q normal
#BSUB -n 48
#BSUB -W 02:00
#BSUB -o %J.stdout
#BSUB -eo %J.stderr
#BSUB -L /bin/bash

cd $LS_SUBCWD

module load openmpi

mpirun -np 48 /my/bin/executable < my_data.in

Page 35:

Multithreaded Job - OpenMP
● Multiple CPU cores within one node using shared memory
  ○ In general, a multithreaded application uses a single process which spawns multiple threads of execution
  ○ It is highly recommended to set the number of threads equal to the number of requested compute cores
● Your program needs to be written to use multi-threading

35

#!/bin/bash
#BSUB -J myjob
#BSUB -P YourAllocationAccount
#BSUB -q alloc
#BSUB -n 4
#BSUB -R "span[hosts=1]"
#BSUB -R rusage[mem=12000]
#BSUB -W 01:00
#BSUB -o %J.stdout
#BSUB -eo %J.stderr
#BSUB -L /bin/bash

cd $LS_SUBCWD
export OMP_NUM_THREADS=4    # sets the number of threads
/my/bin/executable < my_data.in

Page 36:

Job submission script example: star.lsf

#!/bin/bash
#BSUB -J mySTARjob           # Job name
#BSUB -P acc_PLK2            # allocation account
#BSUB -q premium             # queue
#BSUB -n 8                   # number of compute cores
#BSUB -W 12:00               # walltime in HH:MM
#BSUB -R rusage[mem=4000]    # 32 GB of memory (4 GB per core)
#BSUB -R span[hosts=1]       # all cores from one node
#BSUB -o %J.stdout           # output log (%J : JobID)
#BSUB -eo %J.stderr          # error log
#BSUB -L /bin/bash           # Initialize the execution environment

module load star
WRKDIR=/sc/orga/projects/hpcstaff/benchmark_star
STAR --genomeDir $WRKDIR/star-genome --readFilesIn Experiment1.fastq --runThreadN 8 --outFileNamePrefix Experiment1Star

36

Submit the script with the bsub command:

bsub < star.lsf

Page 37:

GPGPU (General Purpose Graphics Processing Unit)
● GPGPU resources on Chimera

○ GPU interactive nodes

○ GPU batch nodes

37

#BSUB -q gpu                             # submit to the gpu queue
#BSUB -n Ncpu                            # Ncpu is 1~32 on v100
#BSUB -R v100                            # request specified gpu node v100
#BSUB -R "gpu rusage[ngpus_excl_p=2]"    # the number of GPUs requested per node (1 by default)

module purge
module load anaconda3 (or 2)             # to access tensorflow
module load cuda                         # to access the drivers and supporting subroutines
source activate tfGPU

python -c "import tensorflow as tf"

Partition          V100        P100
number of nodes    12          1
GPU cards          4           4
CPU cores          32          20 (Intel Broadwell)
host memory        384 GB      128 GB
GPU memory         16 GB       12 GB

Page 38:

LSF: summary of job submission examples

Interactive sessions:

# interactive session
$ bsub -P acc_hpcstaff -q interactive -n 1 -W 00:10 -Is /bin/bash

# interactive GPU nodes; the flag "-R v100" is required
$ bsub -P acc_hpcstaff -q interactive -n 1 -R v100 -R rusage[ngpus_excl_p=1] -W 01:00 -Is /bin/bash

Batch job submission:

# simple standard job submission
$ bsub -P acc_hpcstaff -q normal -n 1 -W 00:10 echo "Hello World"

# GPU job submission if you don't mind the GPU card model
$ bsub -P acc_hpcstaff -q gpu -n 1 -R rusage[ngpus_excl_p=1] -W 00:10 echo "Hello World"

# the flag "-R v100" is required if you want a specific GPU card (v100/p100)
$ bsub -P acc_hpcstaff -q gpu -n 1 -R v100 -R rusage[ngpus_excl_p=1] -W 00:10 echo "Hello World"

# himem job submission; the flag "-R himem" is required
$ bsub -P acc_hpcstaff -q premium -n 1 -R himem -W 00:10 echo "Hello World"

38

Page 39:

Tips for efficient usage of the queuing system
● Find the appropriate queue and nodes
  ○ Use -q interactive for debugging (both CPU and GPU)
  ○ Use -q express if walltime < 12h
  ○ Use a himem node for memory-intensive jobs
● Request reasonable resources
  ○ Prior knowledge is needed (run a test program and use top or other tools to monitor it)
  ○ Keep it simple
● Job not starting after a long pending time?
  ○ Check whether the requested resources actually exist on any node, e.g. -R rusage[mem=10000] with -n 20
  ○ The requested walltime may run into a scheduled maintenance (PM) window: e.g. -W 144h
● If you run out of memory
  ○ Think about shared memory vs. distributed memory jobs
  ○ Use -R span[hosts=1] where needed

gail01@li03c03: ~ $ bhosts -R himem
HOST_NAME  STATUS  JL/U  MAX  NJOBS  RUN  SSUSP  USUSP  RSV
lh03c02    ok      -     48   2      2    0      0      0
lh03c01    ok      -     48   0      0    0      0      0
lh03c04    ok      -     48   15     15   0      0      0
lh03c03    closed  -     48   0      0    0      0      0

Page 40:

Final Friendly Reminder
● Never run jobs on login nodes
  ○ They are for file management, coding, compilation, etc. only
● Never run jobs outside LSF
  ○ Fair sharing
  ○ The scratch disk is not backed up; make efficient use of limited resources
  ○ Old files will automatically be deleted without notification

● Logging onto compute nodes is no longer allowed.

● Follow us at https://hpc.mssm.edu/ for weekly updates, and on Twitter

● An acknowledgement of Scientific Computing at Mount Sinai should appear in your publications

40

Page 41:

Last but not Least

Got a problem? Need a program installed? Send an email to: hpchelp@hpc.mssm.edu

41
