Accelerate Machine Learning Leveraging HPC to...user Yes No Yes Yes Scratch Space /scratch 20 TB/1M...

Leveraging HPC to Accelerate Machine Learning

Weiguang GuanSHARCNet/CCguanw@sharcnet.ca

NextAI 2019

✅ To learn how to efficiently train DNN on HPC clusters

● Data I/O

● CPUs vs GPUs

● Multiple GPUs on a single node

● Distributed multi-GPU across multiple nodes

❌ Not a tutorial on

● Machine learning

NextAI 2019

Outline

● Introduction to HPC clusters in Canada

● Overview of distributed parallel NN training

● Regular NN training ⇨ distributed NN training on a single node○ Through different APIs: Keras, Estimator○ Performance benchmarks of using single/multiple CPUs/GPUs

● Distributed NN training across cluster (HOROVOD)

NextAI 2019

Compute Canada (CC)

NextAI 2019

Resources of Compute Canada

● Beluga (Calcul Quebec)

● Graham (SHARCNet)

● Cedar (Westgrid)

● Niagara (SciNet)

● Clouds○ East cloud○ Arbutus○ Graham cloud

NextAI 2019

Graham cluster operated by SHARCNet

● 884 base nodes (32 cpu cores, 125GB memory)

● 56 nodes (32 cpu cores, 250GB memory)

● 160 GPU nodes (32 cpu cores, 124GB memory, 2 nVidia Pascal

p100 GPUs)

● 7 AI GPU nodes ( 28 cpu cores, 178GB memory, 8 nVidia Volta

v100 GPUs)

NextAI 2019

Architecture of cluster

NextAI 2019

Computing environment

● Operating systems: Linux (64-bit CentOS)

● Languages: C/C++, Fortran, Matlab/Octave, Python, R, Java, etc.

● Parallel development support: MPI, pthreads, OpenMP, CUDA,

OpenACC, OpenCL

● Software modules: select pre-built and configured software, as

well as versions, with the module commands

● Job Scheduling: SLURM scheduler

NextAI 2019

File system Quotas Backed up? Purged? Available by Default?

Mounted on Compute Nodes?

Home Space

50 GB/0.5M files per

Yes No Yes Yes

Scratch

/scratch

20 TB/1M files per

user. Extendable to

100 TB

No Yes, all files

older than

60 days

Yes Yes

Project

/project

1 TB and 0.5M files

per group, can request

increase to 10 TB

Yes No Yes Yes

Local disk >960GB SSD No No Transient Per node

NextAI 2019

File systems

● /home (nfs)

● /project (lustre)

● /scratch (lustre)

● Fast local disk: $SLURM_TMPDIR (SSD or RAMDisk)

Reading data from one 16MB file on lustre is enormously faster

than from 400 40KB files

● 👍 Sequential I/O of large datasets

● 👎 I/O of many small datasets or random I/O of large datasets

NextAI 2019

Lustre file system

NextAI 2019

File striping in Lustre FS

NextAI 2019

File striping in Lustre FS (cont.)

Lfs getstripe <file_name>

Lfs setstripe --stripe_size=1m --stripe_count=2 <file_name>

NextAI 2019

Use local disk for data I/O

Local disks on compute nodes are SSD

● /project or /scratch space → $SLURM_TMPDIR

● Loading data from $SLURM_TMPDIR

● Computing

● Saving intermediate/final results to $SLURM_TMPDIR

● $SLURM_TMPDIR → /project or /scratch space

NextAI 2019

Job Scheduler --- Slurm

● sbatch abc.sh: to submit a job script

● squeue: to list the status of submitted jobs

● sacct : to show details of recent jobs

● scancel: to kill jobs

NextAI 2019

Job script example (abc.sh)

#!/bin/bash

#SBATCH --time=0-10:05 # Run time limit (DD-HH:MM)

#SBATCH --account=def-user # account

#SBATCH --ntasks=4 # Number of MPI processes, default 1

#SBATCH --cpus-per-task=32 # Normally defined for threaded jobs

#SBATCH --gres=gpu:2 # number of GPUs

#SBATCH --mem=120G # memory request

srum ./myprog

NextAI 2019

Distributed Parallel DNN training

● Overview

● Distributed DNN training on a single node

● Distributed DNN training across multiple nodes

NextAI 2019

NN training

● Feed forward

● Back propagation

● Apply gradients to update parameters (or variables) of the model

NextAI 2019

Two different parallelisms

● Data parallelism○ Data are divided and each chunk is sent to each device

● Model parallelism○ Model (graph) are divided, each partition is sent to each device

NextAI 2019

Synchronous data parallelism

NextAI 2019

Asynchronous data parallelism

NextAI 2019

Overview of distributed strategies

● Synchronous distributed training○ Mirrored ○ Central storage○ Multi-worker mirrored○ TPU (same as Mirrored except for its own all-reduce)

● Asynchronous distributed training○ Parameter server

NextAI 2019

● Data parallelism

● Model replica per

data ...

...Mirrored strategy

NextAI 2019

data ...

Central storage strategy

data ...

NextAI 2019

Each worker

● Computes gradient on subset of

image data

● Pushes the gradient to the server

● Pulls new parameter from server

Each server

● Aggregates gradients

● Updates parameters

Parameter server strategy

NextAI 2019

Your choice of distributed NN training

● Using one of the above strategies

● Using custom distributed one:

with tf.device(“GPU:0”) : …… Code block that is run on GPU:0 …… ……

with tf.device(“GPU:1”) : …… Code block that is run on GPU:1 ……

NextAI 2019

The mirrored strategy

● How to create NN with TF’s different APIs

● How to change them to use multiple GPUs on the same node

● Benchmarking○ Multiple CPUs○ Single GPU (p100, v100)○ Multiple GPUs on single node○ Distributed multiple GPUs across multiple nodes

NextAI 2019

Benchmarking

● Tool: Tensorflow○ 1.13○ 2.0 Beta (tf_upgrade_v2 --infile tensorfoo.py --outfile

tensorfoo-upgraded.py)

● Machine learning example○ Task: Recognition of handwritten digits○ Data: MNIST

NextAI 2019

Benchmarks on machine learning

● Using recognition of mnist handwritten digits as an example

NextAI 2019

Handwritten digits recognition codes

● low-level API

● Keras

● estimator

NextAI 2019

Benchmark of multiple CPUs

Batch size = 50Iterations = 20000

# of CPUs Time (sec)

1 2854.43

2 2295.09

4 1582.72

8 1153.90

16 762.10

32 1187.92

NextAI 2019

Benchmark of multiple CPUs (cont.)

# of CPUs Time (sec)

1 11193.94

2 8757.93

4 6645.16

8 3838.28

16 2559.46

32 2336.07

NextAI 2019

Benchmark of single GPU

Batch size = 256, Iterations = 20000

nVidia Pascal p100 GPU

● ~157 seconds (low-level API)

● ~133 seconds (Keras)

● ~185 seconds (estimator)

1 p100 GPU is ~15 times faster than 16 CPU cores (Intel E5-2683

v4 Broadwell)

NextAI 2019

Benchmark of single GPU (cont.)

Batch size = 256, Iterations = 20000

nVidia Pascal p100 GPU nVidia Volta v100 GPU

low-level API 157 103

Keras 135 128

Estimator 185 133

NextAI 2019

More GPUs we use,

the faster training will be?

NextAI 2019

Convert to Multi-GPU version

● low-level API

● Keras

● estimator

NextAI 2019

Benchmark of multiple GPUs (Keras)

# of GPUs

Time (sec)

p100 v100

1 135 128

2 161 164

NextAI 2019

Benchmark of multiple GPUs (Estimator)

# of GPUs

Time (sec)

p100 v100

1 185 133

2 252 231

8 338* Batch_size must be multiple of #GPUs

NextAI 2019

More GPUs we use,

the faster training will be?

No! Not always. Why?

NextAI 2019

Hypothesis: The computing task of the MNIST example (~3 million

trainable parameters) is far below the capacity of 1 GPU.

● Increase the scale of the NN

● Increase the batch size

NextAI 2019

Benchmark of multiple GPUs (Keras)

# of GPUs

Time (sec)

1 1203

# of GPUs

Time (sec)

NextAI 2019

Benchmark of multiple GPUs (Estimator)

# of GPUs

Time (sec)

1 2791

2 1246

# of GPUs

Time (sec)

NextAI 2019

Hypothesis: The computing task is far below the capacity of 1 GPU.

● Increase the scale of the NN

● Increase the batch size

NextAI 2019

ResNet-50

● 50 layers with default image size = 224 (23 million trainable

parameters)

● git clone https://github.com/tensorflow/models.git

● Training parameters○ Use synthetic data instead of real image data○ Batch_size = 128○ Number of GPUs = 1, 2, 4, 8

NextAI 2019

Benchmark of ResNet-50

Batch size = 128

# of GPUs

Examples/sec

8 1132

NextAI 2019

Orchestra in play

Horovod --- Multi-GPU multi-node parallelism by Uber AI team

● https://github.com/uber/horovod

● TF’s MultiWorkerMirroredStrategy

NextAI 2019

Horovod --- Multi-GPU Multi-node parallelism

NextAI 2019

Ratio of workers vs parameter servers

NextAI 2019

Baidu’s ring-allreduce (2017)

NextAI 2019

NCCL ring-allreduce

● NCCL 1: ring-allreduce across multiple GPUs on a node

● NCCL 2: ring-allreduce across multiple nodes

NextAI 2019

Tensor fusion

Combination of multiple small tensors into a big one

NextAI 2019

Horovod --- Multiple GPUs on multiple nodes

● Load python modules○ module load python/3.7 ○ Module load cuda cudnn○ module load scipy-stack/2019a

● Create python virtual environment○ export ENV=$HOME/nextai-env○ virtualenv --no-download $ENV○ source $ENV/bin/activate○ pip install --upgrade pip

NextAI 2019

Horovod --- Multiple GPUs on multiple nodes (cont.)

● Install Tensorflow in the virtual environment○ pip install tensorflow_gpu --no-index

● Download/install NCCL2○ https://developer.nvidia.com/nccl/nccl-download○ setrpaths.sh --path /path/to/nccl2 --add_origin

● Install Horovod in the virtual environment○ export HOROVOD_CUDA_HOME=$CUDA_HOME○ export HOROVOD_NCCL_HOME=/path/to/nccl2○ export HOROVOD_GPU_ALLREDUCE=NCCL○ pip install --no-cache-dir horovod

NextAI 2019

Horovod --- Multiple GPUs on multiple nodes (cont.)

● Clone TF benchmark repo:

○ git clone https://github.com/tensorflow/benchmarks.git

● Run the benchmarking script○ mpirun -np $((SLURM_NTASKS/16)) -bind-to core -map-by

slot:PE=16 -report-bindings -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -mca pml ob1 -mca btl ^openib python -u scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model resnet50 --batch_size 32 --variable_update horovod

NextAI 2019

Benchmarks with ResNet50

Images/sec: (ResNet-50 with batch size=32 per GPU)

# of Nodes # of GPUs Images per second

1 2 322.93

2 4 629.01

4 8 1238.57

8 16 2443.49

16 32 4825.47

32 64 9318.20 (10143.52)

NextAI 2019

Benchmarks with ResNet50

NextAI 2019

Thank you!

Any questions?

Accelerate Machine Learning Leveraging HPC to...user Yes No Yes Yes Scratch Space /scratch 20 TB/1M...

Documents

Scratch Training Part Ileonghw/Outreach/2010-05-Scratch/Scratch-CXA-2010-P1.pdfOutline • About CXA Junior Category • Introduction to Scratch & Quick Demo • Scratch Training (Part

PowerTrace - lauterbach.com · ceva-xs1200 yes yes cortex-a15 yes yes yes yes cortex-a15mpcore yes yes yes yes cortex-a5 yes yes yes yes cortex-a5mpcore yes yes yes yes cortex-a7

Scratch - T.E.I. Kavaxis.teikav.edu.gr/.../scratch/Palaigeorgiou_L2.Scratch.pdf– Το Scratch υποστηρίζει διερεύνηση και επίλυση προβλημάτων

website-hosting-comparison - Ez Web ManifestingWebsite Hosting Comparison EZ Web Company Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes 150,000 files Yes Yes Yes Unlimited $9.99 BlueHost

V Forge Contents - Amazon Web Services...Scratch - stop Scratch - switchthaclock1 Scratch - take1 Scratch - take2 Scratch - thesebeats1 Scratch - thesebeats2 Scratch - thesebeats3

Author: Scratch! Scratch! Scratch!...Manu tries and tries not to SCRATCH! SCRATCH! She plays snakesand ladders. She dreams up many stories. She forgets to SCRATCH! SCRATCH! 11/13 Slowly,

37-B 38-A 38-B 40.A 100 500 250 100 Yes Yes Yes Yes No No Yes Yes No Yes No Yes Yes Yes Yes Yes yes Yes Yes Yes Yes No No No Yes No Yes Yes Yes Smt Shabana Parveen Sh_ Sh_ Masque Masjid

HCU DOUNGLEG610 G610-C00 Qualcomm MSM8625Q 2013 Yes Yes Yes - - Yes Yes Yes Yes Yes - Yes - Ascend G610 G610-T00 MTK MT6589 2012 Yes Yes - - - Yes - Yes Yes Yes - Yes - Ascend G610

From scratch... with Scratch

What is Scratch? Why use Scratch in the Classroom? Getting Started with Scratch? Scratch Examples Create

TRANSITIONAL PROVISIONS - …northexcastudycircle.com/Image/GST_TP_BY_CA_Anoop_Gupta.pdfLuxury Tax Yes Yes Yes Yes VAT Yes Yes Yes Yes Service tax Yes Yes Yes Yes Excise Yes Yes Yes

101-6-ctohobeads.net/common/pdf/101-6-c.pdf · rg-g62 la-goo tb-789 tb-12gd tb-e50 tb-123 tb-211 tb-2 tb-103 tb-162 tb-94b tb-94g tb-2110 tb-2156 tb-2c tb-1c3c tb-162c . tb-2152 tb-Ž152s

sbmjckgf.in 4.1.1/4.1.1_Any Additional Information.pdf · working condition yes yes yes yes yes yes yes yes yes yes yes yes yes yes yes yes yes yes yes yes yes name of the equipment

Grupo Trabajo Scratch: Profundizando con Scratch

· M. Irfan No No no Yes agreed Yes Yes No no 15000 Yes Ashraf Ali Yes yes yes yes yes yes Yes Yes yes 32786 yes Feed Well Yes Yes Yes Yes agreed Yes Yes Yes Yes 26000 Yes Al-Wahid

TeamUnify...MELISSA GRISWOLD, HOLLIE Girls Pree Senior 24.Ø2 JERRY 187 .34 .38 .39 42 47 .51 69 95 31. 13 31. 31 .95 32. Scratch Scratch Scratch Scratch Scratch Scratch Scratch Scratch

OWB2ODI - smart- · PDF fileOWB Oracle OWB2ODI Modules Yes Yes Locations Yes Yes Data Objects table Yes Yes view Yes Yes materialized view Yes Yes external table Yes Yes file Yes Yes

PowerTrace - LAUTERBACH DEVELOPMENT · PDF file200 mhz powerpc ... efm32gg230f yes yes yes yes efm32gg232f yes yes yes yes efm32gg280f yes yes yes yes efm32gg290f yes yes yes yes efm32gg295f

Scratch Projektideen - ilearnit.chilearnit.ch/download/ScratchProjektideenScratch2.pdf · Scratch Projektideen Musikinstrumente mit Scratch 5 Worum geht es? Mit Scratch lassen sich

Scratch для детей. Самоучитель по программированию ... · Scratch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .