Sylvain Jeaugey NCCL 2.0



Page 1:

Sylvain Jeaugey

NCCL 2.0

Page 2:

2

DEEP LEARNING ON GPUS: Making DL training times shorter

[Diagram: Multi-core CPU → GPU (CUDA) → Multi-GPU (NCCL 1) → Multi-GPU, multi-node (NCCL 2)]

Deeper neural networks, larger data sets… training is a very, very long operation!

Page 3:

3

NCCL: A multi-GPU communication library

Within a system: PCIe, NVLink, GPU Direct P2P

To other systems: Sockets (Ethernet), Infiniband with GPU Direct RDMA

Page 4:

4

NCCL Architecture

[Stack diagram: Deep Learning Frameworks (Caffe, Caffe2, Torch, TF, MXNET, CNTK) on top of NCCL, CUDNN and CUBLAS, running on CUDA and NVIDIA GPUs]

Page 5:

5

AGENDA

NCCL: History, Design

NCCL 2.0: New features, API changes

Performance

Future

Page 6:

6

HISTORY

Q4 2015: NCCL 1.x

Open-source research project on GitHub, helping Deep Learning frameworks compute on multiple GPUs with efficient collective operations.

Limited to intra-node.

Q2 2017: NCCL 2.x and beyond

NVIDIA library, with multi-node support and an improved API.

Page 7:

7

DESIGN: What is NCCL?

Optimized collective communication library between CUDA devices.

Easy to integrate into any DL framework, as well as traditional HPC apps using MPI.

Runs on the GPU using asynchronous CUDA kernels, for faster access to GPU memory, parallel reductions, and NVLink usage.

Operates on CUDA pointers. Operations are tied to a CUDA stream.

Uses as few threads as possible, to permit other computation to progress simultaneously.

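A minimal sketch of what this looks like from the caller's side, assuming a communicator and stream have already been created (the wrapper name and the in-place float/sum choice are illustrative, not from the slides):

#include <cuda_runtime.h>
#include <nccl.h>

/* Sum-reduce `count` floats in place across all ranks of `comm`.
   ncclAllReduce only enqueues work on `stream`; the device buffer d_data
   holds the result once the stream has been synchronized. */
ncclResult_t allreduce_inplace(float *d_data, size_t count,
                               ncclComm_t comm, cudaStream_t stream)
{
  ncclResult_t res = ncclAllReduce(d_data, d_data, count, ncclFloat, ncclSum,
                                   comm, stream);
  if (res != ncclSuccess) return res;
  cudaStreamSynchronize(stream);  /* only needed when the host must see the result now */
  return ncclSuccess;
}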

Page 8:

8

DESIGN: Rings

NCCL uses rings to move data across all GPUs and perform reductions.
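A conceptual host-side sketch of the ring all-reduce pattern (reduce-scatter followed by all-gather), not NCCL's actual GPU implementation; the 4-rank ring and per-rank values are illustrative:

#include <stdio.h>

#define N 4   /* number of GPUs in the ring (assumed) */

int main(void) {
  /* buf[r][c] is chunk c held by rank r; every rank starts with the value r+1 in each chunk */
  float buf[N][N];
  for (int r = 0; r < N; r++)
    for (int c = 0; c < N; c++)
      buf[r][c] = (float)(r + 1);

  /* Reduce-scatter: after N-1 steps, rank r holds the fully reduced chunk (r+1) % N */
  for (int s = 0; s < N - 1; s++)
    for (int r = 0; r < N; r++) {
      int prev = (r - 1 + N) % N;
      int c = ((r - s - 1) % N + N) % N;  /* chunk received from prev and accumulated */
      buf[r][c] += buf[prev][c];
    }

  /* All-gather: the fully reduced chunks make one more trip around the ring */
  for (int s = 0; s < N - 1; s++)
    for (int r = 0; r < N; r++) {
      int prev = (r - 1 + N) % N;
      int c = ((r - s) % N + N) % N;      /* reduced chunk copied from prev */
      buf[r][c] = buf[prev][c];
    }

  /* Every rank should now hold 1+2+...+N = N*(N+1)/2 in every chunk */
  for (int r = 0; r < N; r++) {
    printf("rank %d:", r);
    for (int c = 0; c < N; c++)
      printf(" %.0f", buf[r][c]);
    printf("\n");
  }
  return 0;
}

With N ranks and S bytes per rank, each rank sends and receives roughly 2*S*(N-1)/N bytes in total, which is why ring-based all-reduce bandwidth stays nearly flat as GPUs are added.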

Page 9:

9

DESIGN: Rings

NCCL uses rings to move data across all GPUs and perform reductions.

PCIe / QPI: 1 unidirectional ring

Page 10:

10

DESIGN: Rings

NCCL uses rings to move data across all GPUs and perform reductions.

PCIe / QPI: 1 unidirectional ring
DGX-1: 4 unidirectional rings

Page 11:

11

DESIGN: Kernels

[Diagram: on each GPU, a reduction kernel receives chunks from the previous GPU in the ring through a FIFO, combines them with the local sendbuff, and forwards them to the next GPU in the ring, writing results to recvbuff]

Page 12:

12

NCCL 2.0

Page 13:

13

NCCL 2.0: Inter-node communication

Inter-node communication using Sockets or Infiniband verbs, with multi-rail support, topology detection and automatic use of GPU Direct RDMA.

Optimal combination of NVLink, PCI and network interfaces to maximize bandwidth and create rings across nodes.

[Diagram: PCIe + Infiniband node, and DGX-1 with NVLink + 4x Infiniband]


Page 15:

15

NCCL 2.0: Processes, threads and GPUs

Supports a combination of processes (potentially across nodes), threads per process and GPUs per thread.

[Diagram: n nodes, each with two CPU sockets (CPU0, CPU1) and 8 GPUs, 4 GPUs per socket]

Page 16:

16

NCCL 2.0: Processes, threads and GPUs

Supports a combination of processes (potentially across nodes), threads per process and GPUs per thread.

[Diagram: the same nodes, now with 1 process per GPU: P0…P7 on node 0, P8…P15 on node 1, up to P8n-8…P8n-1 on node n-1]

Page 17:

17

NCCL 2.0: Processes, threads and GPUs

Supports a combination of processes (potentially across nodes), threads per process and GPUs per thread.

[Diagram: 1 process per CPU socket (Process 0 … Process 2n-1), each running four threads (t0-t3), 1 thread per GPU]
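For this layout, each communicator is driven by its own thread, so no group calls are needed. A simplified sketch, collapsed to a single process (in the multi-node, one-process-per-socket case of the diagram, the ncclUniqueId would additionally be exchanged between processes; thread count, buffer size and helper names are illustrative):

#include <pthread.h>
#include <stdio.h>
#include <cuda_runtime.h>
#include <nccl.h>

#define NTHREADS 4                    /* assumed: 4 GPUs driven by this process */
#define COUNT    (1 << 20)

static ncclUniqueId id;               /* same id shared by every rank */

typedef struct { int dev; int rank; int nranks; } worker_arg_t;

/* Each thread owns one GPU and one NCCL rank. */
static void *worker(void *p) {
  worker_arg_t *a = (worker_arg_t *)p;
  float *buf;
  cudaStream_t stream;
  ncclComm_t comm;

  cudaSetDevice(a->dev);
  cudaStreamCreate(&stream);
  cudaMalloc((void **)&buf, COUNT * sizeof(float));
  cudaMemset(buf, 0, COUNT * sizeof(float));

  ncclCommInitRank(&comm, a->nranks, id, a->rank);   /* blocks until all ranks join */
  ncclAllReduce(buf, buf, COUNT, ncclFloat, ncclSum, comm, stream);
  cudaStreamSynchronize(stream);

  ncclCommDestroy(comm);
  cudaFree(buf);
  cudaStreamDestroy(stream);
  return NULL;
}

int main(void) {
  pthread_t threads[NTHREADS];
  worker_arg_t args[NTHREADS];

  ncclGetUniqueId(&id);               /* single process: no out-of-band exchange needed */
  for (int i = 0; i < NTHREADS; i++) {
    args[i] = (worker_arg_t){ .dev = i, .rank = i, .nranks = NTHREADS };
    pthread_create(&threads[i], NULL, worker, &args[i]);
  }
  for (int i = 0; i < NTHREADS; i++)
    pthread_join(threads[i], NULL);
  printf("done\n");
  return 0;
}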

Page 18:

18

NCCL 2.0: Processes, threads and GPUs

Supports a combination of processes (potentially across nodes), threads per process and GPUs per thread.

[Diagram: 1 process per node (Process 0 … Process n-1), each driving all 8 GPUs of its node]

Page 19:

19

NCCL 2.0 API: Group calls

NCCL 2.0 introduces new verbs, ncclGroupStart/ncclGroupEnd, which are mandatory when managing multiple devices from a single thread.

NCCL 1.x:

for (int i=0; i<ngpus; i++) {
  cudaSetDevice(devices[i]);
  ncclAllReduce(…, comms[i], streams[i]);
}

NCCL 2.0:

ncclGroupStart();
for (int i=0; i<ngpus; i++) {
  ncclAllReduce(…, comms[i], streams[i]);
}
ncclGroupEnd();

[Diagram: a single process (Process 0) driving all 8 GPUs of a node (CPU0, CPU1, GPU 0-7)]
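Putting group calls together with communicator creation, a self-contained single-process sketch might look as follows (two GPUs and the buffer size are assumptions; ncclCommInitAll is the single-process initializer available alongside ncclCommInitRank; error checking is omitted for brevity):

#include <stdio.h>
#include <cuda_runtime.h>
#include <nccl.h>

#define NGPUS 2
#define COUNT (32 * 1024 * 1024)

int main(void) {
  int devs[NGPUS] = {0, 1};
  ncclComm_t comms[NGPUS];
  cudaStream_t streams[NGPUS];
  float *sendbuff[NGPUS], *recvbuff[NGPUS];

  /* One communicator per GPU, all owned by this single process/thread */
  ncclCommInitAll(comms, NGPUS, devs);

  for (int i = 0; i < NGPUS; i++) {
    cudaSetDevice(devs[i]);
    cudaStreamCreate(&streams[i]);
    cudaMalloc((void**)&sendbuff[i], COUNT * sizeof(float));
    cudaMalloc((void**)&recvbuff[i], COUNT * sizeof(float));
    cudaMemset(sendbuff[i], 1, COUNT * sizeof(float));  /* arbitrary byte pattern */
  }

  /* Group the calls so one thread can drive every device without deadlocking */
  ncclGroupStart();
  for (int i = 0; i < NGPUS; i++)
    ncclAllReduce(sendbuff[i], recvbuff[i], COUNT, ncclFloat, ncclSum,
                  comms[i], streams[i]);
  ncclGroupEnd();

  /* NCCL only enqueues work; wait for the streams before using the results */
  for (int i = 0; i < NGPUS; i++) {
    cudaSetDevice(devs[i]);
    cudaStreamSynchronize(streams[i]);
  }

  for (int i = 0; i < NGPUS; i++) {
    cudaSetDevice(devs[i]);
    cudaFree(sendbuff[i]);
    cudaFree(recvbuff[i]);
    cudaStreamDestroy(streams[i]);
    ncclCommDestroy(comms[i]);
  }
  printf("done\n");
  return 0;
}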

Page 20:

20

NCCL 2.0 API: Integration with parallel environments

Inter-node communicator creation still uses the NCCL 1.x verbs ncclGetUniqueId/ncclCommInitRank:

if (rank == 0) ncclGetUniqueId(&id);
My_Bcast(&id);
ncclCommInitRank(&comm, nranks, id, rank);

Multi-process + multi-GPU per process (from a single thread): combine ncclCommInitRank with ncclGroupStart/ncclGroupEnd:

if (rank == 0) ncclGetUniqueId(&id);
My_Bcast(&id);
ncclGroupStart();
for (int i=0; i<ndev; i++) {
  cudaSetDevice(devices[i]);
  ncclCommInitRank(&comms[i], ndev*nranks, id, ndev*rank+i);
}
ncclGroupEnd();

[Diagrams: one process per GPU (P0-P7 across CPU0/CPU1), and a single process (Process 0) driving all 8 GPUs of a node]
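The My_Bcast placeholder maps directly onto MPI. A sketch of the one-process-per-GPU case using MPI_Bcast to distribute the id (deriving the local rank via MPI_Comm_split_type is an assumption, not from the slides):

#include <mpi.h>
#include <cuda_runtime.h>
#include <nccl.h>

int main(int argc, char* argv[]) {
  int rank, nranks, localRank;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nranks);

  /* Rank within the node, used to pick a GPU (one process per GPU) */
  MPI_Comm local;
  MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, rank, MPI_INFO_NULL, &local);
  MPI_Comm_rank(local, &localRank);
  cudaSetDevice(localRank);

  /* Rank 0 creates the id, everyone else receives it over MPI */
  ncclUniqueId id;
  if (rank == 0) ncclGetUniqueId(&id);
  MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);

  ncclComm_t comm;
  ncclCommInitRank(&comm, nranks, id, rank);

  /* ... collectives on this rank's GPU go here ... */

  ncclCommDestroy(comm);
  MPI_Comm_free(&local);
  MPI_Finalize();
  return 0;
}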

Page 21:

21

NCCL 2.0 API: Others

Other small API adjustments over the NCCL 1.x API:

Counts are now of type size_t instead of int.

The allGather argument order has been fixed to match the other operations.

Additions/clarifications on datatypes:
  integral: int8 = char, uint8, int32 = int, uint32, int64, uint64
  floating point: float16 = half, float32 = float, float64 = double

Clarifications and fixes for allgather and reduce_scatter send/receive counts and in-place operations.
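As a small illustration of the new allGather ordering, where recvbuff directly follows sendbuff and the count is a size_t (the wrapper name is illustrative):

#include <cuda_runtime.h>
#include <nccl.h>

/* NCCL 2.0 allGather: recvbuff directly follows sendbuff, as in the other
   collectives, and the count is a size_t. recvbuff must have room for
   nranks * sendcount elements. */
ncclResult_t gather_all(const float *d_send, float *d_recv, size_t sendcount,
                        ncclComm_t comm, cudaStream_t stream)
{
  return ncclAllGather(d_send, d_recv, sendcount, ncclFloat, comm, stream);
}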

Page 22:

22

PERFORMANCE

Page 23:

23

PERFORMANCE: Intra-node performance

[Bar chart: AllReduce bandwidth (OMB, size=128MB, in GB/s) for 4 QPI, 4 CPU, 4 PCI and DGX-1 configurations; y-axis 0-60 GB/s]

Page 24:

24

PERFORMANCE: Inter-node performance

[Bar chart: AllReduce bandwidth (OMB, size=128MB, in GB/s) comparing MPI, Baidu Allreduce and NCCL on 2 nodes x 4 GPUs (IB EDR, PCI Switch) and 4 nodes x 8 GPUs (DGX-1: 4x IB EDR, 4x NVLink); y-axis 0-45 GB/s]

Page 25:

25

PERFORMANCE: Deep Learning - CNTK

[Line chart: CNTK scaling, ResNet50, images/s vs. number of GPUs (0-32), comparing Ideal, MPI and NCCL; data labels on the chart: 217, 1684, 3281, 6569, 1645, 1744, 3360]

Page 26:

26

FUTURE

Page 27:

27

FUTURE: Top asked features

Additional communication primitives:
  point-to-point communication
  scatter (1 to N), gather (N to 1), alltoall (N to N)
  neighbor collectives (send/receive in multiple dimensions)

User-defined reduction operations; also, trying to better merge computation and communication.

Windows support.

Please let us know your needs! Connect with experts / NCCL session: Wed Apr 10, 4pm.
