Sylvain Jeaugey NCCL 2.0



Page 1:

Sylvain Jeaugey

NCCL 2.0

Page 2:

2

DEEP LEARNING ON GPUS: Making DL training times shorter

[Diagram: Multi-core CPU → GPU (CUDA) → Multi-GPU (NCCL 1) → Multi-GPU, multi-node (NCCL 2)]

Deeper neural networks, larger data sets… training is a very, very long operation!

Page 3:

3

NCCL: A multi-GPU communication library

Within a system: PCIe, NVLink, GPU Direct P2P

To other systems: Sockets (Ethernet), Infiniband with GPU Direct RDMA

Page 4:

4

NCCL Architecture

[Stack diagram: Deep Learning Frameworks (Caffe, Caffe2, Torch, TF, MXNET, CNTK) on top of NCCL, CUDNN and CUBLAS, running on CUDA and NVIDIA GPUs]

Page 5:

5

AGENDA

NCCL: History, Design

NCCL 2.0: New features, API changes

Performance

Future

Page 6:

6

HISTORY

Q4 2015: NCCL 1.x

Open-source research project on GitHub, helping Deep Learning frameworks compute on multiple GPUs with efficient collective operations.

Limited to intra-node.

Q2 2017: NCCL 2.x and beyond

NVIDIA library, with multi-node support and an improved API.

Page 7:

7

DESIGN: What is NCCL?

Optimized collective communication library between CUDA devices.

Easy to integrate into any DL framework, as well as traditional HPC apps using MPI.

Runs on the GPU using asynchronous CUDA kernels, for faster access to GPU memory, parallel reductions, and NVLink usage.

Operates on CUDA pointers. Operations are tied to a CUDA stream.

Uses as few threads as possible, to permit other computation to progress simultaneously.

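A minimal sketch of what this looks like from the caller's side, assuming a communicator and stream have already been created (the wrapper name and the in-place float/sum choice are illustrative, not from the slides):

#include <cuda_runtime.h>
#include <nccl.h>

/* Sum-reduce `count` floats in place across all ranks of `comm`.
   ncclAllReduce only enqueues work on `stream`; the device buffer d_data
   holds the result once the stream has been synchronized. */
ncclResult_t allreduce_inplace(float *d_data, size_t count,
                               ncclComm_t comm, cudaStream_t stream)
{
  ncclResult_t res = ncclAllReduce(d_data, d_data, count, ncclFloat, ncclSum,
                                   comm, stream);
  if (res != ncclSuccess) return res;
  cudaStreamSynchronize(stream);  /* only needed when the host must see the result now */
  return ncclSuccess;
}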

Page 8:

8

DESIGN: Rings

NCCL uses rings to move data across all GPUs and perform reductions.
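A conceptual host-side sketch of the ring all-reduce pattern (reduce-scatter followed by all-gather), not NCCL's actual GPU implementation; the 4-rank ring and per-rank values are illustrative:

#include <stdio.h>

#define N 4   /* number of GPUs in the ring (assumed) */

int main(void) {
  /* buf[r][c] is chunk c held by rank r; every rank starts with the value r+1 in each chunk */
  float buf[N][N];
  for (int r = 0; r < N; r++)
    for (int c = 0; c < N; c++)
      buf[r][c] = (float)(r + 1);

  /* Reduce-scatter: after N-1 steps, rank r holds the fully reduced chunk (r+1) % N */
  for (int s = 0; s < N - 1; s++)
    for (int r = 0; r < N; r++) {
      int prev = (r - 1 + N) % N;
      int c = ((r - s - 1) % N + N) % N;  /* chunk received from prev and accumulated */
      buf[r][c] += buf[prev][c];
    }

  /* All-gather: the fully reduced chunks make one more trip around the ring */
  for (int s = 0; s < N - 1; s++)
    for (int r = 0; r < N; r++) {
      int prev = (r - 1 + N) % N;
      int c = ((r - s) % N + N) % N;      /* reduced chunk copied from prev */
      buf[r][c] = buf[prev][c];
    }

  /* Every rank should now hold 1+2+...+N = N*(N+1)/2 in every chunk */
  for (int r = 0; r < N; r++) {
    printf("rank %d:", r);
    for (int c = 0; c < N; c++)
      printf(" %.0f", buf[r][c]);
    printf("\n");
  }
  return 0;
}

With N ranks and S bytes per rank, each rank sends and receives roughly 2*S*(N-1)/N bytes in total, which is why ring-based all-reduce bandwidth stays nearly flat as GPUs are added.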

Page 9:

9

DESIGN: Rings

NCCL uses rings to move data across all GPUs and perform reductions.

PCIe / QPI: 1 unidirectional ring

Page 10:

10

DESIGN: Rings

NCCL uses rings to move data across all GPUs and perform reductions.

PCIe / QPI: 1 unidirectional ring
DGX-1: 4 unidirectional rings

Page 11:

11

DESIGN: Kernels

[Diagram: on each GPU, a reduction kernel receives chunks from the previous GPU in the ring through a FIFO, combines them with the local sendbuff, and forwards them to the next GPU in the ring, writing results to recvbuff]

Page 12:

12

NCCL 2.0

Page 13:

13

NCCL 2.0: Inter-node communication

Inter-node communication using Sockets or Infiniband verbs, with multi-rail support, topology detection and automatic use of GPU Direct RDMA.

Optimal combination of NVLink, PCI and network interfaces to maximize bandwidth and create rings across nodes.

[Diagram: PCIe + Infiniband node, and DGX-1 with NVLink + 4x Infiniband]


Page 15:

15

NCCL 2.0: Processes, threads and GPUs

Supports a combination of processes (potentially across nodes), threads per process and GPUs per thread.

[Diagram: n nodes, each with two CPU sockets (CPU0, CPU1) and 8 GPUs, 4 GPUs per socket]

Page 16:

16

NCCL 2.0: Processes, threads and GPUs

Supports a combination of processes (potentially across nodes), threads per process and GPUs per thread.

[Diagram: the same nodes, now with 1 process per GPU: P0…P7 on node 0, P8…P15 on node 1, up to P8n-8…P8n-1 on node n-1]

Page 17:

17

NCCL 2.0: Processes, threads and GPUs

Supports a combination of processes (potentially across nodes), threads per process and GPUs per thread.

[Diagram: 1 process per CPU socket (Process 0 … Process 2n-1), each running four threads (t0-t3), 1 thread per GPU]
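For this layout, each communicator is driven by its own thread, so no group calls are needed. A simplified sketch, collapsed to a single process (in the multi-node, one-process-per-socket case of the diagram, the ncclUniqueId would additionally be exchanged between processes; thread count, buffer size and helper names are illustrative):

#include <pthread.h>
#include <stdio.h>
#include <cuda_runtime.h>
#include <nccl.h>

#define NTHREADS 4                    /* assumed: 4 GPUs driven by this process */
#define COUNT    (1 << 20)

static ncclUniqueId id;               /* same id shared by every rank */

typedef struct { int dev; int rank; int nranks; } worker_arg_t;

/* Each thread owns one GPU and one NCCL rank. */
static void *worker(void *p) {
  worker_arg_t *a = (worker_arg_t *)p;
  float *buf;
  cudaStream_t stream;
  ncclComm_t comm;

  cudaSetDevice(a->dev);
  cudaStreamCreate(&stream);
  cudaMalloc((void **)&buf, COUNT * sizeof(float));
  cudaMemset(buf, 0, COUNT * sizeof(float));

  ncclCommInitRank(&comm, a->nranks, id, a->rank);   /* blocks until all ranks join */
  ncclAllReduce(buf, buf, COUNT, ncclFloat, ncclSum, comm, stream);
  cudaStreamSynchronize(stream);

  ncclCommDestroy(comm);
  cudaFree(buf);
  cudaStreamDestroy(stream);
  return NULL;
}

int main(void) {
  pthread_t threads[NTHREADS];
  worker_arg_t args[NTHREADS];

  ncclGetUniqueId(&id);               /* single process: no out-of-band exchange needed */
  for (int i = 0; i < NTHREADS; i++) {
    args[i] = (worker_arg_t){ .dev = i, .rank = i, .nranks = NTHREADS };
    pthread_create(&threads[i], NULL, worker, &args[i]);
  }
  for (int i = 0; i < NTHREADS; i++)
    pthread_join(threads[i], NULL);
  printf("done\n");
  return 0;
}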

Page 18:

18

NCCL 2.0: Processes, threads and GPUs

Supports a combination of processes (potentially across nodes), threads per process and GPUs per thread.

[Diagram: 1 process per node (Process 0 … Process n-1), each driving all 8 GPUs of its node]

Page 19:

19

NCCL 2.0 API: Group calls

NCCL 2.0 introduces new verbs, ncclGroupStart/ncclGroupEnd, which are mandatory when managing multiple devices from a single thread.

NCCL 1.x:

for (int i=0; i<ngpus; i++) {
  cudaSetDevice(devices[i]);
  ncclAllReduce(…, comms[i], streams[i]);
}

NCCL 2.0:

ncclGroupStart();
for (int i=0; i<ngpus; i++) {
  ncclAllReduce(…, comms[i], streams[i]);
}
ncclGroupEnd();

[Diagram: a single process (Process 0) driving all 8 GPUs of a node (CPU0, CPU1, GPU 0-7)]
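Putting group calls together with communicator creation, a self-contained single-process sketch might look as follows (two GPUs and the buffer size are assumptions; ncclCommInitAll is the single-process initializer available alongside ncclCommInitRank; error checking is omitted for brevity):

#include <stdio.h>
#include <cuda_runtime.h>
#include <nccl.h>

#define NGPUS 2
#define COUNT (32 * 1024 * 1024)

int main(void) {
  int devs[NGPUS] = {0, 1};
  ncclComm_t comms[NGPUS];
  cudaStream_t streams[NGPUS];
  float *sendbuff[NGPUS], *recvbuff[NGPUS];

  /* One communicator per GPU, all owned by this single process/thread */
  ncclCommInitAll(comms, NGPUS, devs);

  for (int i = 0; i < NGPUS; i++) {
    cudaSetDevice(devs[i]);
    cudaStreamCreate(&streams[i]);
    cudaMalloc((void**)&sendbuff[i], COUNT * sizeof(float));
    cudaMalloc((void**)&recvbuff[i], COUNT * sizeof(float));
    cudaMemset(sendbuff[i], 1, COUNT * sizeof(float));  /* arbitrary byte pattern */
  }

  /* Group the calls so one thread can drive every device without deadlocking */
  ncclGroupStart();
  for (int i = 0; i < NGPUS; i++)
    ncclAllReduce(sendbuff[i], recvbuff[i], COUNT, ncclFloat, ncclSum,
                  comms[i], streams[i]);
  ncclGroupEnd();

  /* NCCL only enqueues work; wait for the streams before using the results */
  for (int i = 0; i < NGPUS; i++) {
    cudaSetDevice(devs[i]);
    cudaStreamSynchronize(streams[i]);
  }

  for (int i = 0; i < NGPUS; i++) {
    cudaSetDevice(devs[i]);
    cudaFree(sendbuff[i]);
    cudaFree(recvbuff[i]);
    cudaStreamDestroy(streams[i]);
    ncclCommDestroy(comms[i]);
  }
  printf("done\n");
  return 0;
}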

Page 20:

20

NCCL 2.0 API: Integration with parallel environments

Inter-node communicator creation still uses the NCCL 1.x verbs ncclGetUniqueId/ncclCommInitRank:

if (rank == 0) ncclGetUniqueId(&id);
My_Bcast(&id);
ncclCommInitRank(&comm, nranks, id, rank);

Multi-process + multi-GPU per process (from a single thread): combine ncclCommInitRank with ncclGroupStart/ncclGroupEnd:

if (rank == 0) ncclGetUniqueId(&id);
My_Bcast(&id);
ncclGroupStart();
for (int i=0; i<ndev; i++) {
  cudaSetDevice(devices[i]);
  ncclCommInitRank(&comms[i], ndev*nranks, id, ndev*rank+i);
}
ncclGroupEnd();

[Diagrams: one process per GPU (P0-P7 across CPU0/CPU1), and a single process (Process 0) driving all 8 GPUs of a node]
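The My_Bcast placeholder maps directly onto MPI. A sketch of the one-process-per-GPU case using MPI_Bcast to distribute the id (deriving the local rank via MPI_Comm_split_type is an assumption, not from the slides):

#include <mpi.h>
#include <cuda_runtime.h>
#include <nccl.h>

int main(int argc, char* argv[]) {
  int rank, nranks, localRank;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nranks);

  /* Rank within the node, used to pick a GPU (one process per GPU) */
  MPI_Comm local;
  MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, rank, MPI_INFO_NULL, &local);
  MPI_Comm_rank(local, &localRank);
  cudaSetDevice(localRank);

  /* Rank 0 creates the id, everyone else receives it over MPI */
  ncclUniqueId id;
  if (rank == 0) ncclGetUniqueId(&id);
  MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);

  ncclComm_t comm;
  ncclCommInitRank(&comm, nranks, id, rank);

  /* ... collectives on this rank's GPU go here ... */

  ncclCommDestroy(comm);
  MPI_Comm_free(&local);
  MPI_Finalize();
  return 0;
}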

Page 21:

21

NCCL 2.0 API: Others

Other small API adjustments over the NCCL 1.x API:

Counts are now of type size_t instead of int.

The allGather argument order has been fixed to match the other operations.

Additions/clarifications on datatypes:
  integral: int8 = char, uint8, int32 = int, uint32, int64, uint64
  floating point: float16 = half, float32 = float, float64 = double

Clarifications and fixes for allgather and reduce_scatter send/receive counts and in-place operations.
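As a small illustration of the new allGather ordering, where recvbuff directly follows sendbuff and the count is a size_t (the wrapper name is illustrative):

#include <cuda_runtime.h>
#include <nccl.h>

/* NCCL 2.0 allGather: recvbuff directly follows sendbuff, as in the other
   collectives, and the count is a size_t. recvbuff must have room for
   nranks * sendcount elements. */
ncclResult_t gather_all(const float *d_send, float *d_recv, size_t sendcount,
                        ncclComm_t comm, cudaStream_t stream)
{
  return ncclAllGather(d_send, d_recv, sendcount, ncclFloat, comm, stream);
}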

Page 22:

22

PERFORMANCE

Page 23:

23

PERFORMANCE: Intra-node performance

[Bar chart: AllReduce bandwidth (OMB, size=128MB, in GB/s) for 4 QPI, 4 CPU, 4 PCI and DGX-1 configurations; y-axis 0-60 GB/s]

Page 24:

24

PERFORMANCE: Inter-node performance

[Bar chart: AllReduce bandwidth (OMB, size=128MB, in GB/s) comparing MPI, Baidu Allreduce and NCCL on 2 nodes x 4 GPUs (IB EDR, PCI Switch) and 4 nodes x 8 GPUs (DGX-1: 4x IB EDR, 4x NVLink); y-axis 0-45 GB/s]

Page 25:

25

PERFORMANCE: Deep Learning - CNTK

[Line chart: CNTK scaling, ResNet50, images/s vs. number of GPUs (0-32), comparing Ideal, MPI and NCCL; data labels on the chart: 217, 1684, 3281, 6569, 1645, 1744, 3360]

Page 26:

26

FUTURE

Page 27:

27

FUTURE: Top asked features

Additional communication primitives:
  point-to-point communication
  scatter (1 to N), gather (N to 1), alltoall (N to N)
  neighbor collectives (send/receive in multiple dimensions)

User-defined reduction operations; also, trying to better merge computation and communication.

Windows support.

Please let us know your needs! Connect with experts / NCCL session: Wed Apr 10, 4pm.
