Multi-GPU Systems


Page 1

Multi-GPU Systems

Page 2

GPU becoming more specialized

Modern GPU “Processing Block”
● 32 Threads
● 16 INT
● 16 single-precision FP
● 8 double-precision FP
● 4 SFU (sin, cos, log)
● 2 Tensor units for DNN
● 64KB RF

Page 3

GPU Streaming Multiprocessor

● Contains 4 “Processing Blocks”
● Each independently schedules a set of 32 threads called a warp
● Share L1 Cache between blocks

Page 4

GPU Hardware

● V100 has 80 SMs
● 5376 FPUs
● Peak 15.7 TFLOPS
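As a rough sanity check on the peak number (the ~1.53 GHz boost clock is my assumption; it is not stated on the slide): 80 SMs × 64 FP32 units per SM = 5120 FMA units, and 5120 × 2 FLOPs per FMA × ~1.53 GHz ≈ 15.7 TFLOPS single precision. The 5376 figure matches the full GV100 die (84 SMs × 64); the V100 product enables 80 of those SMs.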

Page 5

GPU “Data center in a box”

› DGX
› A Multi-GPU “Node”
› 300 GB/s NVLink 2.0 cube mesh
› 1 PFLOPS
› Faster Machine Learning

Page 6

[Figure]

Page 7

DGX Data Center

Page 8

GPU Support in Cloud Computing Stack

Page 9

GPUs in the Cloud

› Exponential demand for more compute power

Page 10

GPU inter-connection is getting complex

Picture sources: NVLink whitepaper

[Figure: four NVLink links at 20 GB/s each per direction; 80 (in) + 80 (out) = 160 GB/s total]

Page 11

GPU inter-connection is getting complex

Picture sources: NVLink whitepaper

[Figure: NVLink 2.0 links at 25 GB/s per direction, with some GPU pairs connected by two links (50 GB/s); 150 (in) + 150 (out) = 300 GB/s total]

Page 12

GPU inter-connection is getting complex

Page 13

How can we make efficient use of GPU inter-connects?

Page 14

NVLink: Fast communication between multiple GPUs

Page 15

Challenges of complex GPU inter-connects

› Programming Multi-GPU applications is hard

Page 16

Cliff Woolley, Sr. Manager, Developer Technology Software, NVIDIA

NCCL: ACCELERATED MULTI-GPU COLLECTIVE COMMUNICATIONS

Page 17

BACKGROUND

Efficiency of parallel computation tasks

• Amount of exposed parallelism

• Amount of work assigned to each processor

Expense of communications among tasks

• Amount of communication

• Degree of overlap of communication with computation

What limits the scalability of parallel applications?
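The last point, overlap, is easiest to see in code. Below is a minimal CUDA sketch of overlapping communication with computation using two streams; it is my own illustration (the kernel process, the buffers, and the chunk sizes are placeholders, and error checking is omitted), not something from the slides.

#include <cuda_runtime.h>

__global__ void process(float* chunk, int n) {        // placeholder compute kernel
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) chunk[i] *= 2.0f;
}

// Compute chunk i on one stream while chunk i is copied out on another,
// so the copy of chunk i overlaps with the kernel for chunk i+1.
void pipelined_copy_out(float* d_buf, float* h_buf, int nChunks, int chunkElems) {
  size_t chunkBytes = (size_t)chunkElems * sizeof(float);
  cudaStream_t compute, copy;
  cudaEvent_t chunkDone;
  cudaStreamCreate(&compute);
  cudaStreamCreate(&copy);
  cudaEventCreate(&chunkDone);
  for (int i = 0; i < nChunks; ++i) {
    float* d_chunk = d_buf + (size_t)i * chunkElems;
    process<<<(chunkElems + 255) / 256, 256, 0, compute>>>(d_chunk, chunkElems);
    cudaEventRecord(chunkDone, compute);               // "chunk i is computed"
    cudaStreamWaitEvent(copy, chunkDone, 0);           // the copy waits only for chunk i
    cudaMemcpyAsync(h_buf + (size_t)i * chunkElems, d_chunk, chunkBytes,
                    cudaMemcpyDeviceToHost, copy);     // overlaps with chunk i+1's kernel
  }
  cudaDeviceSynchronize();
}

(h_buf should be pinned host memory, e.g. from cudaMallocHost, for the copy to be truly asynchronous.)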

Page 18

COMMON COMMUNICATION PATTERNS

Page 19

COMMUNICATION AMONG TASKS

Point-to-point communication

• Single sender, single receiver

• Relatively easy to implement efficiently

Collective communication

• Multiple senders and/or receivers

• Patterns include broadcast, scatter, gather, reduce, all-to-all, …

• Difficult to implement efficiently

What are common communication patterns?

Page 20

POINT-TO-POINT COMMUNICATION

Most common pattern in HPC, where communication is usually to nearest neighbors

Single-sender, single-receiver per instance

2D Decomposition

Boundary exchanges
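For the boundary exchanges above, a minimal MPI sketch of a 1D halo exchange (my own illustration, not from the slides; the row layout with one ghost row above and below is an assumption): each rank swaps one row with its upper and lower neighbors.

#include <mpi.h>

// local holds (nrows + 2) x ncols doubles: one ghost row above, nrows interior
// rows, one ghost row below. Exchange boundary rows with the up/down neighbors.
void exchange_halos(double* local, int nrows, int ncols, MPI_Comm comm) {
  int rank, size;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm_size(comm, &size);
  int up   = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
  int down = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

  // Send the top interior row up; receive the bottom ghost row from below.
  MPI_Sendrecv(local + 1 * ncols,           ncols, MPI_DOUBLE, up,   0,
               local + (nrows + 1) * ncols, ncols, MPI_DOUBLE, down, 0,
               comm, MPI_STATUS_IGNORE);
  // Send the bottom interior row down; receive the top ghost row from above.
  MPI_Sendrecv(local + nrows * ncols,       ncols, MPI_DOUBLE, down, 1,
               local,                       ncols, MPI_DOUBLE, up,   1,
               comm, MPI_STATUS_IGNORE);
}

Each instance is a single sender and a single receiver, which is why this pattern is comparatively easy to implement efficiently.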

Page 21

COLLECTIVE COMMUNICATION
Multiple senders and/or receivers

Page 22

BROADCAST
One sender, multiple receivers

Before:           GPU0: A        GPU1: -    GPU2: -    GPU3: -
After broadcast:  GPU0: A        GPU1: A    GPU2: A    GPU3: A

Page 23

SCATTER
One sender; data is distributed among multiple receivers

Before:         GPU0: A B C D    GPU1: -    GPU2: -    GPU3: -
After scatter:  GPU0: A          GPU1: B    GPU2: C    GPU3: D

Page 24

GATHER
Multiple senders, one receiver

Before:        GPU0: A          GPU1: B    GPU2: C    GPU3: D
After gather:  GPU0: A B C D    GPU1: -    GPU2: -    GPU3: -

Page 25

ALL-GATHER
Gather messages from all; deliver gathered data to all participants

Before:            GPU0: A          GPU1: B          GPU2: C          GPU3: D
After all-gather:  GPU0: A B C D    GPU1: A B C D    GPU2: A B C D    GPU3: A B C D

Page 26

REDUCE
Combine data from all senders; deliver the result to one receiver

Before:        GPU0: A          GPU1: B    GPU2: C    GPU3: D
After reduce:  GPU0: A+B+C+D    GPU1: -    GPU2: -    GPU3: -

Page 27

ALL-REDUCE
Combine data from all senders; deliver the result to all participants

Before:            GPU0: A          GPU1: B          GPU2: C          GPU3: D
After all-reduce:  GPU0: A+B+C+D    GPU1: A+B+C+D    GPU2: A+B+C+D    GPU3: A+B+C+D

Page 28

REDUCE-SCATTER
Combine data from all senders; distribute result across participants

Before:                GPU0: A0 A1 A2 A3    GPU1: B0 B1 B2 B3    GPU2: C0 C1 C2 C3    GPU3: D0 D1 D2 D3
After reduce-scatter:  GPU0: A0+B0+C0+D0    GPU1: A1+B1+C1+D1    GPU2: A2+B2+C2+D2    GPU3: A3+B3+C3+D3

Page 29

ALL-TO-ALL
Scatter/Gather distinct messages from each participant to every other

Before:            GPU0: A0 A1 A2 A3    GPU1: B0 B1 B2 B3    GPU2: C0 C1 C2 C3    GPU3: D0 D1 D2 D3
After all-to-all:  GPU0: A0 B0 C0 D0    GPU1: A1 B1 C1 D1    GPU2: A2 B2 C2 D2    GPU3: A3 B3 C3 D3
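For reference, and since the NCCL part of the talk patterns its API after MPI, the patterns above map roughly onto the following MPI calls (a sketch assuming one rank per GPU; n is the per-rank element count, buffers are sized for the largest case, and the names are placeholders):

#include <mpi.h>

void collective_patterns(double* sendbuf, double* recvbuf, double* buf,
                         int n, int root, MPI_Comm comm) {
  MPI_Bcast(buf, n, MPI_DOUBLE, root, comm);                                   // broadcast
  MPI_Scatter(sendbuf, n, MPI_DOUBLE, recvbuf, n, MPI_DOUBLE, root, comm);     // scatter
  MPI_Gather(sendbuf, n, MPI_DOUBLE, recvbuf, n, MPI_DOUBLE, root, comm);      // gather
  MPI_Allgather(sendbuf, n, MPI_DOUBLE, recvbuf, n, MPI_DOUBLE, comm);         // all-gather
  MPI_Reduce(sendbuf, recvbuf, n, MPI_DOUBLE, MPI_SUM, root, comm);            // reduce
  MPI_Allreduce(sendbuf, recvbuf, n, MPI_DOUBLE, MPI_SUM, comm);               // all-reduce
  MPI_Reduce_scatter_block(sendbuf, recvbuf, n, MPI_DOUBLE, MPI_SUM, comm);    // reduce-scatter
  MPI_Alltoall(sendbuf, n, MPI_DOUBLE, recvbuf, n, MPI_DOUBLE, comm);          // all-to-all
}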

Page 30

THE CHALLENGE OF COLLECTIVES

Page 31

THE CHALLENGE OF COLLECTIVES

Having multiple senders and/or receivers compounds communication inefficiencies

• For small transfers, latencies dominate; more participants increase latency

• For large transfers, bandwidth is key; bottlenecks are easily exposed

• May require topology-aware implementation for high performance

• Collectives are often blocking/non-overlapped

Collectives are often avoided because they are expensive. Why?

Page 32

THE CHALLENGE OF COLLECTIVES

Collectives are central to scalability in a variety of key applications:

• Deep Learning (All-reduce, broadcast, gather)

• Parallel FFT (Transposition is all-to-all)

• Molecular Dynamics (All-reduce)

• Graph Analytics (All-to-all)

• …

If collectives are so expensive, do they actually get used? YES!

Page 33

THE CHALLENGE OF COLLECTIVES

Scaling requires efficient communication algorithms and careful implementation

Communication algorithms are topology-dependent

Topologies can be complex – not every system is a fat tree

Most collectives are amenable to bandwidth-optimal implementation on rings, and many topologies can be interpreted as one or more rings [P. Patarasuk and X. Yuan]

Many implementations seen in the wild are suboptimal

Page 34

RING-BASED COLLECTIVES: A PRIMER

Pages 35-39

BROADCAST
with unidirectional ring

GPU0 GPU1 GPU2 GPU3

Step 1: Δt = N/B
Step 2: Δt = N/B
Step 3: Δt = N/B
Total time: (k − 1) N/B

N: bytes to broadcast
B: bandwidth of each link
k: number of GPUs

Pages 40-45

BROADCAST
with unidirectional ring

GPU0 GPU1 GPU2 GPU3

Split data into S messages

Step 1: Δt = N/(SB)
Step 2: Δt = N/(SB)
Step 3: Δt = N/(SB)
Step 4: Δt = N/(SB)
...
Total time: S · N/(SB) + (k − 2) · N/(SB) = N(S + k − 2)/(SB) → N/B as S grows

N: bytes to broadcast
B: bandwidth of each link
k: number of GPUs
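A quick numeric check of the two totals (my own illustration; the values of N, B, and k are arbitrary examples):

#include <stdio.h>

int main(void) {
  double N = 1e9;     // bytes to broadcast
  double B = 12e9;    // per-link bandwidth in bytes/s (~12 GB/s, PCIe Gen3 x16)
  int    k = 4;       // GPUs in the ring
  printf("unpipelined: %.4f s\n", (k - 1) * N / B);           // (k - 1) N/B
  for (int S = 1; S <= 1024; S *= 4)
    printf("S = %4d messages: %.4f s\n", S,
           N * (S + k - 2) / (S * B));                        // N(S + k - 2)/(SB) -> N/B
  return 0;
}

With S = 1 this reduces to the unpipelined (k − 1) N/B; as S grows the time approaches N/B, i.e. pipelining hides the extra hops.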

Pages 46-57

ALL-REDUCE
with unidirectional ring

GPU0 GPU1 GPU2 GPU3

[Figure sequence: the data is split into chunks that circulate around the ring; the slides step through Chunk 1 (Steps 0-7) and Chunk 2 (Steps 0-2) until every GPU holds the fully reduced result (done)]
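The idea behind the animation can be written down compactly. The following is a single-process model of a ring all-reduce over plain arrays (my own sketch, not NCCL's kernel implementation and not the slides' exact schedule): the buffer is split into k chunks, partial sums travel once around the ring (reduce-scatter), then the finished chunks travel around once more (all-gather).

// data[g] points to n doubles owned by participant g; on return every data[g]
// holds the elementwise sum over all k participants. Assumes n % k == 0.
void ring_allreduce(double** data, int k, int n) {
  int chunk = n / k;
  // Phase 1: reduce-scatter. After k-1 steps participant g owns the fully
  // reduced chunk (g + 1) mod k.
  for (int s = 0; s < k - 1; ++s)
    for (int g = 0; g < k; ++g) {
      int src = (g - 1 + k) % k;             // left neighbor on the ring
      int c   = ((g - 1 - s) % k + k) % k;   // chunk received in this step
      for (int i = 0; i < chunk; ++i)
        data[g][c * chunk + i] += data[src][c * chunk + i];
    }
  // Phase 2: all-gather. The finished chunks make one more trip around the ring.
  for (int s = 0; s < k - 1; ++s)
    for (int g = 0; g < k; ++g) {
      int src = (g - 1 + k) % k;
      int c   = ((g - s) % k + k) % k;       // fully reduced chunk received in this step
      for (int i = 0; i < chunk; ++i)
        data[g][c * chunk + i] = data[src][c * chunk + i];
    }
}

Each participant sends and receives only n/k elements per step over 2(k − 1) steps, which is the property that makes ring collectives bandwidth-optimal in the sense of the Patarasuk and Yuan reference above.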

Page 58

RING-BASED COLLECTIVES
A primer

[Figure: 4-GPU-PCIe topology: GPU0, GPU1, GPU2, GPU3 attached to the CPU through a PCIe switch; PCIe Gen3 x16, ~12 GB/s]

Page 59

RING-BASED COLLECTIVES
A primer

[Figure: the same 4-GPU-PCIe topology (GPU0-GPU3 behind a PCIe switch, PCIe Gen3 x16, ~12 GB/s), repeated from the previous slide]

Page 60

RING-BASED COLLECTIVES
…apply to lots of possible topologies

[Figure: 8-GPU dual-socket topology: GPU0-GPU3 behind one PCIe switch and GPU4-GPU7 behind another, each switch attached to a CPU; PCIe Gen3 x16 ~12 GB/s; the CPUs linked by an SMP connection (e.g., QPI)]

Page 61

RING-BASED COLLECTIVES
…apply to lots of possible topologies

[Figure: two NVLink topologies, 4-GPU-FC and 4-GPU-Ring, each with GPU0-GPU3 also behind a PCIe switch and CPU; NVLink ~16 GB/s, PCIe Gen3 x16 ~12 GB/s, ring paths annotated 32 GB/s]

Page 62

RING-BASED COLLECTIVES
…apply to lots of possible topologies

[Figure: the same 4-GPU-FC and 4-GPU-Ring NVLink topologies (NVLink ~16 GB/s, PCIe Gen3 x16 ~12 GB/s)]
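Before rings can be mapped onto a topology, the software has to know which GPU pairs can talk directly. A small CUDA sketch of that first step (my own illustration, not NCCL's detection code):

#include <stdio.h>
#include <cuda_runtime.h>

// Report which GPU pairs support direct peer-to-peer access (NVLink or PCIe P2P).
int main(void) {
  int n = 0;
  cudaGetDeviceCount(&n);
  for (int a = 0; a < n; ++a)
    for (int b = 0; b < n; ++b) {
      if (a == b) continue;
      int p2p = 0;
      cudaDeviceCanAccessPeer(&p2p, a, b);
      printf("GPU%d -> GPU%d : %s\n", a, b, p2p ? "direct P2P" : "staged through host");
    }
  return 0;
}

A topology-aware library additionally weighs how those links are arranged (switches, CPU sockets, NVLink vs. PCIe) when it orders the devices into rings.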

Page 63

INTRODUCING NCCL (“NICKEL”):
ACCELERATED COLLECTIVES FOR MULTI-GPU SYSTEMS

Page 64

INTRODUCING NCCL

GOAL:

• Build a research library of accelerated collectives that is easily integrated and topology-aware so as to improve the scalability of multi-GPU applications

APPROACH:

• Pattern the library after MPI’s collectives

• Handle the intra-node communication in an optimal way

• Provide the necessary functionality for MPI to build on top to handle inter-node

Accelerating multi-GPU collective communications
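One way to read the last bullet is the following layering sketch (my own hedged illustration, not an API the slides define; the function and buffer names are hypothetical and error checking is omitted): NCCL reduces within the node, MPI combines the node results, and the global result is pushed back to the local GPUs.

#include <mpi.h>
#include <nccl.h>
#include <cuda_runtime.h>

// Hierarchical all-reduce: NCCL inside the node + MPI across nodes.
// Assumes one process per node driving ngpus local GPUs, with d_send/d_recv
// device buffers of N doubles per GPU and a host buffer h_result of N doubles.
void hierarchical_allreduce(double** d_send, double** d_recv, double* h_result,
                            size_t N, int ngpus, ncclComm_t* comms,
                            cudaStream_t* streams, MPI_Comm internode) {
  // 1) Intra-node all-reduce over the local GPUs.
  for (int g = 0; g < ngpus; ++g) {
    cudaSetDevice(g);
    ncclAllReduce(d_send[g], d_recv[g], N, ncclDouble, ncclSum, comms[g], streams[g]);
  }
  for (int g = 0; g < ngpus; ++g) { cudaSetDevice(g); cudaStreamSynchronize(streams[g]); }

  // 2) Inter-node all-reduce on the host (one GPU's copy suffices; all are identical).
  cudaSetDevice(0);
  cudaMemcpy(h_result, d_recv[0], N * sizeof(double), cudaMemcpyDeviceToHost);
  MPI_Allreduce(MPI_IN_PLACE, h_result, (int)N, MPI_DOUBLE, MPI_SUM, internode);

  // 3) Push the global result back to every local GPU.
  for (int g = 0; g < ngpus; ++g) {
    cudaSetDevice(g);
    cudaMemcpy(d_recv[g], h_result, N * sizeof(double), cudaMemcpyHostToDevice);
  }
}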

Page 65

NCCL FEATURES AND FUTURES

Collectives
• Broadcast
• All-Gather
• Reduce
• All-Reduce
• Reduce-Scatter
• Scatter
• Gather
• All-To-All
• Neighborhood

Key Features
• Single-node, up to 8 GPUs

• Host-side API

• Asynchronous/non-blocking interface

• Multi-thread, multi-process support

• In-place and out-of-place operation

• Integration with MPI

• Topology Detection

• NVLink & PCIe/QPI* support

(Green = Currently available)

Page 66

NCCL IMPLEMENTATION

Implemented as monolithic CUDA C++ kernels combining the following:

• GPUDirect P2P Direct Access

• Three primitive operations: Copy, Reduce, ReduceAndCopy

• Intra-kernel synchronization between GPUs

• One CUDA thread block per ring-direction
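To give a flavor of what such a primitive looks like (a much-simplified sketch of my own, not NCCL's actual kernels): once peer access is enabled with cudaDeviceEnablePeerAccess, a kernel running on one GPU can dereference a pointer to another GPU's memory, so Copy and Reduce are essentially loops over a peer pointer.

// Grid-stride "Reduce" primitive: accumulate a buffer living on a peer GPU into a
// local buffer. The "Copy" primitive is the same loop with = instead of +=.
__global__ void reduce_from_peer(const double* __restrict__ peer_src,  // on another GPU
                                 double* __restrict__ local, size_t n) {
  for (size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
       i += (size_t)gridDim.x * blockDim.x)
    local[i] += peer_src[i];
}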

Page 67

NCCL EXAMPLE

#include <nccl.h>

ncclComm_t comm[4];
int devs[4] = {0, 1, 2, 3};
ncclCommInitAll(comm, 4, devs);

for (int g = 0; g < 4; ++g) {   // or launch one thread per GPU
  cudaSetDevice(g);
  double *d_send, *d_recv;
  // allocate d_send, d_recv; fill d_send with data
  // stream[g]: a cudaStream_t created for device g
  ncclAllReduce(d_send, d_recv, N, ncclDouble, ncclSum, comm[g], stream[g]);
  // consume d_recv
}

All-reduce
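Filling in the elided pieces, a complete single-process version might look like the sketch below (my own expansion of the slide's example; the buffer size and the use of ncclGroupStart/ncclGroupEnd, which newer NCCL versions expect when one thread drives several GPUs, are assumptions; error checking is omitted).

#include <cuda_runtime.h>
#include <nccl.h>

#define NGPU  4
#define COUNT (1 << 20)

int main(void) {
  ncclComm_t comms[NGPU];
  int devs[NGPU] = {0, 1, 2, 3};
  ncclCommInitAll(comms, NGPU, devs);

  double* d_send[NGPU];
  double* d_recv[NGPU];
  cudaStream_t stream[NGPU];
  for (int g = 0; g < NGPU; ++g) {
    cudaSetDevice(g);
    cudaMalloc((void**)&d_send[g], COUNT * sizeof(double));
    cudaMalloc((void**)&d_recv[g], COUNT * sizeof(double));
    cudaStreamCreate(&stream[g]);
    // fill d_send[g] with this GPU's data here
  }

  // Issue one all-reduce per GPU from this single thread.
  ncclGroupStart();            // recommended in NCCL 2.x when one thread drives all GPUs
  for (int g = 0; g < NGPU; ++g)
    ncclAllReduce(d_send[g], d_recv[g], COUNT, ncclDouble, ncclSum, comms[g], stream[g]);
  ncclGroupEnd();

  for (int g = 0; g < NGPU; ++g) {
    cudaSetDevice(g);
    cudaStreamSynchronize(stream[g]);   // d_recv[g] now holds the sum across all GPUs
  }

  for (int g = 0; g < NGPU; ++g) {
    cudaSetDevice(g);
    cudaFree(d_send[g]);
    cudaFree(d_recv[g]);
    cudaStreamDestroy(stream[g]);
    ncclCommDestroy(comms[g]);
  }
  return 0;
}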

Page 68

NCCL PERFORMANCE
Bandwidth at different problem sizes (4 Maxwell GPUs)

[Charts: All-Gather, All-Reduce, Reduce-Scatter, Broadcast]

Page 69

AVAILABLE NOW
github.com/NVIDIA/nccl

Page 70

THANKS TO MY COLLABORATORS

Nathan Luehr

Jonas Lippuner

Przemek Tredak

Sylvain Jeaugey

Natalia Gimelshein

Simon Layton

This research is funded in part by the U.S. DOE DesignForward program
