
Page 1:

GPUrdma: GPU-side library for high performance networking from GPU kernels

Feras Daoud
Technion – Israel Institute of Technology

Mark Silberstein
Technion

Amir Watad
Technion

Page 2:

Agenda

1. Introduction
2. InfiniBand Background
3. GPUrdma
4. GPUrdma Evaluation
5. GPI2

Page 3:

What: A GPU-side library for performing RDMA directly from GPU kernels

Why: To improve communication performance between distributed GPUs

Results: 5 µsec GPU-to-GPU communication latency and up to 50 Gbps transfer bandwidth

Page 4:

Evolution of GPU-HCA interaction:

Naive version: [Diagram: GPU, GPU RAM, CPU, CPU RAM, HCA; both the control path and the data path run through the CPU and CPU RAM.]

Page 5:

Evolution of GPU-HCA interaction:

GPUDirect RDMA: [Diagram: the naive version next to GPUDirect RDMA, which adds a direct data path between the HCA and GPU RAM; the control path still runs through the CPU.]

Page 6:

Evolution of GPU-HCA interaction:

GPUrdma: [Diagram: naive version, GPUDirect RDMA, and GPUrdma side by side. GPUrdma keeps the direct data path and adds a direct control path, so the GPU drives the HCA without the CPU.]

Page 7:

Motivations

GPUDirect RDMA node:

CPU_rdma_read()
GPU_kernel<<<>>> {
    GPU_Compute()
}
CPU_rdma_write()

Page 8:

Motivations

GPUDirect RDMA node:

CPU_rdma_read()
GPU_kernel<<<>>> {
    GPU_Compute()
}
CPU_rdma_write()

Issue: CPU overhead.

Page 9:

Motivations

GPUDirect RDMA node:

CPU_rdma_read()
GPU_kernel<<<>>> {
    GPU_Compute()
}
CPU_rdma_write()

Issue: bulk-synchronous design and explicit pipelining: the CPU must interleave the communication and computation phases by hand.

Page 10:

Motivations

GPUDirect RDMA node: multiple GPU kernel invocations, one per batch:

CPU_rdma_read()
GPU_kernel<<<>>> {
    GPU_Compute()
}
CPU_rdma_write()

(... repeated for every batch ...)

Issues (see the sketch below):
1. Kernel launch overhead
2. Inefficient shared memory usage (on-chip state cannot persist across kernel invocations)
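
A minimal sketch of the CPU-driven loop these slides criticize. Here rdma_read_into_gpu(), rdma_write_from_gpu(), and wait_completion() are hypothetical wrappers around the standard ibv_post_send()/ibv_poll_cq() calls, not GPUrdma API:

/* CPU orchestrates every transfer and kernel launch, batch by batch. */
for (int batch = 0; batch < num_batches; ++batch) {
    rdma_read_into_gpu(qp, gpu_buf, batch);     // communication
    wait_completion(cq);                        // CPU blocks on the CQE
    GPU_Compute<<<blocks, threads>>>(gpu_buf);  // computation
    cudaDeviceSynchronize();                    // CPU blocks on the kernel
    rdma_write_from_gpu(qp, gpu_buf, batch);    // communication
    wait_completion(cq);
}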

Page 11:

Motivations

GPUDirect RDMA node:

CPU_rdma_read()
GPU_kernel<<<>>> {
    Find_Even_Num()
}
CPU_rdma_write()

Issue: sparse data. Example: marking the even entries of an input vector:

Input:   0 3 5 2 5 1 8
Mask:    1 0 0 1 0 0 1
Offsets: 0x0, 0x3, 0x6

Only a few offsets are actually needed, yet the CPU-driven flow must transfer the whole buffer.

Page 12:

GPUrdma library

GPUrdma node:

GPU_kernel<<<>>> {
    GPU_rdma_read()
    GPU_Compute()
    GPU_rdma_write()
}

• No CPU intervention
• Overlapping communication and computation
• One kernel call
• Efficient shared memory usage
• Send sparse data directly from the kernel
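
For contrast with the sketch on page 10, the same workload in the GPUrdma style: one persistent kernel that drives both computation and the HCA. gpu_rdma_read()/gpu_rdma_write() and compute() stand in for the GPU-side library calls; the names are illustrative, not the exact API:

__global__ void worker(float *buf, int num_batches) {
    // Shared memory and register state survive across batches because
    // the kernel is launched only once.
    __shared__ float scratch[256];
    for (int batch = 0; batch < num_batches; ++batch) {
        gpu_rdma_read(buf, batch);    // fetch input, no CPU involvement
        compute(buf, scratch);        // illustrative per-batch computation
        gpu_rdma_write(buf, batch);   // push results straight from the kernel
    }
}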

Page 13:

Agenda

1. Introduction
2. InfiniBand Background
3. GPUrdma
4. GPUrdma Evaluation
5. GPI2

Page 14:

InfiniBand Background

1. Queue pair buffer (QP)
   • Send queue
   • Receive queue
2. Work queue element (WQE)
   • Contains communication instructions
3. Completion queue buffer (CQ)
   • Contains completion elements
4. Completion queue element (CQE)
   • Contains information about completed jobs

[Diagram: tasks enter through the verbs API as WQEs in the QP buffer; the HCA executes them and posts CQEs to the CQ buffer.]

Page 15:

InfiniBand Background

1. Write a work queue element to the QP buffer
2. Ring the doorbell
3. Check the completion queue element status

Ring the doorbell to execute jobs:
• An MMIO address
• Informs the HCA about new jobs

[Diagram: QP, CQ, and data in CPU memory; the CPU drives the control path, the HCA moves the data.]
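
The same three steps in ordinary host-side verbs code, as a minimal sketch. Connection setup and error handling are omitted; ctx, pd, mr, buf, len, remote_addr, and remote_key are assumed to come from that setup:

#include <stdint.h>
#include <infiniband/verbs.h>

struct ibv_cq *cq = ibv_create_cq(ctx, 16, NULL, NULL, 0);
struct ibv_qp_init_attr init = {
    .send_cq = cq, .recv_cq = cq, .qp_type = IBV_QPT_RC,
    .cap = { .max_send_wr = 16, .max_recv_wr = 16,
             .max_send_sge = 1, .max_recv_sge = 1 },
};
struct ibv_qp *qp = ibv_create_qp(pd, &init);

/* Step 1: describe the job as a work queue element. */
struct ibv_sge sge = { .addr = (uintptr_t)buf, .length = len, .lkey = mr->lkey };
struct ibv_send_wr wr = {
    .sg_list = &sge, .num_sge = 1,
    .opcode = IBV_WR_RDMA_WRITE, .send_flags = IBV_SEND_SIGNALED,
    .wr.rdma = { .remote_addr = remote_addr, .rkey = remote_key },
};
struct ibv_send_wr *bad;

/* Steps 1+2: ibv_post_send() writes the WQE and rings the doorbell. */
ibv_post_send(qp, &wr, &bad);

/* Step 3: poll the CQ until the completion element arrives. */
struct ibv_wc wc;
while (ibv_poll_cq(cq, 1, &wc) == 0)
    ;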

Page 16:

Agenda

1. Introduction
2. InfiniBand Background
3. GPUrdma
4. GPUrdma Evaluation
5. GPI2

Page 17:

GPUrdma Node

• Direct path for data exchange
• Direct HCA control from GPU kernels
• No CPU intervention

[Diagram: native GPU node: QP, CQ, and data all in GPU memory; the GPU drives the control path, and the data path runs directly between the HCA and GPU memory.]

Page 18:

GPUrdma Implementation

Starting point: [Diagram: QP, CQ, and data in CPU memory; the data path runs between the HCA and CPU memory, with the GPU uninvolved.]

Page 19:

GPUrdma Implementation

Data path, via GPUDirect RDMA: [Diagram: the data buffer moves to GPU memory, so the data path runs directly between the HCA and GPU memory; QP and CQ remain in CPU memory.]

Page 20:

GPUrdma Implementation

1. Move the QP and CQ to GPU memory

[Diagram: QP, CQ, and data now all reside in GPU memory.]

Page 21:

GPUrdma Implementation

1. Move the QP and CQ to GPU memory

Modified InfiniBand verbs:
• ibv_create_qp()
• ibv_create_cq()
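
Conceptually (a sketch of the idea, not the actual patch): the modified calls place the ring buffers in GPU memory instead of host memory and make that memory visible to the HCA. register_gpu_buffer_with_hca() below is a hypothetical placeholder for the registration the modified library performs internally:

void *qp_buf;
cudaMalloc(&qp_buf, qp_buf_size);        // QP work queue now lives in GPU RAM
// Hypothetical stand-in for the library change: register the GPU buffer
// with the HCA (GPUDirect peer memory) so it can DMA WQEs from it.
register_gpu_buffer_with_hca(qp_buf, qp_buf_size);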

Page 22:

GPUrdma Implementation

2. Map the HCA doorbell address into the GPU address space

Page 23:

GPUrdma Implementation

2. Map the HCA doorbell address into the GPU address space

Requires modifying the NVIDIA driver.
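
The slides patch the driver directly; as a point of reference, later CUDA releases expose a similar mapping through the cudaHostRegisterIoMemory flag. A sketch, assuming db_mmio is the doorbell page already mapped into the process by the verbs provider:

cudaHostRegister(db_mmio, page_size, cudaHostRegisterIoMemory);
void *db_dev;
cudaHostGetDevicePointer(&db_dev, db_mmio, 0);
// A kernel can now ring the doorbell with an ordinary store:
// *(volatile uint64_t *)db_dev = doorbell_record;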

Page 24:

Agenda

1. Introduction
2. InfiniBand Background
3. GPUrdma
4. GPUrdma Evaluation
5. GPI2

Page 25:

GPUrdma Evaluation

• Single QP
• Multiple QPs
• Scalability: optimal QP/CQ location

Hardware: NVIDIA Tesla K40c GPU, Mellanox Connect-IB HCA

Page 26:

GPUrdma – 1 thread, 1 QP

• Best performance: CPU controller vs. GPU controller

[Plot: throughput of the CPU controller vs. the unoptimized GPU controller.]

Page 27:

GPUrdma – 1 thread, 1 QP

• GPU controller: optimize doorbell rings

[Plot: CPU controller vs. GPU controller vs. doorbell-optimized GPU controller; the doorbell optimization gives roughly a 3x improvement.]
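
The likely shape of this optimization, as a hedged sketch: batch several WQEs and issue a single MMIO doorbell write for the whole batch, since each doorbell ring from the GPU is expensive. wqe_t, write_wqe(), and ring_doorbell() are illustrative names, not the GPUrdma API:

__device__ void post_batch(wqe_t *sq, int n) {
    for (int i = 0; i < n; ++i)
        write_wqe(&sq[i]);       // fill work queue entries in GPU memory
    __threadfence_system();      // make the WQEs visible to the HCA first
    ring_doorbell(n);            // one doorbell MMIO store for the batch
}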

Page 28:

GPUrdma – 1 thread, 1 QP

• GPU controller: optimize CQ polling

[Plot: CPU controller vs. doorbell-optimized vs. CQ-optimized GPU controller.]
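
A sketch of a GPU-side CQ poll under the same caveats: spin on the CQE status byte with volatile loads so the compiler re-reads GPU memory each iteration, then fence before consuming the payload:

__device__ void poll_cqe(volatile unsigned char *cqe_status, unsigned char done) {
    while (*cqe_status != done)
        ;                 // the HCA flips this byte with a DMA write
    __threadfence();      // order subsequent data reads after the flag
}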

Page 29:

GPUrdma – 32 threads, 1 QP

• GPU controller: write jobs in parallel

[Plot: CPU controller vs. CQ-optimized controller vs. parallel-writes controller.]

Page 30:

GPUDirect RDMA

• CPU controller

[Plot: CPU controller with 30 QPs vs. CPU controller with 1 QP.]

Page 31:

GPUrdma – 30 QPs

• 1 QP per block vs. 30 QPs per block

[Plot: GPUrdma with 30 QPs per block vs. 1 QP per block vs. the CPU controller with 30 QPs; peak throughput reaches 50 Gbps, about 3x the CPU controller.]

Page 32:

Scalability – Optimal QP/CQ location:

1. QP and CQ in GPU memory
2. QP in GPU memory, CQ in system memory
3. CQ in GPU memory, QP in system memory
4. QP and CQ in system memory

[Diagram: configuration 1: QP, CQ, and data in GPU memory.]

Page 33:

Scalability – Optimal QP/CQ location:

1. QP and CQ in GPU memory
2. QP in GPU memory, CQ in system memory
3. CQ in GPU memory, QP in system memory
4. QP and CQ in system memory

[Diagram: configuration 1: QP, CQ, and data in GPU memory.]

Page 34:

Scalability – Optimal QP/CQ location:

1. QP and CQ in GPU memory
2. QP in GPU memory, CQ in system memory
3. CQ in GPU memory, QP in system memory
4. QP and CQ in system memory

[Diagram: configuration 2: QP and data in GPU memory, CQ in system memory.]

Page 35:

Scalability – Optimal QP/CQ location:

1. QP and CQ in GPU memory
2. QP in GPU memory, CQ in system memory
3. CQ in GPU memory, QP in system memory
4. QP and CQ in system memory

[Diagram: configuration 4: data in GPU memory, QP and CQ in system memory.]

Page 36:

Optimal QP/CQ location:

o Throughput: no difference between configurations
o Latency: transfer latency [µsec]:

            | QP in CPU | QP in GPU
  CQ in CPU |    8.6    |    6.2
  CQ in GPU |    6.8    |    4.8

Page 37:

Limitations

GPUDirect RDMA, CUDA v7.5:

A running kernel may observe STALE DATA or data that arrives OUT-OF-ORDER.

Scenario: intensive RDMA writes to GPU memory.

Good news: NVIDIA announced a CUDA 8 feature that enables consistent updates.

Suggested fix: a CRC32 integrity-check API for error detection (sketched below).
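
A sketch of such a check; crc32_device() is an illustrative helper, not a published GPUrdma API. The sender appends a CRC32 of the payload, and the receiving kernel re-reads the buffer until payload and checksum agree:

__device__ void wait_until_consistent(volatile unsigned int *payload, int words,
                                      volatile unsigned int *crc_slot) {
    while (crc32_device(payload, words) != *crc_slot)
        ;   // retry: guards against stale or out-of-order DMA writes
}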

Page 38:

Agenda

1. Introduction
2. InfiniBand Background
3. GPUrdma
4. GPUrdma Evaluation
5. GPI2

Page 39:

GPI2 for GPUs:

GPI: a framework implementing a Partitioned Global Address Space (PGAS).
GPI2: extends this global address space to GPU memory.

[Diagram: on each node, threads access global and local memory segments on the host; GPI2 adds global and local segments in GPU memory.]
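
For orientation, GASPI segment creation in GPI-2 looks roughly like this; argument values follow the GASPI spec, and the GPU-memory variant used on the next slides is a GPI2 extension whose exact flag is omitted here (see the GPI-2 headers):

#include <GASPI.h>

gaspi_proc_init(GASPI_BLOCK);
// Create segment 0 of 'size' bytes, visible to all ranks.
gaspi_segment_create(0, size, GASPI_GROUP_ALL, GASPI_BLOCK,
                     GASPI_MEM_INITIALIZED);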

Page 40:

GPI2 code example

CPU node:
    gaspi_segment_create(CPU_MEM)
    initialize data
    gaspi_write_notify
    gaspi_notify_waitsome
    gaspi_proc_term

GPU node:
    gaspi_segment_create(GPU_MEM)
    gaspi_notify_waitsome
    GPU_Compute_data<<<>>>
    gaspi_write_notify
    gaspi_proc_term

Page 41:

GPI2 using GPUrdma

CPU node:
    gaspi_segment_create(CPU_MEM)
    initialize data
    gaspi_write_notify
    gaspi_notify_waitsome
    gaspi_proc_term

GPU node:
    gaspi_segment_create(GPU_MEM)
    GPU_start_kernel<<<>>> {
        gpu_gaspi_notify_waitsome
        Compute_data()
        gpu_gaspi_write_notify
    }
    gaspi_proc_term

Page 42:

GPUrdma Multi-Matrix vector product

CPU node:
    start timer
    gaspi_write_notify
    gaspi_notify_waitsome
    stop timer

GPU node:
    gpu_notify_waitsome
    Matrix_compute()
    gpu_write_notify

• System throughput in millions of 32x1 vector multiplications per second, as a function of batch size:

  Batch size [vectors] | GPI2 | GPUrdma
  480                  |  2.6 |  11.7
  960                  |  4.8 |  18.8
  1920                 |  8.4 |  25.2
  3840                 | 13.9 |  29.1
  7680                 | 19.9 |  30.3
  15360                | 24.3 |  31.5

Page 43:

Related works

Lena Oden, Fraunhofer Institute for Industrial Mathematics:
• Infiniband-Verbs on GPU: A case study of controlling an Infiniband network device from the GPU
• Analyzing Put/Get APIs for Thread-collaborative Processors

Mark Silberstein, Technion – Israel Institute of Technology:
• GPUnet: networking abstractions for GPU programs
• GPUfs: Integrating a file system with GPUs

Thanks