Design of a Virtualization Framework to Enable GPU Sharing in Cluster Environments. Kittisak Sajjapongse, Michela Becchi. University of Missouri, nps.missouri.edu



Page 1:

Design of a Virtualization Framework to Enable GPU Sharing in Cluster Environments

Kittisak Sajjapongse, Michela Becchi
University of Missouri

nps.missouri.edu

Page 2:

GPUs in Clusters & Clouds

• Many-core GPUs are used in supercomputers
  – 3 out of the top 10 supercomputers use GPUs
  – Titan: > 20 petaflops, > 700 terabytes of memory, 18,688 nodes, each with:
    • a 16-core AMD CPU
    • 1 Nvidia Tesla K20 GPU
• Many-core GPUs are used in cloud computing

2 Kittisak Sajjapongse

Page 3:

Different usage paradigms

Accelerator model vs. cluster/cloud model:
• 1 application vs. multi-tenancy
• GPU as a dedicated resource vs. GPU as a shared resource
• Explicit procurement of GPUs vs. resource virtualization & transparency
• Static (or programmer-defined) binding of applications to GPUs vs. dynamic (or runtime) binding of applications to GPUs → better resource utilization and load balancing
• Intra-application scheduling vs. intra- and inter-application scheduling
• Memory management within an application vs. advanced memory management across applications

Page 4:

Context

[diagram: cluster nodes running GPU-accelerated applications such as AMBER, GROMACS, NAMD, GPUBlast, and LAMMPS]

Page 5:

We have designed a runtime that…

• Abstracts GPUs from end-users
• Schedules applications on GPUs
• Dynamically binds applications to GPUs
• Allows GPU sharing
• Provides memory management
• Provides dynamic recovery and load balancing in case of GPU failure/upgrade/downgrade

Page 6:

Deployment scenarios

• With cluster-level schedulers, e.g., TORQUE, SLURM

[diagram: on each node, CUDA apps call into an intercept library that forwards to our runtime, which sits on top of the CUDA driver/runtime, the OS, and GPUs GPU1..GPUn; a cluster-level scheduler places jobs across nodes]

• With VM-based systems for cloud computing, e.g., Eucalyptus

[diagram: CUDA apps run in guest OSes (VM1..VMn) with per-VM intercept libraries, placed by a VM manager; each host OS runs our runtime over the CUDA driver/runtime and GPUs GPU1..GPUn]
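The intercept library is what makes both deployments transparent: it interposes on the CUDA API and forwards each call to the runtime before handing it to the vendor implementation. A minimal sketch of that interposition pattern in C, with a stubbed `real_cudaMalloc` standing in for what `dlsym(RTLD_NEXT, "cudaMalloc")` would resolve in a real LD_PRELOAD interposer (the stub, the fake heap, and the counter are illustrative, not part of the framework):

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the vendor entry point that dlsym(RTLD_NEXT, "cudaMalloc")
 * would resolve in a real LD_PRELOAD interposer against libcudart. */
static int real_cudaMalloc(void **ptr, size_t size) {
    static char heap[1 << 20];                /* fake device heap */
    static size_t used = 0;
    if (used + size > sizeof heap) return 2;  /* out-of-memory error code */
    *ptr = heap + used;
    used += size;
    return 0;                                 /* success */
}

static int forwarded_calls = 0;               /* runtime-side bookkeeping */

/* Interposed entry point: notify the runtime (here just a counter,
 * in the framework e.g. a page-table registration), then forward. */
int cudaMalloc(void **ptr, size_t size) {
    forwarded_calls++;
    return real_cudaMalloc(ptr, size);
}

int intercepted_count(void) { return forwarded_calls; }
```

Because the application links against the interposed symbols, it needs no source changes; the runtime sees every allocation, copy, and launch.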

Page 7:

GPU sharing

• Inter-kernel sharing [HPDC’11]
  – When: GPU underutilized within a kernel
  – Why: limited parallelism, small datasets
  – How: kernel consolidation across applications

• Inter-application sharing [HPDC’12]
  – When: GPU underutilized within an application
  – Why: long CPU phases
  – How: application multiplexing on the GPU

[timelines: with inter-kernel sharing, the GPU runs k1 alone vs. k1 and k2 consolidated; with inter-application sharing, app2's GPU phases fill the slots left idle by app1's CPU phases]

Page 8:

GPU sharing (cont’d)

• Multi-process application sharing [HPDC’13]
  – When: GPU underutilized by multi-process applications (e.g., MPI)
  – Why: synchronization leads to intra- & inter-application imbalance
  – How: preempt some inactive processes to allow other processes to progress

[timelines: on GPU 0 and GPU 1, preempting inactive processes A0/A1 lets processes B0/B1 of another application run in the idle gaps]

Page 9:

Inter-kernel sharing [HPDC’11]

[timelines: serialized execution runs app1 (m, cHD, k1, cDH, f) and then app2 (m, cHD, k2, cDH, f); inter-kernel sharing overlaps the two sequences and launches a combined k1 & k2 kernel (m: malloc, cHD/cDH: host-to-device/device-to-host copy, k: kernel, f: free)]

app1 and app2 have no conflicting memory requirements
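Kernel consolidation can be pictured as a single combined launch whose blocks are routed to the body of k1 or k2 depending on their index. A host-side C sketch of that routing (the per-block functions and the sequential loop are illustrative stand-ins for a GPU grid where `blockIdx.x` does the selection):

```c
#include <assert.h>

/* Original "kernels", modeled as per-block functions on the host. */
static void k1_block(int b, int *out1) { out1[b] = 10 + b; }
static void k2_block(int b, int *out2) { out2[b] = 20 + b; }

/* Consolidated launch: one grid of n1 + n2 blocks; each block checks
 * its index and runs the body of k1 or k2. On a GPU this is a single
 * combined kernel whose block index selects the original kernel. */
static void combined_launch(int n1, int n2, int *out1, int *out2) {
    for (int b = 0; b < n1 + n2; b++) {
        if (b < n1) k1_block(b, out1);
        else        k2_block(b - n1, out2);
    }
}
```

The combined launch occupies the GPU with the blocks of both applications at once, which is exactly why it only works when their memory requirements do not conflict.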

Page 10:

Space- vs. time-sharing: some results

[bar chart: relative throughput benefit (y-axis, 1.3 to 2.1) per workload mix BS+KM, BO+KNN, PDE+MD, EU+IP, BS+BO, KM+KNN, BO+EU, BS+MD, split between space-sharing and time-sharing configurations and run as four batches over GPU1 and GPU2]

Page 11:

Molding

• Idea:
  – Downgrade the execution configuration of kernels so as to force beneficial sharing
  – Penalize a single application to improve overall throughput
    • Limiting # blocks → force space sharing
    • Limiting # threads/block → force time sharing with interleaved execution

[diagram: kernel1 is downgraded to 3 blocks and kernel2 to 2 blocks so that kernel1 and kernel2 can space-share the GPU after molding]
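The block-limiting form of molding can be sketched as a small policy that caps the two kernels' resident block counts so they fit the SMs together. The proportional split below is an illustrative policy of my own, not the paper's algorithm:

```c
#include <assert.h>

/* Mold two kernels' grid sizes so their resident blocks together do
 * not exceed the number of SMs, forcing space sharing. Each kernel
 * then iterates its original work over fewer blocks. */
static void mold(int blocks1, int blocks2, int num_sms,
                 int *molded1, int *molded2) {
    if (blocks1 + blocks2 <= num_sms) {     /* already fits: no change */
        *molded1 = blocks1;
        *molded2 = blocks2;
        return;
    }
    /* Split the SMs proportionally to the original grid sizes. */
    *molded1 = num_sms * blocks1 / (blocks1 + blocks2);
    if (*molded1 < 1) *molded1 = 1;
    *molded2 = num_sms - *molded1;
}
```

A molded kernel runs slower in isolation (fewer blocks), which is the per-application penalty the slide mentions; the gain comes from the second kernel running on the freed SMs.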

Page 12:

Molding: some results

• Molding can improve overall throughput despite penalizing single applications

[bar chart: relative throughput benefit (0 to 1.8), no molding vs. molding, for workload mixes IP+BS, PDE+MD, IP+BO, and BS+KM, covering forced time-sharing and forced space-sharing cases on GPU1 and GPU2]

Page 13:

Inter-application sharing [HPDC’12]

[timelines: serialized execution runs app1 (m, cpu, cHD, k11, k12, cDH, f, cpu) and then app2 (m, cpu, cHD, k21, k22, k23, cDH, f, interleaved with cpu phases); with sharing and no conflicting memory requirements, app2's GPU phases overlap app1's CPU phases; with conflicting memory requirements, additional CPU→GPU and GPU→CPU transfers swap app1's data out and back in around app2's kernels]

Page 14:

Our runtime: node-level view

[architecture: applications app1..appN link an intercept library and reach our runtime through a connection manager with offload control (node-to-node offloading); each GPU is fronted by k virtual GPUs (vGPU11..vGPUnk) and a per-GPU dispatcher holding waiting, assigned, and failed contexts; a memory manager maintains a page table and a swap area on top of the CUDA driver/runtime]

• Virtual GPUs → abstraction, GPU sharing
• Dispatcher → scheduling, GPU binary registration
• Memory manager → virtual memory handling

Page 15:

Mapping and scheduling (FCFS)

[diagram: app1 (threads t1, t2), app2 (t1–t3), and app3 (t1–t3) create contexts c11–c33 through the FE (intercept) library and the connection manager with offload control; the dispatcher holds waiting, assigned, and failed contexts and assigns waiting contexts in FCFS order to vGPUs vGPU11–vGPU32, two per GPU, over GPU1–GPU3; the memory manager keeps the page table and swap area]
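The FCFS policy in this diagram can be illustrated with a small dispatcher in C; the slot-based `dispatcher_t` structure below is a hypothetical simplification of the runtime's waiting/assigned context lists:

```c
#include <assert.h>

#define NUM_VGPUS 6            /* e.g., 3 GPUs x 2 vGPUs each */

/* FCFS dispatcher: contexts are assigned, in arrival order, to the
 * first free vGPU; -1 marks a free vGPU slot. */
typedef struct {
    int assigned[NUM_VGPUS];   /* context id per vGPU, -1 if free */
} dispatcher_t;

/* Returns the vGPU the context was bound to, or -1 if all vGPUs are
 * busy and the context must keep waiting. */
static int dispatch(dispatcher_t *d, int ctx_id) {
    for (int v = 0; v < NUM_VGPUS; v++)
        if (d->assigned[v] == -1) {
            d->assigned[v] = ctx_id;
            return v;
        }
    return -1;
}

/* On completion, the vGPU is freed and the next waiting context can
 * be dispatched to it. */
static void complete(dispatcher_t *d, int vgpu) { d->assigned[vgpu] = -1; }
```

This matches the slide's behavior: contexts beyond the vGPU count stay in the waiting list until a completion frees a slot.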

Page 16:

Mapping and scheduling (FCFS) (cont’d)

[diagram: contexts c11–c31 are bound to the six vGPUs; when a context completes, the next waiting context (c32) is assigned to the freed vGPU]

• Hardware configuration and application-GPU mapping are abstracted from end-users
• Time-sharing of GPUs

Page 17:

Delayed binding

[diagram: app1 issues malloc1, copyHD11, copyHD12, kernel1, copyDH1 and app2 issues malloc2, copyHD2, kernel2, copyDH2; the memory manager records data d1 and d2 in the page table and swap area, and contexts c1 and c2 are bound to vGPU11/vGPU12 only when needed]

• Deferral of application-GPU mapping
  – Better scheduling decisions
  – GPU memory allocation when needed
• Memory manager in the runtime
  – Dynamic binding
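Delayed binding means that intercepted `malloc` and `copyHD` calls only update runtime state, and the device is touched for the first time at kernel launch. A sketch of that lazy path in C (the `pte_t` fields and `rt_*` helpers are illustrative names, and the stub comments mark where the real runtime would call CUDA):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Delayed binding: malloc/copyHD only update the runtime's page table
 * and swap area; device memory is allocated at kernel launch. */
typedef struct {
    size_t size;
    char   swap[256];       /* host-side staging copy (swap area) */
    int    on_device;       /* device buffer actually allocated? */
} pte_t;

static void rt_malloc(pte_t *e, size_t size) {   /* no cudaMalloc yet */
    e->size = size;
    e->on_device = 0;
}

static void rt_copyHD(pte_t *e, const void *src) {  /* stage into swap */
    memcpy(e->swap, src, e->size);
}

static void rt_launch(pte_t *e) {    /* bind now: allocate + transfer */
    if (!e->on_device)
        e->on_device = 1;   /* would call cudaMalloc + cudaMemcpyHD here */
}
```

Until `rt_launch` runs, the scheduler is free to bind the context to any GPU, which is what enables the better scheduling decisions listed above.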

Page 18:

Dynamic binding & swapping

[diagram: app1–app4 issue malloc/copyHD/kernel sequences; contexts c1 and c2 share GPU1 (vGPU11, vGPU12) and c3 and c4 share GPU2 (vGPU21, vGPU22); when GPU1's memory is full, the memory manager swaps d1 out through the swap area to make room for other data]

• GPU sharing among applications with conflicting memory requirements
• Migration of applications from slower to faster GPUs
• High availability in case of GPU failure
• Load balancing in case of GPU upgrade/downgrade

Page 19:

Experiments: sharing & swapping

• 2 Tesla C2050 and 1 Tesla C1060 GPUs
• 36 matmul jobs with 5 kernel calls and varying CPU phases
• Sharing increases performance by hiding CPU phases

[line chart: total execution time (0–500 sec) vs. fraction of CPU code (0–2) for serialized execution (1 vGPU) and GPU sharing (4 vGPUs)]

Page 20:

Experiments: cluster w/ TORQUE

• 3-node cluster with 2 GPU nodes (2 Tesla C2050s and 1 C1060)
• >2× performance improvement due to sharing, a further 20% due to offloading

[bar chart: total and average execution time (0–1400 sec) for 16, 32, and 48 jobs under serialized execution, GPU sharing (4 vGPUs), and GPU sharing + load balancing]

Page 21:

Load imbalance in multi-process applications [HPDC’13]

• Causes of load imbalance
  – Intrinsic load imbalance (intra-application)
  – Different GPU capabilities (intra-application)
  – Mismatch between the number of GPUs and the number of processes (inter-application)
  – Synchronization among processes

Page 22:

Intra-application Imbalance

Page 23:

Inter-application Imbalance

Page 24:

Preemption policies

• Maximum idle time-driven preemption
  – Preempt a context (process) whenever it does not utilize the GPU for a predefined amount of time
  – PROS: easy implementation
  – CONS: the maximum idle time parameter must be set
• Synchronization call-driven preemption
  – Preempt a context (process) whenever a collective communication or synchronization call is serviced
  – PROS: no parameter setting needed
  – CONS: either bookkeeping is required (complex implementation, overhead) or unnecessary preemptions occur (when the last process enters the synchronization point)
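The idle time-driven policy reduces to a simple check against a timestamp the runtime already keeps per context. A sketch in C (the `context_t` fields and tick-based clock are illustrative; the framework tracks real GPU activity):

```c
#include <assert.h>

/* Maximum idle time-driven preemption: preempt a context when it has
 * issued no GPU work for longer than max_idle. Times are in arbitrary
 * ticks. */
typedef struct {
    long last_gpu_activity;   /* timestamp of last kernel/copy */
    int  preempted;
} context_t;

static void check_preempt(context_t *ctx, long now, long max_idle) {
    if (!ctx->preempted && now - ctx->last_gpu_activity > max_idle)
        ctx->preempted = 1;   /* release the vGPU to a waiting context */
}
```

The single tunable, `max_idle`, is exactly the parameter the CONS bullet refers to: too small and contexts thrash, too large and idle GPUs go unreclaimed.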

Page 25:

Experiment: node-level

• Intra-application imbalance
  – The batch scheduler fails to capture intra-application imbalance
  – N-way sharing hides CPU execution behind GPU execution phases of co-located processes
  – Combining 2-way sharing and preemption further improves performance

[line chart: overall execution time (150–290 seconds) vs. percentage imbalance (10–50) for batch scheduling, 2-way sharing, 4-way sharing, preemptive sharing, and preemptive 2-way sharing; annotated improvements of 38% and 8.8%]

Page 26:

Experiment: node-level (cont’d)

• Inter-application imbalance
  – The batch scheduler causes GPU underutilization, leading to performance loss
  – N-way sharing provides improvement only if the imbalance is high
  – Preemptive sharing corrects the imbalance and leads to performance improvement

[bar chart: overall execution time (150–290 seconds) for workload compositions 3x[4]+1x[3], 2x[4]+2x[3], 1x[4]+3x[3], and 4x[3] under batch scheduling, 2-way sharing, 4-way sharing, preemptive sharing, and preemptive 2-way sharing]

Page 27:

Experiment: cluster-level

• 2 nodes with 7 GPUs
• The batch scheduler is unable to schedule jobs with more processes than GPUs
• 4-way sharing and preemptive sharing lead to 25-30% and 40-45% performance improvements, respectively

[bar chart: overall execution time (0–900 seconds) per scheduling scheme (batch scheduling, 4-way sharing, preemptive 2-way sharing) for 4, 6, and 8 processes/app]

Page 28:

Conclusion

• Node-level runtime providing
  – GPU virtualization (manageability)
  – GPU sharing (utilization, latency hiding)
  – Flexible scheduling (configurability)
  – Dynamic binding & preemption (utilization, latency hiding)
• What lies ahead…
  – Integration with cluster-level schedulers
  – Dynamic scheduling at the cluster level
  – Power-efficiency considerations

Page 29:

Thanks

• My coauthors: Michela @ MU, Xiang @ MU, Ian @ MU, Adam @ MU, Vignesh @ AMD, Chak @ NEC
• You all for your attention!

Page 30:

Understanding GPU resource utilization (cont’d)

[decision diagram: if # thread-blocks < # SMs, SMs are underutilized and space sharing co-schedules blocks of another kernel (b11–b13 with b21–b22); if all SMs are busy, time sharing interleaves thread-blocks (b11/b21 … b15/b25) and hides latencies; if co-scheduled thread-blocks have conflicting register/shared-memory requirements, thread-block execution is serialized (worst case) instead of interleaved (best case)]

Page 31:

Experiments: runtime overhead

• 1 Tesla C2050 GPU, short-running jobs
• Overhead < 10% in the worst case, and amortized through GPU sharing

[bar chart: total execution time (0–25 sec) for 1, 2, 4, and 8 jobs on the bare CUDA runtime vs. our runtime with 1, 2, 4, and 8 vGPUs]

Page 32:

HPDC’13

• From single-process, single-threaded applications to multi-process/multi-threaded applications
• Challenge: synchronizations (e.g., barrier synchronizations, communication primitives) can introduce GPU underutilization
• Solution: preemptive GPU sharing

Page 33:

Scenario 1: intra-application imbalance

[timelines on GPU0–GPU3: (a) batch scheduling leaves GPUs idle around the synchronization points syncA1, syncA2, syncB1, syncC1 of applications A, B, and C; (b) controlled 2-way sharing co-locates processes of B and C with A's; (c) preemptive sharing preempts processes blocked at synchronization points so B's and C's processes fill the idle time]

Page 34:

Scenario 2: inter-application imbalance

[timelines on vGPUs over GPU0–GPU3 for applications A, B, and C, each with its own synchronization points and idle time: (a) batch scheduling serializes the applications; (b) controlled 2-way sharing co-schedules them on two vGPUs per GPU; (c) preemptive sharing and (d) preemptive 2-way sharing preempt processes idling at synchronization points, shrinking the overall idle time]

Page 35:

Types of swapping operations

• Inter-application swapping
  – Time-sharing of the GPU among applications with conflicting memory requirements
• Intra-application swapping
  – The memory footprint of one application is the memory footprint of the “largest” kernel

    malloc(&A_d, size);
    malloc(&B_d, size);
    malloc(&C_d, size);
    copyHD(A_d, A_h, size);
    matmul(A_d, A_d, B_d);  // B_d = A_d * A_d
    matmul(B_d, B_d, C_d);  // C_d = B_d * B_d
    copyDH(B_h, B_d, size);
    copyDH(C_h, C_d, size);

Page 36:

Types of swapping operations (cont’d)

On the bare CUDA runtime…

    malloc(&A_d, size);
    malloc(&B_d, size);
    malloc(&C_d, size);     // MEMORY CAPACITY EXCEEDED → RUNTIME ERROR!
    copyHD(A_d, A_h, size);
    matmul(A_d, A_d, B_d);  // B_d = A_d * A_d
    matmul(B_d, B_d, C_d);  // C_d = B_d * B_d
    copyDH(B_h, B_d, size);
    copyDH(C_h, C_d, size);

Page 37:

Types of swapping operations (cont’d)

On our runtime…

    malloc(&A_d, size);
    malloc(&B_d, size);
    malloc(&C_d, size);
    copyHD(A_d, A_h, size);
    matmul(A_d, A_d, B_d);  // first memory allocation & data transfer to GPU (A_d & B_d)
    matmul(B_d, B_d, C_d);  // SWAP(A_d) & memory allocation (C_d)
    copyDH(B_h, B_d, size);
    copyDH(C_h, C_d, size);
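The swap decision behind the three-buffer example above can be sketched as a capacity-tracking allocator that evicts resident buffers to the swap area when a new allocation would not fit. The eviction policy here (evict the first resident buffer, i.e., the oldest) is an illustrative choice, not the framework's actual policy:

```c
#include <assert.h>

/* Intra-application swapping: when a device allocation would exceed
 * capacity, evict earlier buffers to the host-side swap area. Sizes
 * are in arbitrary units. */
#define MAX_BUF 8
typedef struct {
    long size[MAX_BUF];
    int  resident[MAX_BUF];   /* 1 = on device, 0 = in swap area */
    int  n;                   /* number of buffers created so far */
    long capacity, used;      /* device memory capacity and usage */
} gpu_mem_t;

/* Returns the new buffer id, or -1 if it cannot fit even after
 * evicting everything. */
static int gm_alloc(gpu_mem_t *g, long size) {
    while (g->used + size > g->capacity) {   /* swap out until it fits */
        int victim = -1;
        for (int i = 0; i < g->n; i++)
            if (g->resident[i]) { victim = i; break; }
        if (victim < 0) return -1;           /* nothing left to evict */
        g->resident[victim] = 0;             /* would copyDH + free here */
        g->used -= g->size[victim];
    }
    int id = g->n++;
    g->size[id] = size;
    g->resident[id] = 1;
    g->used += size;
    return id;
}
```

With a two-buffer capacity and three equal allocations, the third allocation forces the first buffer out, mirroring SWAP(A_d) making room for C_d.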

Page 38:

Experiments: load balancing w/ dynamic binding

• Unbalanced system: 2 Tesla C2050 and 1 Quadro 2000 GPUs
• Especially on small batches of jobs, dynamic binding improves performance

[bar chart: total execution time (0–1400 sec) for 12, 24, and 36 jobs at CPU fractions 0 and 1, with and without load balancing through dynamic binding]

Page 39:

Runtime configurations

• Initial memory transfer deferral only
  – Only memory transfers before the 1st kernel call are deferred
  – PROS: overlap of computation and communication
  – CONS: more swapping overhead
• Unconditional memory transfer deferral
  – All memory transfers are deferred
  – PROS: less swapping overhead
  – CONS: no overlap of computation and communication

Page 40:

Actions performed and errors returned by the runtime, per application call:

• Malloc: create PTE (error: a virtual address cannot be assigned); allocate swap (error: swap memory cannot be allocated)
• CopyHD: check for a valid PTE (error: no valid PTE); move data to swap (error: swap/data size mismatch)
• CopyDH: check for a valid PTE (error: no valid PTE); if (PTE.toCopy2Swap) cudaMemcpyDH
• Free: check for a valid PTE (error: no valid PTE); de-allocate swap (error: cannot de-allocate swap); if (PTE.isAllocated) cudaFree
• Launch: check for a valid PTE (error: no valid PTE); if (!PTE.isAllocated) cudaMalloc; if (PTE.toCopy2Dev) cudaMemcpyHD; cudaLaunch
• Swap: check for a valid PTE (error: no valid PTE); if (PTE.toCopy2Swap) cudaMemcpyDH; if (PTE.isAllocated) cudaFree
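The Launch row of this table is the heart of delayed binding, so it is worth spelling out as code. A sketch in C that follows the table's steps, with stubbed `cuda*` calls that only flip flags and counters (the stubs and counters are illustrative; the flag names match the slides):

```c
#include <assert.h>

/* Page-table entry flags from the slides: isAllocated / toCopy2Dev /
 * toCopy2Swap. */
typedef struct {
    int valid, isAllocated, toCopy2Dev, toCopy2Swap;
} pte_t;

static int mallocs, copiesHD, launches;   /* stub call counters */

static void cudaMalloc_stub(pte_t *e)   { e->isAllocated = 1; mallocs++; }
static void cudaMemcpyHD_stub(pte_t *e) { e->toCopy2Dev = 0; copiesHD++; }
static void cudaLaunch_stub(void)       { launches++; }

/* Launch handling as in the table: check the PTE, allocate and copy
 * on demand, then launch. Returns 0 on success, -1 for "no valid PTE". */
static int handle_launch(pte_t *e) {
    if (!e->valid)        return -1;
    if (!e->isAllocated)  cudaMalloc_stub(e);
    if (e->toCopy2Dev)    cudaMemcpyHD_stub(e);
    cudaLaunch_stub();
    return 0;
}
```

A second launch on the same entry skips both the allocation and the copy, since both flags have been cleared; the Swap row reverses this by copying back and freeing.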

Page 41:

Flags for Page Table Entries handling

[state diagram over the flag triple isAllocated/toCopy2Dev/toCopy2Swap: entries move among the states F/F/F, F/T/F, T/F/F, T/T/F, and T/F/T on copyHD, copyDH, launch, and swap operations; e.g., copyHD sets toCopy2Dev before the entry is allocated (F/F/F → F/T/F), and launch allocates the entry and clears toCopy2Dev (F/T/F → T/F/F)]