Supporting GPU Sharing in Cloud Environments with a Transparent Runtime Consolidation Framework

Vignesh Ravi (The Ohio State University)
Michela Becchi (University of Missouri)
Gagan Agrawal (The Ohio State University)
Srimat Chakradhar (NEC Laboratories America)

Two Interesting Trends

• GPU, “Big player” in High Performance Computing
– Excellent “price-performance” and “performance-per-watt” ratio
– Heterogeneous architectures: AMD Fusion APU, Intel Sandy Bridge, NVIDIA Denver Project
– 3 out of top 4 supercomputers (Tianhe-1A, Nebulae, and Tsubame)

• Emergence of Cloud: “Pay-as-you-go” model
– Cluster instances, high-speed interconnects for HPC users
– Amazon, Nimbix GPU instances

BIG FIRST STEP! But at initial stages

Motivation

• Sharing is the basis of cloud, GPU no exception
– Multiple virtual machines may share a physical node

• Modern GPUs are more expensive than multi-core CPUs
– Fermi cards with 6 GB memory, $4000
– Better resource utilization

• Modern GPUs expose a high degree of parallelism
– Applications may not utilize the full potential

Related Work

• vCUDA (Shi et al.)
• GViM (Gupta et al.)
• gVirtuS (Giunta et al.)
• rCUDA (Duato et al.)

Enable GPU Visibility from Virtual Machines

Limitation: Only from Single Process Context

How to share GPUs from Virtual Machines?

CUDA Compute Capability 2.0+ Supports Task Parallelism

Contributions

• A Framework for transparent GPU sharing in cloud
– No source code changes required, feasible in cloud
– Propose sharing through consolidation

• Solution to conceptual consolidation problem
– New method for computing consolidation affinity scores
– Two new molding methods
– Overall Runtime consolidation algorithm

• Extensive evaluation with 8 benchmarks on 2 GPUs
– At high contention, 50% improved throughput
– Framework overheads are small

Outline

• Background
• Understanding Consolidation on GPU
• Framework Design
• Consolidation Decision Making Layer
• Experimental Results
• Conclusions


BACKGROUND

• GPU Architecture
• CUDA Mapping and Scheduling

Background

[Figure: GPU architecture. Several streaming multiprocessors (SM), each with its own shared memory (SH MEM), above the GPU device memory]

• Resource Requirements < Max Available: Inter-leaved execution
• Resource Requirements > Max Available: Serialized execution
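The rule on this slide can be read as a simple per-SM resource check. The following is a minimal Python sketch of that reading, not a model of the actual CUDA block scheduler. The limits are assumptions for a Fermi-class card (48 KB shared memory per SM, 1536 resident threads per SM), and the names `Kernel` and `execution_mode` are hypothetical.

```python
from dataclasses import dataclass

# Assumed per-SM limits for a Fermi-class GPU; adjust for other hardware.
MAX_SHMEM_PER_SM = 48 * 1024   # bytes of shared memory per SM
MAX_THREADS_PER_SM = 1536      # resident threads per SM (assumption)


@dataclass(frozen=True)
class Kernel:
    threads_per_block: int
    shmem_per_block: int  # bytes


def execution_mode(a: Kernel, b: Kernel) -> str:
    """Return 'interleaved' if one block of each kernel fits on an SM together.

    The slide only states '<' and '>'; treating the equal case as a fit
    is a choice made for this sketch.
    """
    threads = a.threads_per_block + b.threads_per_block
    shmem = a.shmem_per_block + b.shmem_per_block
    if threads <= MAX_THREADS_PER_SM and shmem <= MAX_SHMEM_PER_SM:
        return "interleaved"
    return "serialized"


light = Kernel(threads_per_block=256, shmem_per_block=16 * 1024)
heavy = Kernel(threads_per_block=512, shmem_per_block=48 * 1024)
print(execution_mode(light, light))  # both blocks fit on one SM
print(execution_mode(light, heavy))  # shared memory demand exceeds the SM
```

Only threads and shared memory are checked here; registers and the per-SM block limit would need the same treatment in a fuller model.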


UNDERSTANDING CONSOLIDATION on GPU

• Demonstrate Potential of Consolidation
• Relation between Utilization and Performance
• Preliminary experiments with consolidation

GPU Utilization vs Performance

[Figure: Scalability of Applications. x-axis: execution configuration (2*256, 4*256, 8*256, 16*256, 32*256, 64*256); y-axis: scalability over 1*256 (0 to 14); series: Black Scholes, Binomial Options, PDE Solver, Image Processing; annotations: Linear, Sub-Linear, No Significant Improvement, Good Improvement]

Consolidation with Space and Time Sharing

[Figure: App 1 and App 2 sharing the SMs of one GPU, each SM with its own shared memory (SH MEM)]

• Cannot utilize all SMs effectively
• Better performance at large no. of blocks
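One way to read the two mechanisms is by comparing the combined block count with the number of SMs. The sketch below assumes that blocks are placed one per SM before any SM receives a second block, which is a simplification. The function name `sharing_mode` is hypothetical, and the default of 14 SMs matches the Tesla C2050 used in the experiments.

```python
NUM_SMS = 14  # SM count of the Tesla C2050 used in the talk


def sharing_mode(blocks_a: int, blocks_b: int, num_sms: int = NUM_SMS) -> str:
    """Classify how two kernels would share a GPU under the assumption above."""
    if blocks_a + blocks_b <= num_sms:
        # Each block can have an SM to itself: the kernels sit side by side.
        return "space"
    # More blocks than SMs: blocks of both kernels take turns on the same SMs.
    return "time"


print(sharing_mode(7, 7))    # two small kernels fit next to each other
print(sharing_mode(14, 14))  # two full-width kernels must interleave
```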


FRAMEWORK DESIGN

• Challenges
• gVirtuS Current Design
• Consolidation Framework & its Components

Design Challenges

• Enabling GPU Sharing: Need a Virtual Process Context
• When & What to Consolidate: Need Policies and Algorithms to decide
• Overheads: Light-Weight Design

gVirtuS Current Design

[Figure: gVirtuS architecture. Guest side: CUDA App1 in VM1 and CUDA App2 in VM2, each using the Frontend Library. Guest-Host Communication Channel. Host side (Linux / VMM): gVirtuS Backend with Backend Process 1 and Backend Process 2, CUDA Runtime, CUDA Driver, GPU1 … GPUn]

• Fork Process
• No Communication b/w processes

Runtime Consolidation Framework

[Figure: host-side components. BackEnd Server, Dispatcher, Consolidation Decision Maker (Policies, Heuristics), and for each GPU a Virtual Context with a Workload Consolidator thread]

• Workloads arrive from Frontend
• BackEnd Server queues workloads to Dispatcher
• Dispatcher queues workloads to Virtual Context Ready Queue
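The host-side flow in this slide (workloads queued to a dispatcher, then to a per-GPU virtual context) can be sketched with ordinary threads and queues. This is an illustrative Python model only: the class `VirtualContext`, the function `dispatch`, and the alternating placement policy are hypothetical stand-ins, and no GPU work is launched.

```python
import queue
import threading


class VirtualContext(threading.Thread):
    """One long-lived worker per GPU that drains its own ready queue."""

    def __init__(self, gpu_id: int):
        super().__init__(daemon=True)
        self.gpu_id = gpu_id
        self.ready = queue.Queue()
        self.completed = []

    def run(self):
        while True:
            workload = self.ready.get()
            if workload is None:  # sentinel: shut the context down
                break
            # Stand-in for launching the kernel inside the shared context.
            self.completed.append((self.gpu_id, workload))


def dispatch(workloads, contexts, choose):
    """Queue each workload to the context picked by the decision function."""
    for w in workloads:
        contexts[choose(w, contexts)].ready.put(w)


contexts = [VirtualContext(0), VirtualContext(1)]
for c in contexts:
    c.start()

# Placeholder policy: alternate by workload id (not the talk's heuristic).
dispatch(range(4), contexts, lambda w, ctxs: w % len(ctxs))

for c in contexts:
    c.ready.put(None)
    c.join()
```

Keeping one long-lived thread per GPU is what lets workloads from different frontends run inside a single process context, which is the property the design challenges call for.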


CONSOLIDATION DECISION MAKING LAYER

• GPU Sharing Mechanisms & Resource Contention
• Two Molding Policies
• Consolidation Runtime Scheduling Algorithm

Sharing Mechanisms & Resource Contention

• Sharing Mechanisms
– Consolidation by Space Sharing
– Consolidation by Time Sharing

• Resource Contention (basis of affinity score)
– Large no. of threads within a block
– Pressure on shared memory

Molding Kernel Configuration

• Perform molding dynamically
• Leverage gVirtuS to intercept kernel launch
• Flexible for configuration modification
• Mold the configuration to reduce contention
• Potential increase in application latency
• However, may still improve global throughput

Two Molding Policies

• Forced Space Sharing: e.g. 14 * 256 molded to 7 * 512
– May resolve shared memory contention

• Time Sharing with Reduced Threads: e.g. 14 * 512 molded to 14 * 128
– May reduce register pressure in the SM
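The two example configurations on this slide (14 * 256 becoming 7 * 512, and 14 * 512 becoming 14 * 128) suggest simple transformations of the blocks * threads pair. The Python sketch below encodes that reading. The halving of blocks and the reduction factor of 4 are inferred from those two examples and are assumptions, as are the function names. A real framework would also have to ensure that the molded kernel still covers all of its data, which is omitted here.

```python
from typing import Tuple


def forced_space_sharing(blocks: int, threads: int) -> Tuple[int, int]:
    """Halve the blocks and double the threads per block (14*256 -> 7*512)."""
    if blocks % 2:
        raise ValueError("block count must be even to halve it")
    return blocks // 2, threads * 2


def reduced_threads(blocks: int, threads: int, factor: int = 4) -> Tuple[int, int]:
    """Keep the blocks and shrink the threads per block (14*512 -> 14*128)."""
    if threads % factor:
        raise ValueError("threads per block must be divisible by factor")
    return blocks, threads // factor


print(forced_space_sharing(14, 256))  # fewer blocks, so fewer SMs are occupied
print(reduced_threads(14, 512))       # same blocks, fewer threads on each SM
```

Forced space sharing keeps the total thread count constant, whereas reducing threads lowers it, which is consistent with the slide's note that molding may increase application latency.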

Consolidation Scheduling Algorithm

• Greedy-based Scheduling Algorithm
• Schedule “N” kernels on 2 GPUs
• Input: 3-Tuple Execution Configuration list of all kernels
• Data Structure: Work Queue for each Virtual Context

Overall Algorithm

Generate Pair-wise Affinity

Generate Affinity for List

Get Affinity By Molding

Consolidation Scheduling Algorithm

[Flowchart, in order:]

• Input: configuration list
• Create Work Queues for Virtual Contexts
• Generate Pair-wise Affinity
• Find the pair with min. affinity and split the pair into different queues
• For each remaining kernel, with each Work Queue:
– (a1, a2) = Generate Affinity For List
– (a3, a4) = Get Affinity By Molding
– Find Max(a1, a2, a3, a4)
– Push kernel into queue
• Dispatch queues into Virtual Contexts
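The flow above can be sketched as a greedy loop over two work queues. The Python below is a minimal illustration under several assumptions, not the authors' implementation. The 3-tuple is taken to be (blocks, threads per block, shared memory bytes), `toy_affinity` is a made-up placeholder score (higher means less contention), the affinity of a kernel with a queue is taken as its worst pairwise score, and `mold_options` reuses the two molding examples from the earlier slide.

```python
from itertools import combinations

NUM_SMS = 14             # Tesla C2050
MAX_SHMEM = 48 * 1024    # bytes per SM
MAX_THREADS = 1024       # hypothetical per-SM comfort limit for this sketch


def toy_affinity(k1, k2):
    """Placeholder score; 0 means no contention, negative means contention."""
    (b1, t1, s1), (b2, t2, s2) = k1, k2
    if b1 + b2 <= NUM_SMS:  # space sharing: kernels sit on different SMs
        return 0.0
    shmem_over = max(0, s1 + s2 - MAX_SHMEM) / MAX_SHMEM
    thread_over = max(0, t1 + t2 - MAX_THREADS) / MAX_THREADS
    return -(shmem_over + thread_over)


def mold_options(kernel):
    """Molded variants of a kernel (shared memory per block left unchanged)."""
    blocks, threads, shmem = kernel
    options = []
    if blocks % 2 == 0:
        options.append((blocks // 2, threads * 2, shmem))  # forced space sharing
    if threads % 4 == 0:
        options.append((blocks, threads // 4, shmem))      # reduced threads
    return options


def schedule(kernels, affinity=toy_affinity):
    # Seed the two queues with the least compatible pair, one kernel each.
    i, j = min(combinations(range(len(kernels)), 2),
               key=lambda p: affinity(kernels[p[0]], kernels[p[1]]))
    queues = [[kernels[i]], [kernels[j]]]
    remaining = [k for n, k in enumerate(kernels) if n not in (i, j)]

    # Greedily place each remaining kernel, unmolded or molded.
    for kernel in remaining:
        candidates = []
        for q in queues:
            for cfg in [kernel] + mold_options(kernel):
                score = min(affinity(cfg, other) for other in q)
                candidates.append((score, cfg, q))
        # On ties, max() keeps the first candidate: unmolded, first queue.
        _, cfg, q = max(candidates, key=lambda c: c[0])
        q.append(cfg)
    return queues


kernels = [(7, 512, 48 * 1024), (14, 256, 48 * 1024),
           (14, 256, 32 * 1024), (14, 512, 0)]
queues = schedule(kernels)
```

With these inputs the two shared-memory-heavy kernels seed different queues, and the third kernel is molded to 7 * 512 so that it can space-share with the 7-block kernel. The sketch ignores load balance between the queues, which a practical scheduler would also weigh.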


EXPERIMENTAL RESULTS

• Setup, Metric & Baselines
• Benchmarks
• Results

Setup, Metric & Baselines

• Setup
– A machine with two Intel quad-core Xeon E5520 CPUs
– Two NVIDIA Tesla C2050 GPU cards
   • 14 streaming multiprocessors, each containing 32 cores
   • 3 GB device memory
   • 48 KB shared memory per SM
– Virtualized with gVirtuS 2.0

• Evaluation Metric
– Global throughput benefit obtained after consolidation of kernels

• Baselines
– Serialized execution, based on CUDA runtime scheduling
– Blind round-robin based consolidation (unaware of execution configuration)
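To make the metric and the blind baseline concrete, here is a small Python sketch. The slides do not give a formula, so `throughput_benefit` assumes the benefit is the ratio of serialized to consolidated completion time for the same set of kernels. Both function names and the timings in the example are hypothetical.

```python
def round_robin(kernels, num_gpus=2):
    """Blind baseline: assign kernels to GPUs in arrival order,
    ignoring their execution configurations."""
    queues = [[] for _ in range(num_gpus)]
    for n, kernel in enumerate(kernels):
        queues[n % num_gpus].append(kernel)
    return queues


def throughput_benefit(serialized_time, consolidated_time):
    """Assumed definition: how many times faster the same batch completes."""
    return serialized_time / consolidated_time


print(round_robin(["k0", "k1", "k2", "k3"]))
# Made-up timings: a batch that takes 3.0 s serialized and 2.0 s consolidated
# would count as a 1.5x throughput, i.e. a 50% improvement.
print(throughput_benefit(3.0, 2.0))
```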


Benchmarks & Goals

Benchmarks and their Characteristics

Benchmark                    Memory characteristics       Data set description
Image Processing (IP)        No ShMem                     2*3584*3584 points
PDE Solver (PDE)             No ShMem                     2*3584*3584 points
BlackScholes (BS)            No ShMem                     1,000,000 options
Binomial Options (BO)        Low ShMem (up to 3 KB)       256 options, 2048 steps
K-Means Clustering (KM)      Med ShMem (up to 16 KB)      4194304 points
K-Nearest Neighbour (KNN)    Med ShMem (up to 16 KB)      4194304 points
Euler (EU)                   Heavy ShMem (up to 48 KB)    10,000 nodes, 60,000 edges
Molecular Dynamics (MD)      Heavy ShMem (up to 48 KB)    130,000 nodes, 16,200,000 edges

Benefits of Space and Time Sharing Mechanisms

[Figure: two panels, Space Sharing and Time Sharing]

• No resource contention
• Consolidation through Blind Round-Robin algorithm
• Compared against serialized execution of kernels

Drawbacks of Blind Scheduling

[Figure: two panels, Large Number of Threads and Shared Memory Contention]

• Presence of resource contention
• No benefit from consolidation

Effect of Molding

[Figure: two panels]

• Contention from a large number of threads: Time Sharing with Reduced Threads
• Contention on shared memory: Forced Space Sharing

Effect of Affinity Scores

• Kernel Configurations
– 2 kernels with 7*512
– 2 kernels with 14*256

• No affinity: Unbalanced Threads per SM
• With affinity: Better Thread Balancing per SM
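A small calculation shows why the grouping of these four kernels matters for thread balance. The Python sketch below assumes that blocks are handed to SMs in round-robin order, which simplifies the hardware block scheduler. The groupings compared are one plausible reading of the slide rather than the measured experiment, and `threads_per_sm` is a hypothetical helper.

```python
def threads_per_sm(kernels, num_sms=14):
    """Spread blocks over SMs round-robin and sum the threads each SM hosts."""
    load = [0] * num_sms
    slot = 0
    for blocks, threads in kernels:
        for _ in range(blocks):
            load[slot % num_sms] += threads
            slot += 1
    return load


# Mixing the two shapes on one GPU: 21 blocks land unevenly on 14 SMs.
mixed = threads_per_sm([(7, 512), (14, 256)])
# Pairing like with like: every SM ends up with the same thread count.
paired_small = threads_per_sm([(7, 512), (7, 512)])
paired_wide = threads_per_sm([(14, 256), (14, 256)])

print(max(mixed), min(mixed))        # 768 256
print(set(paired_small), set(paired_wide))
```

Under this model the mixed grouping leaves half of the SMs with three times the threads of the other half, while both like-with-like groupings give every SM 512 threads.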

Benefits at High Contention Scenario

8 Kernels on 2 GPUs

• 6 out of 8 kernels molded
• 31.5% improvement over Blind Scheduling
• 50% improvement over serialized execution

Framework Overheads

[Figure: two panels, No Consolidation and With Consolidation]

• No Consolidation: compared to plain gVirtuS execution, overhead always less than 1%
• With Consolidation: compared with manually consolidated execution, overhead always less than 4%


Conclusions

• A Framework for transparent sharing of GPUs
• Use Consolidation as a mechanism for sharing GPUs
• No source code level changes
• New Affinity and Molding methods
• Runtime Consolidation Scheduling Algorithm
• At high contention, significant throughput benefits
• The overheads of the framework are small

Thank You for your attention!

Questions?

Impact of Large Number of Threads

Per Application Slowdown / Choice of Molding

[Figure: two panels, Application Slowdown and Choice of Molding Type]
