Supporting GPU Sharing in Cloud Environments with a Transparent Runtime Consolidation Framework

Vignesh Ravi (The Ohio State University)
Michela Becchi (University of Missouri)
Gagan Agrawal (The Ohio State University)
Srimat Chakradhar (NEC Laboratories America)



Two Interesting Trends

• GPU, “Big player” in High Performance Computing
– Excellent “price-performance” and “performance-per-watt” ratio
– Heterogeneous architectures: AMD Fusion APU, Intel Sandy Bridge, NVIDIA Denver Project
– 3 out of top 4 supercomputers (Tianhe-1A, Nebulae, and Tsubame)

• Emergence of Cloud: “Pay-as-you-go” model
– Cluster instances, high-speed interconnects for HPC users
– Amazon, Nimbix GPU instances

BIG FIRST STEP! But at initial stages


Motivation

• Sharing is the basis of cloud, GPU no exception
– Multiple virtual machines may share a physical node

• Modern GPUs are more expensive than multi-core CPUs
– Fermi cards with 6 GB memory cost $4000
– Better resource utilization

• Modern GPUs expose a high degree of parallelism
– Applications may not utilize the full potential


Related Work

• vCUDA (Shi et al.)
• GViM (Gupta et al.)
• gVirtuS (Giunta et al.)
• rCUDA (Duato et al.)

Enable GPU visibility from virtual machines

Limitation: only from a single process context

How to share GPUs from virtual machines?

CUDA compute capability 2.0+ supports task parallelism


Contributions

• A framework for transparent GPU sharing in cloud
– No source code changes required, feasible in cloud
– Propose sharing through consolidation

• Solution to the conceptual consolidation problem
– New method for computing consolidation affinity scores
– Two new molding methods
– Overall runtime consolidation algorithm

• Extensive evaluation with 8 benchmarks on 2 GPUs
– At high contention, 50% improved throughput
– Framework overheads are small


Outline

• Background
• Understanding Consolidation on GPU
• Framework Design
• Consolidation Decision Making Layer
• Experimental Results
• Conclusions


BACKGROUND

• GPU Architecture
• CUDA Mapping and Scheduling


Background

[Figure: GPU architecture. Several SMs, each with its own shared memory (SH MEM), sit above the GPU device memory.]

• Resource requirements < max available: interleaved execution
• Resource requirements > max available: serialized execution
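The rule above can be made concrete with a minimal sketch. It assumes that "resource requirements" means the threads and shared memory one block of each kernel needs on an SM; registers are ignored. The names `BlockDemand` and `execution_mode` are hypothetical and are not part of CUDA or gVirtuS. The 48 KB figure matches the Tesla C2050 used in the evaluation, and 1536 resident threads per SM is the compute capability 2.0 limit.

```python
from dataclasses import dataclass

# Per-SM limits (assumed model): 48 KB shared memory as on the Tesla C2050,
# 1536 resident threads per SM as on compute capability 2.0 devices.
MAX_THREADS_PER_SM = 1536
MAX_SHMEM_PER_SM = 48 * 1024


@dataclass(frozen=True)
class BlockDemand:
    """Hypothetical per-block resource demand of one kernel."""
    threads: int       # threads per block
    shmem_bytes: int   # shared memory per block, in bytes


def execution_mode(a: BlockDemand, b: BlockDemand) -> str:
    """Return 'interleaved' if one block of each kernel fits on an SM together."""
    fits_threads = a.threads + b.threads <= MAX_THREADS_PER_SM
    fits_shmem = a.shmem_bytes + b.shmem_bytes <= MAX_SHMEM_PER_SM
    return "interleaved" if fits_threads and fits_shmem else "serialized"
```

For example, a 256-thread block with no shared memory and a 512-thread block using 16 KB fit together, while two blocks that need 48 KB and 16 KB of shared memory do not.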


UNDERSTANDING CONSOLIDATION on GPU

• Demonstrate potential of consolidation
• Relation between utilization and performance
• Preliminary experiments with consolidation


GPU Utilization vs Performance

[Figure: Scalability of Applications. x-axis: execution configuration (2*256, 4*256, 8*256, 16*256, 32*256, 64*256); y-axis: scalability over 1*256 (0 to 14). Series: Black Scholes, Binomial Options, PDE Solver, Image Processing. Annotations: Linear, Sub-Linear, No Significant Improvement, Good Improvement.]


Consolidation with Space and Time Sharing

[Figure: four SMs, each with its own shared memory (SH MEM), shared between App 1 and App 2.]

• Cannot utilize all SMs effectively
• Better performance at large no. of blocks


FRAMEWORK DESIGN

• Challenges
• gVirtuS Current Design
• Consolidation Framework & its Components


Design Challenges

• Enabling GPU sharing: need a virtual process context
• When & what to consolidate: need policies and algorithms to decide
• Overheads: light-weight design


gVirtuS Current Design

[Figure: gVirtuS architecture. Guest side: CUDA App1 in VM1 and CUDA App2 in VM2, each with a Frontend Library. Host side (Linux / VMM): gVirtuS Backend with Backend Process 1 and Backend Process 2, CUDA Runtime, CUDA Driver, GPU1 … GPUn. A guest-host communication channel links the two sides.]

• Fork process
• No communication between processes


Runtime Consolidation Framework

[Figure: host-side framework. Components: BackEnd Server, Dispatcher, Consolidation Decision Maker (Policies, Heuristics), two Virtual Contexts, two Workload Consolidators, two GPUs. Additional label: Thread.]

• Workloads arrive from Frontend
• Queues workloads to Dispatcher
• Queues workloads to Virtual Context ready queue


CONSOLIDATION DECISION MAKING LAYER

• GPU Sharing Mechanisms & Resource Contention
• Two Molding Policies
• Consolidation Runtime Scheduling Algorithm


Sharing Mechanisms & Resource Contention

Sharing mechanisms
• Consolidation by space sharing
• Consolidation by time sharing

Resource contention (basis of affinity score)
• Large no. of threads within a block
• Pressure on shared memory


Molding Kernel Configuration

• Perform molding dynamically
• Leverage gVirtuS to intercept kernel launch
• Flexible for configuration modification
• Mold the configuration to reduce contention
• Potential increase in application latency
• However, may still improve global throughput


Two Molding Policies

Molding policies

• Forced Space Sharing
– Example: 14 * 256 molded to 7 * 256
– May resolve shared memory contention

• Time Sharing with Reduced Threads
– Example: 14 * 512 molded to 14 * 128
– May reduce register pressure in the SM
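The two policies can be sketched as rewrites of a kernel's launch configuration. This is a minimal illustration, not the framework's code: the function names are hypothetical, and the molding factors (halving the blocks, quartering the threads) are assumptions taken from the slide's two examples. The sketch only rewrites the (blocks, threads per block) tuple and does not model how the kernel's work is redistributed.

```python
from typing import Tuple

Config = Tuple[int, int]  # (thread blocks, threads per block)


def forced_space_sharing(config: Config, factor: int = 2) -> Config:
    """Cut the number of blocks so the kernel occupies fewer SMs."""
    blocks, threads = config
    return (max(1, blocks // factor), threads)


def reduced_threads(config: Config, factor: int = 4) -> Config:
    """Cut the threads per block while keeping the number of blocks."""
    blocks, threads = config
    return (blocks, max(1, threads // factor))
```

With the default factors, `forced_space_sharing((14, 256))` gives `(7, 256)` and `reduced_threads((14, 512))` gives `(14, 128)`, matching the configurations shown on the slide.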


Consolidation Scheduling Algorithm

• Greedy-based scheduling algorithm
• Schedule “N” kernels on 2 GPUs
• Input: 3-tuple execution configuration list of all kernels
• Data structure: work queue for each Virtual Context

Overall algorithm
• Generate Pair-wise Affinity
• Generate Affinity for List
• Get Affinity By Molding


Consolidation Scheduling Algorithm

• Input: configuration list
• Create work queues for Virtual Contexts
• Generate Pair-wise Affinity
• Find the pair with min. affinity and split the pair into different queues
• For each remaining kernel:
– (a1, a2) = Generate Affinity For List with each work queue
– (a3, a4) = Get Affinity By Molding with each work queue
– Find Max(a1, a2, a3, a4)
– Push kernel into queue
• Dispatch queues into Virtual Contexts
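The flow above can be sketched in a few lines of Python. Several parts are assumptions made only for illustration: the 3-tuple is taken to be (blocks, threads per block, shared memory bytes per block); `toy_affinity` is a placeholder score, not the authors' affinity formula; `MOLD_PENALTY` is an invented cost standing in for the latency increase that molding may cause; and ties are broken in favor of unmolded kernels and shorter queues.

```python
from itertools import combinations
from typing import List, Tuple

# Assumed 3-tuple: (thread blocks, threads per block, shared memory bytes per block).
Kernel = Tuple[int, int, int]

NUM_SMS = 14                 # Tesla C2050, per the setup slide
MAX_THREADS_PER_SM = 1536    # compute capability 2.0 limit
MAX_SHMEM_PER_SM = 48 * 1024
MOLD_PENALTY = 0.1           # assumed cost of molding (not from the slides)


def toy_affinity(a: Kernel, b: Kernel) -> float:
    """Placeholder score (not the authors' formula): higher means less contention."""
    if a[0] + b[0] <= NUM_SMS:
        # Few enough blocks to sit on disjoint SMs (space sharing).
        threads, shmem = max(a[1], b[1]), max(a[2], b[2])
    else:
        # Blocks of both kernels share SMs (time sharing).
        threads, shmem = a[1] + b[1], a[2] + b[2]
    return 1.0 - max(threads / MAX_THREADS_PER_SM, shmem / MAX_SHMEM_PER_SM)


def list_affinity(kernel: Kernel, queue: List[Kernel]) -> float:
    """Affinity of a kernel with a whole work queue: its worst pairing."""
    return min(toy_affinity(kernel, other) for other in queue)


def mold(kernel: Kernel) -> List[Kernel]:
    """Molded variants: halved blocks, or a quarter of the threads per block."""
    blocks, threads, shmem = kernel
    return [(max(1, blocks // 2), threads, shmem),
            (blocks, max(1, threads // 4), shmem)]


def schedule(kernels: List[Kernel]) -> List[List[Kernel]]:
    """Greedily split kernels into two work queues, one per virtual context."""
    if len(kernels) < 2:
        return [list(kernels), []]
    # Seed the queues with the least compatible pair, one kernel in each.
    i, j = min(combinations(range(len(kernels)), 2),
               key=lambda p: toy_affinity(kernels[p[0]], kernels[p[1]]))
    queues = [[kernels[i]], [kernels[j]]]
    for idx, kernel in enumerate(kernels):
        if idx in (i, j):
            continue
        # Candidate: (score, is_unmolded, -queue_length, queue_index, config).
        candidates = []
        for q, queue in enumerate(queues):
            candidates.append(
                (list_affinity(kernel, queue), True, -len(queue), q, kernel))
            for variant in mold(kernel):
                score = list_affinity(variant, queue) - MOLD_PENALTY
                candidates.append((score, False, -len(queue), q, variant))
        best = max(candidates, key=lambda c: c[:3])
        queues[best[3]].append(best[4])
    return queues
```

As a usage example, `schedule([(7, 512, 32768), (14, 256, 32768), (14, 256, 32768)])` seeds the two queues with the first two kernels, then places the third kernel in the first queue after molding it to 7 blocks, because under this toy score the two 7-block kernels can sit on disjoint SMs and no longer compete for shared memory.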


EXPERIMENTAL RESULTS

• Setup, Metric & Baselines
• Benchmarks
• Results


Setup, Metric & Baselines

• Setup
– A machine with two Intel quad-core Xeon E5520 CPUs
– Two NVIDIA Tesla C2050 GPU cards
   • 14 streaming multiprocessors, each containing 32 cores
   • 3 GB device memory
   • 48 KB shared memory per SM
– Virtualized with gVirtuS 2.0

• Evaluation metric
– Global throughput benefit obtained after consolidation of kernels

• Baselines
– Serialized execution, based on CUDA runtime scheduling
– Blind round-robin based consolidation (unaware of execution configuration)


Benchmarks & Goals

Benchmarks and their characteristics

Benchmark                     Memory characteristics       Data set description
Image Processing (IP)         No ShMem                     2*3584*3584 points
PDE Solver (PDE)              No ShMem                     2*3584*3584 points
BlackScholes (BS)             No ShMem                     1,000,000 options
Binomial Options (BO)         Low ShMem (up to 3 KB)       256 options, 2048 steps
K-Means Clustering (KM)       Med ShMem (up to 16 KB)      4194304 points
K-Nearest Neighbour (KNN)     Med ShMem (up to 16 KB)      4194304 points
Euler (EU)                    Heavy ShMem (up to 48 KB)    10,000 nodes, 60,000 edges
Molecular Dynamics (MD)       Heavy ShMem (up to 48 KB)    130,000 nodes, 16,200,000 edges


Benefits of Space and Time Sharing Mechanisms

[Figure: two panels, Space Sharing and Time Sharing.]

• No resource contention
• Consolidation through blind round-robin algorithm
• Compared against serialized execution of kernels


Drawbacks of Blind Scheduling

[Figure: two panels, Large Number of Threads and Shared Memory Contention.]

• Presence of resource contention
• No benefit from consolidation


Effect of Molding

[Figure: two panels.]

• Contention from a large number of threads: Time Sharing with Reduced Threads
• Contention on shared memory: Forced Space Sharing


Effect of Affinity Scores

Kernel configurations
• 2 kernels with 7*512
• 2 kernels with 14*256

• No affinity: unbalanced threads per SM
• With affinity: better thread balancing per SM
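One plausible reading of the balancing effect can be worked through with a small model. The block-to-SM mapping below is an assumption for illustration (each kernel places one block per SM on the least-loaded SMs); it is not a description of the hardware block scheduler. `threads_per_sm` and `imbalance` are hypothetical helper names.

```python
from typing import List, Tuple

NUM_SMS = 14  # Tesla C2050


def threads_per_sm(kernels: List[Tuple[int, int]]) -> List[int]:
    """Per-SM thread load for kernels given as (blocks, threads per block).

    Assumed model: each kernel places one block per SM, on its `blocks`
    least-loaded SMs (requires blocks <= NUM_SMS).
    """
    load = [0] * NUM_SMS
    for blocks, threads in kernels:
        targets = sorted(range(NUM_SMS), key=lambda sm: load[sm])[:blocks]
        for sm in targets:
            load[sm] += threads
    return load


def imbalance(kernels: List[Tuple[int, int]]) -> int:
    """Difference between the most and least loaded SM, in threads."""
    load = threads_per_sm(kernels)
    return max(load) - min(load)
```

Under this model, pairing a 7*512 kernel with a 14*256 kernel on one GPU leaves seven SMs with 768 threads and seven with 256, while pairing the two 7*512 kernels together (and the two 14*256 kernels together) puts 512 threads on every SM.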


Benefits at High Contention Scenario

8 kernels on 2 GPUs

• 6 out of 8 kernels molded
• 31.5% improvement over blind scheduling
• 50% improvement over serialized execution


Framework Overheads

[Figure: two panels, No Consolidation and With Consolidation.]

• No consolidation: compared to plain gVirtuS execution, overhead always less than 1%
• With consolidation: compared with manually consolidated execution, overhead always less than 4%


Conclusions

• A framework for transparent sharing of GPUs
• Use consolidation as a mechanism for sharing GPUs
• No source code level changes
• New affinity and molding methods
• Runtime consolidation scheduling algorithm
• At high contention, significant throughput benefits
• The overheads of the framework are small


Impact of Large Number of Threads


Per Application Slowdown/ Choice of Molding

[Figure: two panels, Application Slowdown and Choice of Molding Type.]