
Supporting GPU Sharing in Cloud Environments with a Transparent

Runtime Consolidation Framework

Vignesh Ravi (The Ohio State University), Michela Becchi (University of Missouri)

Gagan Agrawal (The Ohio State University), Srimat Chakradhar (NEC Laboratories America)

1

Two Interesting Trends

• GPU, a “big player” in High Performance Computing
  – Excellent price-performance and performance-per-watt ratios
  – Heterogeneous architectures: AMD Fusion APU, Intel Sandy Bridge, NVIDIA Project Denver
  – 3 of the top 4 supercomputers (Tianhe-1A, Nebulae, and Tsubame)

• Emergence of the cloud with a “pay-as-you-go” model
  – Cluster instances and high-speed interconnects for HPC users
  – Amazon and Nimbix GPU instances

2

A BIG FIRST STEP! But still at an early stage

Motivation

• Sharing is the basis of the cloud; GPUs are no exception
  – Multiple virtual machines may share a physical node

• Modern GPUs are more expensive than multi-core CPUs
  – Fermi cards with 6 GB memory cost about $4,000
  – Sharing enables better resource utilization

• Modern GPUs expose a high degree of parallelism
  – Individual applications may not utilize their full potential

3

Related Work

• vCUDA (Shi et al.)
• GViM (Gupta et al.)
• gVirtuS (Giunta et al.)
• rCUDA (Duato et al.)

4

These systems enable GPU visibility from virtual machines

Limitation: only from a single process context

How can GPUs be shared from virtual machines?

CUDA compute capability 2.0+ supports task parallelism
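The task parallelism referred to here is concurrent kernel execution, available from compute capability 2.0 onward. Below is a minimal sketch, assuming two independent placeholder kernels, of how a single process context can overlap them using CUDA streams; the kernel bodies and sizes are purely illustrative.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Two trivial placeholder kernels standing in for independent workloads.
__global__ void kernelA(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 2.0f;
}
__global__ void kernelB(float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = y[i] + 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *x = nullptr, *y = nullptr;
    cudaMalloc((void**)&x, n * sizeof(float));
    cudaMalloc((void**)&y, n * sizeof(float));

    // On compute capability 2.0+ devices, kernels launched on different
    // streams from the same context may execute concurrently on the GPU.
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    kernelA<<<(n + 255) / 256, 256, 0, s1>>>(x, n);
    kernelB<<<(n + 255) / 256, 256, 0, s2>>>(y, n);
    cudaDeviceSynchronize();

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```

This single-context concurrency is the hardware capability that a consolidation approach can build on when kernels from different guests are funneled into one backend context.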

Contributions

• A framework for transparent GPU sharing in the cloud
  – No source code changes required, feasible in the cloud
  – Proposes sharing through consolidation

• A solution to the conceptual consolidation problem
  – A new method for computing consolidation affinity scores
  – Two new molding methods
  – An overall runtime consolidation algorithm

• Extensive evaluation with 8 benchmarks on 2 GPUs
  – At high contention, 50% improved throughput
  – Framework overheads are small

5

Outline

• Background
• Understanding Consolidation on GPU
• Framework Design
• Consolidation Decision Making Layer
• Experimental Results
• Conclusions

6

Outline

• Background
• Understanding Consolidation on GPU
• Framework Design
• Consolidation Decision Making Layer
• Experimental Results
• Conclusions

7

BACKGROUND

8

• GPU Architecture
• CUDA Mapping and Scheduling

Background

9

[GPU architecture diagram: multiple Streaming Multiprocessors (SMs), each with its own shared memory, connected to the GPU device memory]

CUDA mapping and scheduling:
• Resource requirements < max available: inter-leaved execution
• Resource requirements > max available: serialized execution
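As a rough illustration of this scheduling rule, the sketch below checks whether thread blocks of two kernels could be co-resident on one SM. The per-SM limits (roughly matching a Fermi-class GPU) and all names are assumptions for illustration, not the CUDA scheduler's actual logic.

```cuda
#include <cstdio>

// Hypothetical per-block resource needs of a kernel.
struct BlockRequirements {
    int threads;          // threads per block
    int sharedMemBytes;   // shared memory per block
};

// Assumed per-SM limits, roughly matching a Fermi-class GPU (e.g., Tesla C2050).
constexpr int kMaxThreadsPerSM = 1536;
constexpr int kSharedMemPerSM  = 48 * 1024;

// If the combined requirements fit within an SM's resources, blocks of the two
// kernels can be inter-leaved; otherwise their execution is serialized.
bool canInterleave(const BlockRequirements& a, const BlockRequirements& b) {
    return a.threads + b.threads <= kMaxThreadsPerSM &&
           a.sharedMemBytes + b.sharedMemBytes <= kSharedMemPerSM;
}

int main() {
    BlockRequirements lowShmem  {256, 3 * 1024};    // low shared-memory kernel
    BlockRequirements heavyShmem{256, 48 * 1024};   // heavy shared-memory kernel
    std::printf("inter-leaved? %s\n", canInterleave(lowShmem, heavyShmem) ? "yes" : "no");
    return 0;
}
```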

Outline

• Background

• Understanding Consolidation on GPU
• Framework Design
• Consolidation Decision Making Layer
• Experimental Results
• Conclusions

10

UNDERSTANDING CONSOLIDATION ON GPU

11

• Demonstrate Potential of Consolidation

• Relation between utilization and performance
• Preliminary experiments with consolidation

GPU Utilization vs Performance

12

[Chart: Scalability of Applications. Y-axis: scalability over the 1*256 configuration (0 to 14); x-axis: execution configuration (2*256, 4*256, 8*256, 16*256, 32*256, 64*256); series: Black Scholes, Binomial Options, PDE Solver, Image Processing; annotations: linear vs. sub-linear scaling, no significant improvement vs. good improvement]

Consolidation with Space and Time Sharing

13

[Diagram: App 1 and App 2 consolidated across the GPU's SMs, each SM with its own shared memory]

• Cannot utilize all SMs effectively
• Better performance at a large number of blocks

Outline

• Background
• Understanding Consolidation on GPU
• Framework Design
• Consolidation Decision Making Layer
• Experimental Results
• Conclusions

14

FRAMEWORK DESIGN

15

• Challenges
• gVirtuS Current Design
• Consolidation Framework & its Components

Design Challenges

16

• Enabling GPU sharing → need a virtual process context
• Deciding when and what to consolidate → need policies and algorithms
• Overheads → need a light-weight design

gVirtuS Current Design

17

[Architecture diagram: on the guest side, each VM (VM1, VM2) runs a CUDA application (CUDA App1, CUDA App2) linked against a gVirtuS frontend library; a guest-host communication channel connects the frontends to the gVirtuS backend on the host side, which sits on top of the CUDA runtime, the CUDA driver, Linux/VMM, and the GPUs (GPU1 … GPUn)]

• The backend forks a separate process for each application
• No communication between backend processes
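To make the frontend/backend split concrete, here is a loopback sketch of the interception idea: a guest-side stub serializes a CUDA runtime call and a backend handler services it. The Channel type, the wire format, and all function names are hypothetical; gVirtuS's real transport and API are not reproduced here.

```cuda
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <string>
#include <vector>

// Loopback stand-in for the guest-host communication channel; the real
// transport (sockets, shared memory, VM-specific channels) is not modeled.
struct Channel {
    std::string routine;
    std::vector<uint8_t> args;
    std::vector<uint8_t> reply;
};

// "Backend" side: would invoke the real CUDA runtime on the host; here it
// just fabricates a device pointer and a success code for illustration.
void backendHandle(Channel& ch) {
    if (ch.routine == "cudaMalloc") {
        void* fakeDevPtr = reinterpret_cast<void*>(0x1000);
        int   err        = 0;  // cudaSuccess
        ch.reply.resize(sizeof(fakeDevPtr) + sizeof(err));
        std::memcpy(ch.reply.data(), &fakeDevPtr, sizeof(fakeDevPtr));
        std::memcpy(ch.reply.data() + sizeof(fakeDevPtr), &err, sizeof(err));
    }
}

// "Frontend" side: the stub a guest application calls instead of the real
// CUDA runtime; it forwards the request and unpacks the backend's reply.
int cudaMallocStub(Channel& ch, void** devPtr, size_t size) {
    ch.routine = "cudaMalloc";
    const uint8_t* p = reinterpret_cast<const uint8_t*>(&size);
    ch.args.assign(p, p + sizeof(size));
    backendHandle(ch);  // in reality, a cross-VM round trip

    int err = -1;
    std::memcpy(devPtr, ch.reply.data(), sizeof(void*));
    std::memcpy(&err, ch.reply.data() + sizeof(void*), sizeof(err));
    return err;
}

int main() {
    Channel ch;
    void* d = nullptr;
    int err = cudaMallocStub(ch, &d, 1 << 20);
    std::printf("err=%d devPtr=%p\n", err, d);
    return 0;
}
```

The limitation noted on this slide is that each backend is a separate forked process with its own GPU context and no inter-process communication, so kernels from different guests cannot be consolidated; that is what the framework below changes.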

Runtime Consolidation Framework

18

[Architecture diagram, host side: workloads arrive from the frontends at the backend server, which queues them to the dispatcher; the dispatcher, guided by the consolidation decision maker (policies and heuristics), queues each workload to the ready queue of a virtual context; there is one virtual context per GPU, each running a workload consolidator thread]
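A minimal sketch of the data structures this picture implies: one ready queue per virtual context (one virtual context per GPU), filled by a dispatcher. All type and field names are assumptions for illustration, not the framework's actual code.

```cuda
#include <deque>
#include <string>
#include <vector>

// One workload as seen by the decision-making layer: a kernel plus its
// execution configuration (blocks, threads per block, shared memory).
struct Workload {
    std::string name;
    int blocks;
    int threadsPerBlock;
    int sharedMemBytes;
};

// One virtual context per physical GPU: a ready queue consumed by a
// workload-consolidator thread (the thread itself is omitted here).
struct VirtualContext {
    int gpuId;
    std::deque<Workload> readyQueue;
};

// The dispatcher assigns each arriving workload to some virtual context.
// The real framework consults affinity scores and molding policies here;
// this placeholder just picks the shortest queue (assumes contexts is non-empty).
void dispatch(std::vector<VirtualContext>& contexts, const Workload& w) {
    VirtualContext* best = &contexts.front();
    for (auto& vc : contexts)
        if (vc.readyQueue.size() < best->readyQueue.size()) best = &vc;
    best->readyQueue.push_back(w);
}
```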

Outline

• Background
• Understanding Consolidation on GPU
• Framework Design
• Consolidation Decision Making Layer
• Experimental Results
• Conclusions

19

CONSOLIDATION DECISION MAKING LAYER

• GPU Sharing Mechanisms & Resource Contention
• Two Molding Policies
• Consolidation Runtime Scheduling Algorithm

20

Sharing Mechanisms & Resource Contention

21

Sharing mechanisms:
• Consolidation by space sharing
• Consolidation by time sharing

Sources of resource contention (the basis of the affinity score):
• Large number of threads within a block
• Pressure on shared memory
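The slides name these two contention sources but not the affinity formula itself. The sketch below is one plausible way to turn them into a pairwise affinity score; the limits, weighting, and names are purely illustrative, not the paper's actual method.

```cuda
#include <algorithm>

// Per-kernel properties relevant to contention, as named on this slide.
struct KernelConfig {
    int threadsPerBlock;
    int sharedMemPerBlockBytes;
};

// Assumed per-SM capacities for a Fermi-class GPU (e.g., Tesla C2050).
constexpr int kMaxThreadsPerSM = 1536;
constexpr int kSharedMemPerSM  = 48 * 1024;

// Illustrative pairwise affinity in [0, 1]: high when two kernels can be
// co-scheduled without exceeding thread or shared-memory capacity on an SM,
// zero when either resource would be over-committed.
double pairwiseAffinity(const KernelConfig& a, const KernelConfig& b) {
    double threadLoad =
        double(a.threadsPerBlock + b.threadsPerBlock) / kMaxThreadsPerSM;
    double shmemLoad =
        double(a.sharedMemPerBlockBytes + b.sharedMemPerBlockBytes) / kSharedMemPerSM;
    double worst = std::max(threadLoad, shmemLoad);
    return worst <= 1.0 ? 1.0 - worst : 0.0;
}
```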

Molding Kernel Configuration

• Perform molding dynamically
• Leverage gVirtuS to intercept the kernel launch
• Flexible for configuration modification
• Mold the configuration to reduce contention
• Potential increase in application latency
• However, may still improve global throughput

22

Two Molding Policies

23

Molding policies:
• Time Sharing with Reduced Threads
• Forced Space Sharing

Example kernel configurations (blocks * threads per block):

14 * 256

7 * 256

14 * 512

14 * 128

• May resolve shared-memory contention
• May reduce register pressure in the SM
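A minimal sketch of the two molding transformations as presented here, assuming a launch configuration expressed as blocks * threads per block; the halving factor and function names are illustrative, not the framework's actual policy parameters (see the example configurations above).

```cuda
// A kernel launch configuration: number of thread blocks and threads per block.
struct LaunchConfig {
    int blocks;
    int threadsPerBlock;
};

// Forced space sharing: reduce the number of blocks so the kernel occupies
// only a subset of the SMs, leaving the rest free for a co-scheduled kernel.
LaunchConfig moldForcedSpaceSharing(LaunchConfig c) {
    c.blocks = (c.blocks + 1) / 2;          // e.g., 14 * 256 -> 7 * 256
    return c;
}

// Time sharing with reduced threads: keep the block count but shrink each
// block, easing per-SM resource pressure when two kernels time-share an SM.
LaunchConfig moldTimeSharingReducedThreads(LaunchConfig c) {
    c.threadsPerBlock = c.threadsPerBlock / 2;   // e.g., 14 * 256 -> 14 * 128
    return c;
}
```

Because molding changes the launch configuration at interception time, the kernel must tolerate a different configuration than the application requested, and an individual application may slow down even when global throughput improves.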

Consolidation Scheduling Algorithm

• Greedy-based scheduling algorithm
• Schedules N kernels on 2 GPUs
• Input: a list of 3-tuple execution configurations for all kernels
• Data structure: a work queue for each virtual context

24

Overall algorithm building blocks:
• Generate Pair-wise Affinity
• Generate Affinity for List
• Get Affinity By Molding

Consolidation Scheduling Algorithm

25

Input: configuration list of all kernels

• Create work queues for the virtual contexts
• Generate pair-wise affinity; find the pair with minimum affinity and split that pair into different queues
• For each remaining kernel:
  – (a1, a2) = Generate Affinity For List with each work queue
  – (a3, a4) = Get Affinity By Molding with each work queue
  – Find max(a1, a2, a3, a4) and push the kernel into the queue that gives the maximum
• Dispatch the queues onto the virtual contexts
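Putting the steps above together, here is a compact sketch of the greedy loop for two GPUs. The affinity helpers are stubs because the slides do not give their formulas, and every name here is illustrative rather than the paper's actual implementation.

```cuda
#include <algorithm>
#include <array>
#include <vector>

// A kernel and its 3-tuple execution configuration.
struct Kernel {
    int blocks;
    int threadsPerBlock;
    int sharedMemBytes;
    bool molded = false;
};

using WorkQueue = std::vector<Kernel>;

// Affinity of a kernel with an existing work queue; stub only, since the
// slides base affinity on thread count and shared-memory pressure but give
// no formula.
double affinityWithQueue(const Kernel&, const WorkQueue&) { return 0.5; }

// Affinity after applying a molding policy to the kernel; also a stub.
double affinityByMolding(const Kernel& k, const WorkQueue& q) {
    Kernel molded = k;
    molded.threadsPerBlock /= 2;   // illustrative molding only
    return affinityWithQueue(molded, q);
}

// Greedy consolidation of N kernels onto 2 GPUs (one work queue each).
std::array<WorkQueue, 2> scheduleOnTwoGPUs(std::vector<Kernel> kernels) {
    std::array<WorkQueue, 2> queues;

    // Step 1: the pair with minimum pairwise affinity goes to different
    // queues (the pairwise search is elided; the first two kernels stand in).
    if (kernels.size() >= 2) {
        queues[0].push_back(kernels[0]);
        queues[1].push_back(kernels[1]);
        kernels.erase(kernels.begin(), kernels.begin() + 2);
    }

    // Step 2: place each remaining kernel where its affinity, with or
    // without molding, is highest.
    for (Kernel& k : kernels) {
        double a1 = affinityWithQueue(k, queues[0]);
        double a2 = affinityWithQueue(k, queues[1]);
        double a3 = affinityByMolding(k, queues[0]);
        double a4 = affinityByMolding(k, queues[1]);
        double best = std::max({a1, a2, a3, a4});
        bool useMolding = (best == a3 || best == a4);
        int  target     = (best == a1 || best == a3) ? 0 : 1;
        if (useMolding) { k.threadsPerBlock /= 2; k.molded = true; }
        queues[target].push_back(k);
    }
    // Each queue is then dispatched to its GPU's virtual context.
    return queues;
}
```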

Outline

• Background
• Understanding Consolidation on GPU
• Framework Design
• Consolidation Decision Making Layer
• Experimental Results
• Conclusions

26

EXPERIMENTAL RESULTS
• Setup, Metric & Baselines
• Benchmarks
• Results

27

Setup, Metric & Baselines

• Setup
  – A machine with two Intel quad-core Xeon E5520 CPUs
  – Two NVIDIA Tesla C2050 GPU cards
    • 14 Streaming Multiprocessors (SMs), each containing 32 cores
    • 3 GB device memory
    • 48 KB shared memory per SM
  – Virtualized with gVirtuS 2.0

• Evaluation metric
  – Global throughput benefit obtained after consolidation of kernels

• Baselines
  – Serialized execution, based on CUDA runtime scheduling
  – Blind round-robin consolidation (unaware of execution configuration)

28

Benchmarks & Goals

29

Benchmarks and their characteristics

Benchmark                     | Memory characteristics      | Data set description
Image Processing (IP)         | No ShMem                    | 2*3584*3584 points
PDE Solver (PDE)              | No ShMem                    | 2*3584*3584 points
BlackScholes (BS)             | No ShMem                    | 1,000,000 options
Binomial Options (BO)         | Low ShMem (up to 3 KB)      | 256 options, 2048 steps
K-Means Clustering (KM)       | Medium ShMem (up to 16 KB)  | 4,194,304 points
K-Nearest Neighbour (KNN)     | Medium ShMem (up to 16 KB)  | 4,194,304 points
Euler (EU)                    | Heavy ShMem (up to 48 KB)   | 10,000 nodes, 60,000 edges
Molecular Dynamics (MD)       | Heavy ShMem (up to 48 KB)   | 130,000 nodes, 16,200,000 edges

Benefits of Space and Time Sharing Mechanisms

30

[Charts: throughput benefits of consolidation by space sharing and by time sharing]

• No resource contention
• Consolidation through the blind round-robin algorithm
• Compared against serialized execution of the kernels

Drawbacks of Blind Scheduling

31

In the presence of resource contention, there is no benefit from consolidation

[Charts: contention due to a large number of threads; shared-memory contention]

Effect of Molding

32

[Charts: effect of molding under contention due to a large number of threads and under shared-memory contention]

Molding policies applied: Time Sharing with Reduced Threads; Forced Space Sharing

Effect of Affinity Scores

33

Kernel configurations:
• 2 kernels with 7*512
• 2 kernels with 14*256

• Without affinity: unbalanced threads per SM
• With affinity: better thread balancing per SM

Benefits under High Contention

34

8 Kernels on 2 GPUs

• 6 out of 8 kernels molded
• 31.5% improvement over blind scheduling
• 50% improvement over serialized execution

Framework Overheads

35

• Without consolidation: compared to plain gVirtuS execution, overhead is always less than 1%
• With consolidation: compared with manually consolidated execution, overhead is always less than 4%

Outline

• Background
• Understanding Consolidation on GPU
• Framework Design
• Consolidation Decision Making Layer
• Experimental Results
• Conclusions

36

Conclusions

• A framework for transparent sharing of GPUs
• Consolidation used as the mechanism for sharing GPUs
• No source-code-level changes
• New affinity and molding methods
• A runtime consolidation scheduling algorithm
• Significant throughput benefits at high contention
• Small framework overheads

37

38

Thank You for your attention!

Questions?

Authors' contact information:
• raviv@cse.ohio-state.edu
• becchim@missouri.edu
• agrawal@cse.ohio-state.edu
• chak@nec-labs.com

Impact of Large Number of Threads

39

Per-Application Slowdown / Choice of Molding

40

[Chart: per-application slowdown and the recommended choice of molding type]