
Page 1: Runtime System and Scheduling Support  for High-End CPU-GPU Architectures Vignesh  Ravi

Runtime System and Scheduling Support for High-End CPU-GPU Architectures

Vignesh Ravi, Dept. of Computer Science and Engineering

Advisor: Gagan Agrawal

1

Page 2: Runtime System and Scheduling Support  for High-End CPU-GPU Architectures Vignesh  Ravi

The Death of Single-core CPU Scaling

2

The Landscape of Computing – Moore’s Law

(Chart: Transistors, Clock Speed, Power, and Efficiency over time)

Until 2004:
• Double the # of transistors
• Simply increase clock frequency
• Of course! Consume more power
• Significantly improved efficiency
• Follows Moore's law

The Free Lunch is over!
• Single-core clock frequency reaches a plateau
• End of Moore's law …
• Alternate processor design required

Since 2005, Now and Future…

• The rise of Multi-core, Many-core architectures …

• Parallel programming …

Page 3: Runtime System and Scheduling Support  for High-End CPU-GPU Architectures Vignesh  Ravi

Rise of Multi-core, Many-core …

3

Multi-core CPUs
• Executive-like: more room for control logic
• 2 – 12 cores
• Clock speed: ~1.8 GHz – 3.3 GHz

Many-core GPUs
• Massive arithmetic, least control
• Specialized co-processing
• In the range of 512 cores
• Clock speed: ~1.2 GHz

(Chart: GFLOPS comparison)

Page 4: Runtime System and Scheduling Support  for High-End CPU-GPU Architectures Vignesh  Ravi

Rise of Heterogeneous Architectures

• Today's High Performance Computing
  – Multi-core CPUs and many-core GPUs are mainstream
• Many-core GPUs offer
  – Excellent "price-performance" & "performance-per-watt"
  – Financial modeling, gas and oil exploration, medical …
• Flavors of heterogeneous computing
  – Multi-core CPUs + GPUs connected over PCI-E
  – Accelerated Processing Units (APU), AMD Fusion
  – Intel MIC, Sandy Bridge, Nvidia Denver …
• Heterogeneous architectures are pervasive
  – Supercomputers & clusters, clouds, desktops, notebooks, tablets, mobiles …

Today’s Computing Platforms are Heterogeneous!

New Challenges are Emerging …

Page 5: Runtime System and Scheduling Support  for High-End CPU-GPU Architectures Vignesh  Ravi

New Challenges

5

(Diagram: Application(s) running on a CPU + GPU heterogeneous architecture)

Question 1: How to benefit from the CPU and GPU simultaneously?
  – CPU/GPU work distribution module
  – Concurrency control/synchronization between CPU/GPU

Question 2: How to improve the utilization of GPUs?
  – Enable sharing of the GPU across different apps.

Question 3: Job scheduling for heterogeneous clusters?
  – Revisit job scheduling for CPU-GPU clusters

Question 4: Mechanisms to debug and profile GPU programs?
  – Tools development for GPUs

Page 6: Runtime System and Scheduling Support  for High-End CPU-GPU Architectures Vignesh  Ravi

My Thesis Focus

6

(Diagram: Application(s) running on a CPU + GPU heterogeneous architecture)

• CPU/GPU work distribution module
• Concurrency control/synchronization between CPU/GPU
• Enable sharing of the GPU across different apps.
• Revisit job scheduling for CPU-GPU clusters
• Tools development for GPUs

Primary Focus

Page 7: Runtime System and Scheduling Support  for High-End CPU-GPU Architectures Vignesh  Ravi

Thesis Contributions

Support for GPU Sharing across Multiple Applications

• Supporting GPU Sharing with a Transparent Runtime Consolidation Framework (HPDC 2011)

Runtime Systems and Dynamic Work Distribution for Heterogeneous Systems
• Compiler and Runtime Support for Enabling Generalized Reductions on Heterogeneous Systems (ICS 2010)
• A Dynamic Scheduling Framework for Emerging Heterogeneous Systems (HiPC 2011)

Job Scheduling for Heterogeneous Clusters
• Scheduling Concurrent Applications on a Cluster of CPU-GPU Nodes (CCGrid 2012)
• Value-Based Scheduling Framework for Modern Heterogeneous Clusters (Under Submission)

Page 8: Runtime System and Scheduling Support  for High-End CPU-GPU Architectures Vignesh  Ravi

Today’s Talk

Support for GPU Sharing across Multiple Applications

• Supporting GPU Sharing with a Transparent Runtime Consolidation Framework (HPDC 2011)

Runtime Systems and Dynamic Work Distribution for Heterogeneous Systems
• Compiler and Runtime Support for Enabling Generalized Reductions on Heterogeneous Systems (ICS 2010)
• A Dynamic Scheduling Framework for Emerging Heterogeneous Systems (HiPC 2011)

Job Scheduling for Heterogeneous Clusters
• Scheduling Concurrent Applications on a Cluster of CPU-GPU Nodes (CCGrid 2012)
• Value-Based Scheduling Framework for Modern Heterogeneous Clusters (Under Submission for SC 2012)

Pre-Candidacy Work

Post-Candidacy Work

Page 9: Runtime System and Scheduling Support  for High-End CPU-GPU Architectures Vignesh  Ravi

9

Outline of Presentation

• Recap of Pre-Candidacy Work
  – Runtime System and Work Distribution
  – GPU Sharing Through Runtime Consolidation Framework
• Post-Candidacy Work
  – Concurrent Job Scheduling to Improve Global Throughput
  – Value-based Job Scheduling
• Future Work
• Thesis Conclusions

Page 10: Runtime System and Scheduling Support  for High-End CPU-GPU Architectures Vignesh  Ravi

10

Outline of Presentation

• Recap of Pre-Candidacy Work
  – Runtime System and Work Distribution
  – GPU Sharing Through Runtime Consolidation Framework
• Post-Candidacy Work
  – Concurrent Job Scheduling to Improve Global Throughput
  – Value-based Job Scheduling
• Future Work
• Thesis Conclusions

Page 11: Runtime System and Scheduling Support  for High-End CPU-GPU Architectures Vignesh  Ravi

Motivation

• In HPC, the demand for computing is ever increasing
  – CPU+GPU platforms expose huge raw processing power
• Top 6 supercomputers
  – Heterogeneous: utilization is under ~50%
  – Homogeneous: utilization is about 80%
• Application development for multi-core CPU and GPU is still independent
  – "No established mechanism" to exploit the aggregate power
• Can computations benefit from simultaneously utilizing the CPU and GPU?

11

Page 12: Runtime System and Scheduling Support  for High-End CPU-GPU Architectures Vignesh  Ravi

Runtime System and Work Distribution for CPU-GPU Architectures

• Focus on specific classes of computation patterns
  – Generalized Reduction Structure
  – Structured Grid Computations
• Improve application developer productivity
  – Facilitate high-level API support
  – Hide parallelization difficulties through runtime support
• Improve efficiency
  – Dynamic work distribution between CPU & GPU (see the sketch after this slide)
• Show significant performance improvements
  – Up to 63% for generalized reduction structures
  – Up to 75% for structured grid computations

12
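The dynamic work distribution mentioned above can be pictured with a small sketch. This is not the thesis runtime; it is a minimal illustration, assuming a shared chunk queue, a fixed chunk size, and placeholder cpu_process/gpu_process functions standing in for the real CPU and GPU kernels.

```python
import queue
import threading

def distribute(work_items, cpu_process, gpu_process, chunk_size=1024):
    # Fill a shared queue of chunks; each device pulls its next chunk as soon
    # as it finishes the previous one, so faster devices process more work.
    chunks = queue.Queue()
    for start in range(0, len(work_items), chunk_size):
        chunks.put(work_items[start:start + chunk_size])

    results, lock = [], threading.Lock()

    def worker(process_fn):
        while True:
            try:
                chunk = chunks.get_nowait()
            except queue.Empty:
                return
            out = process_fn(chunk)       # stand-in for a CPU or GPU kernel
            with lock:
                results.append(out)

    threads = [threading.Thread(target=worker, args=(fn,))
               for fn in (cpu_process, gpu_process)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

# Example: both "devices" just sum their chunks.
data = list(range(100_000))
print(sum(distribute(data, sum, sum)))
```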

Page 13: Runtime System and Scheduling Support  for High-End CPU-GPU Architectures Vignesh  Ravi

13

Outline of Presentation

• Recap of Pre-Candidacy Work
  – Runtime System and Work Distribution
  – GPU Sharing Through Runtime Consolidation Framework
• Post-Candidacy Work
  – Concurrent Job Scheduling to Improve Global Throughput
  – Value-based Job Scheduling
• Future Work
• Thesis Conclusions

Page 14: Runtime System and Scheduling Support  for High-End CPU-GPU Architectures Vignesh  Ravi

Motivation

• Emergence of the Cloud – the "pay-as-you-go" model
  – Cluster instances, high-speed interconnects for HPC users
  – Amazon, Nimbix, SoftLayer offer GPU instances
• Sharing is the basis of the cloud; GPUs are no exception
  – Multiple virtual machines may share a physical node
• Modern GPUs are more expensive than multi-core CPUs
  – Fermi cards with 6 GB memory cost about $4,000
  – Need better resource utilization
• Modern GPUs expose a high degree of parallelism
  – Applications may not utilize the full potential

14

Sharing a GPU is necessary, but how?

Page 15: Runtime System and Scheduling Support  for High-End CPU-GPU Architectures Vignesh  Ravi

GPU Sharing Through Runtime Consolidation Framework

• Software framework to enable GPU sharing
  – Extended gVirtuS, an open-source call-interception tool
  – GPU sharing through kernel consolidation & virtual contexts
• Basic GPU-sharing mechanisms
  – Time- and space-sharing
• Solutions to the GPU kernel consolidation problem
  – Affinity score, to predict the benefit of consolidation
  – Kernel molding policies, to handle high resource contention
  – Overall scheduling algorithm for multiple GPUs
• Show significant global throughput improvements
  – Up to 50% improvement using advanced sharing policies

15

Page 16: Runtime System and Scheduling Support  for High-End CPU-GPU Architectures Vignesh  Ravi

16

Outline of Presentation

• Recap of Pre-Candidacy Work
  – Runtime System and Work Distribution
  – GPU Sharing Through Runtime Consolidation Framework
• Post-Candidacy Work
  – Concurrent Job Scheduling to Improve Global Throughput
  – Value-based Job Scheduling
• Future Work
• Thesis Conclusions

Page 17: Runtime System and Scheduling Support  for High-End CPU-GPU Architectures Vignesh  Ravi

Motivation

• The software stack to program CPU-GPU architectures has evolved
  – Combination of (Pthreads/OpenMP …) + (CUDA/Stream)
  – Now, OpenCL is becoming more popular
• OpenCL, a device-agnostic platform
  – Offers great flexibility with portable solutions
  – Write the kernel once, execute on any device
• Supercomputers and cloud environments are typically "shared"
  – Accelerate a set of applications as opposed to a single application
  – The "job scheduler" is a critical component of the software stack
• Today's schedulers (like TORQUE) for heterogeneous clusters:
  – DO NOT exploit the portability offered by OpenCL
  – Require user-guided mapping of jobs to heterogeneous resources
  – Do not consider desirable & advanced scheduling possibilities

17

Revisit scheduling problems for CPU-GPU clusters:
1) Exploit the portability offered by models like OpenCL
2) Automatic mapping of jobs to resources
3) Desirable advanced scheduling considerations

Page 18: Runtime System and Scheduling Support  for High-End CPU-GPU Architectures Vignesh  Ravi

Problem Formulations

Problem Goal:
• Accelerate a set of applications on a CPU-GPU cluster
• Each node has two resources: a multi-core CPU and a GPU
• Map applications to resources to:
  – Maximize overall system throughput
  – Minimize application latency

Scheduling Formulations:
1) Single-Node, Single-Resource Allocation & Scheduling
2) Multi-Node, Multi-Resource Allocation & Scheduling

18

Page 19: Runtime System and Scheduling Support  for High-End CPU-GPU Architectures Vignesh  Ravi

Scheduling Formulations

Single-Node, Single-Resource Allocation & Scheduling
• Allocates a multi-core CPU or a GPU from a node in the cluster
  – Benchmarks like Rodinia (UVA) & Parboil (UIUC) contain single-node apps.
  – Limited mechanisms to exploit CPU+GPU simultaneously
• Exploits the portability offered by the OpenCL programming model

Multi-Node, Multi-Resource Allocation & Scheduling
• In addition, allows CPU+GPU allocation
  – Desirable in the future to allow flexibility in the acceleration of applications
• In addition, allows multiple node allocation per job
  – MATE-CG [IPDPS'12], a framework for the Map-Reduce class of apps., allows such implementations

Page 20: Runtime System and Scheduling Support  for High-End CPU-GPU Architectures Vignesh  Ravi

Challenges and Solution Approach

Decision-Making Challenges:
• Allocate/map to CPU-only, GPU-only, or CPU+GPU?
• Wait for the optimal resource (involves queuing delay)?
• Assign to a non-optimal resource (involves penalty)?
• Always allocating CPU+GPU may affect global throughput
  – Should consider other possibilities like CPU-only or GPU-only
• Always allocate the requested # of nodes?
  – May increase wait time; can consider allocating fewer nodes

Solution Approach:
• Take different levels of user input (relative speedups, execution times …)
• Design scheduling schemes for each scheduling formulation

Page 21: Runtime System and Scheduling Support  for High-End CPU-GPU Architectures Vignesh  Ravi

Scheduling Schemes for First Formulation

21

Two Input Categories & Three Schemes: categories are based on the amount of input expected from the user

Category 1: Relative multi-core (MP) and GPU (GP) performance as input
  Scheme 1: Relative Speedup based w/ Aggressive Option (RSA)
  Scheme 2: Relative Speedup based w/ Conservative Option (RSC)

Category 2: Additionally, sequential CPU execution time (SQ)
  Scheme 3: Adaptive Shortest Job First (ASJF)

Page 22: Runtime System and Scheduling Support  for High-End CPU-GPU Architectures Vignesh  Ravi

Relative-Speedup Aggressive (RSA) or Conservative (RSC)

22

(Flowchart) Inputs: N jobs, MP[n], GP[n]
• Takes multi-core and GPU speedups as input; creates CPU/GPU queues and maps each job to its optimal resource queue
• Create CJQ and GJQ; enqueue jobs into the queues by (GP – MP); sort CJQ and GJQ in descending order
• R = GetNextResourceAvailable()
• If R is a GPU and GJQ is non-empty, assign the top of GJQ to R
• If GJQ is empty: the aggressive option (RSA) assigns the bottom of CJQ to R, minimizing the non-optimal penalty; the conservative option (RSC) waits for a CPU
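A minimal sketch of the RSA/RSC idea, under the assumption that each job is a (name, MP speedup, GP speedup) tuple; the queue layout and ordering below are illustrative, not the thesis implementation.

```python
def build_queues(jobs):
    """jobs: list of (name, mp_speedup, gp_speedup). Each job is queued on its
    faster resource; queues are ordered so the job least hurt by moving to the
    other resource sits at the bottom."""
    cpu_q = sorted((j for j in jobs if j[1] >= j[2]),
                   key=lambda j: j[1] - j[2], reverse=True)
    gpu_q = sorted((j for j in jobs if j[2] > j[1]),
                   key=lambda j: j[2] - j[1], reverse=True)
    return cpu_q, gpu_q

def next_job(resource, cpu_q, gpu_q, aggressive=True):
    own_q, other_q = (gpu_q, cpu_q) if resource == "GPU" else (cpu_q, gpu_q)
    if own_q:
        return own_q.pop(0)   # top job: benefits most from this resource
    if aggressive and other_q:
        return other_q.pop()  # RSA: bottom job of the other queue, smallest penalty
    return None               # RSC: leave the resource idle and wait

cpu_q, gpu_q = build_queues([("kmeans", 7.8, 12.1), ("fdtd", 7.6, 2.2),
                             ("pde", 6.8, 4.7)])
print(next_job("GPU", cpu_q, gpu_q))                     # kmeans (optimal on GPU)
print(next_job("GPU", cpu_q, gpu_q))                     # pde, stolen under RSA
print(next_job("GPU", cpu_q, gpu_q, aggressive=False))   # None under RSC
```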

Page 23: Runtime System and Scheduling Support  for High-End CPU-GPU Architectures Vignesh  Ravi

Adaptive Shortest Job First (ASJF)

23

(Flowchart) Inputs: N jobs, MP[n], GP[n], SQ[n]
• Create CJQ and GJQ; enqueue jobs into the queues by (GP – MP)
• Sort CJQ and GJQ in ascending order of execution time (minimizes latency for short jobs)
• R = GetNextResourceAvailable()
• If R is a GPU and GJQ is non-empty, assign the top of GJQ to R
• If GJQ is empty:
  – T1 = GetMinWaitTimeForNextCPU()
  – T2k = GetJobWithMinPenOnGPU(CJQ), job k's execution time (with penalty) on the GPU
  – If T1 > T2k, assign CJQ job k to R; otherwise wait for a CPU to become free or for GPU jobs
• The T1 vs. T2k test acts as an automatic switch between the aggressive and conservative options
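The aggressive/conservative switch in ASJF can be sketched as below; the job fields (seq_time, gp) and the wait-time estimate are assumptions for illustration, not the thesis implementation.

```python
def asjf_pick_for_idle_gpu(cpu_queue, next_cpu_free_in):
    """cpu_queue: jobs waiting for a CPU, each a dict with 'seq_time' (sequential
    CPU time) and 'gp' (GPU speedup). Called when a GPU is idle and its own
    queue is empty."""
    if not cpu_queue:
        return None
    # Job with the smallest execution time on the GPU (i.e., least penalty there).
    best = min(cpu_queue, key=lambda j: j["seq_time"] / j["gp"])
    t2 = best["seq_time"] / best["gp"]
    if next_cpu_free_in > t2:      # running non-optimally beats waiting: aggressive
        cpu_queue.remove(best)
        return best
    return None                    # otherwise be conservative and wait for a CPU

waiting = [{"name": "fdtd", "seq_time": 8.4, "gp": 2.2},
           {"name": "pde", "seq_time": 7.3, "gp": 4.7}]
print(asjf_pick_for_idle_gpu(waiting, next_cpu_free_in=5.0))  # picks "pde"
```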

Page 24: Runtime System and Scheduling Support  for High-End CPU-GPU Architectures Vignesh  Ravi

Scheduling Scheme for Second Formulation

24

Solution Approach:
• Flexibly schedule on CPU-only, GPU-only, or CPU+GPU
• Mold the # of nodes requested by a job
  – Consider allocating ½ or ¼ of the requested nodes

Inputs from the User:
• Execution times for CPU-only, GPU-only, and CPU+GPU
• Execution times of jobs with n, n/2, and n/4 nodes
• Such application information can also be obtained from profiles

Page 25: Runtime System and Scheduling Support  for High-End CPU-GPU Architectures Vignesh  Ravi

Flexible Moldable Scheduling Scheme (FMS)

25

(Flowchart) Inputs: N jobs, execution times …
• Group jobs with the # of nodes as the index (minimizes resource fragmentation and helps co-locate a CPU job and a GPU job on the same node)
• Sort each group by the execution time of the CPU+GPU version (gives a global view for co-locating on the same node)
• Pick a pair of jobs to schedule in the order of sorting
• For each job, find the fastest completion option among T(i,n,C), T(i,n,G), and T(i,n,CG)
• If one job chooses C and the other chooses G, co-locate the jobs on the same set of nodes
• If both jobs choose the same resource, i.e., (C,C), (G,G), or (CG,CG), and 2N nodes are available, schedule the pair in parallel on 2N nodes
• Otherwise, consider molding by resource type (if CG was chosen) and consider molding the # of nodes for the next job
(A simplified sketch of the per-job decision follows.)
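A simplified sketch of the per-job decision in FMS: pick each job's fastest option among CPU-only (C), GPU-only (G), and CPU+GPU (CG), and co-locate a pair whose choices are complementary. The table T of execution times and the job names are hypothetical; molding itself is only indicated, not implemented.

```python
def fastest_option(T, job, n):
    """Return (resource, time) with the smallest execution time among
    CPU-only ('C'), GPU-only ('G'), and CPU+GPU ('CG') on n nodes."""
    return min(((r, T[(job, n, r)]) for r in ("C", "G", "CG")),
               key=lambda rt: rt[1])

def schedule_pair(T, job_a, job_b, n):
    ra, _ = fastest_option(T, job_a, n)
    rb, _ = fastest_option(T, job_b, n)
    if {ra, rb} == {"C", "G"}:
        # Complementary choices: both jobs can share the same set of n nodes.
        return [(job_a, ra, "co-located"), (job_b, rb, "co-located")]
    # Same choice: needs 2n free nodes, or molding by resource type / node count.
    return [(job_a, ra, "needs 2n nodes or molding"),
            (job_b, rb, "needs 2n nodes or molding")]

T = {("em", 4, "C"): 90, ("em", 4, "G"): 35, ("em", 4, "CG"): 40,
     ("pagerank", 4, "C"): 45, ("pagerank", 4, "G"): 80, ("pagerank", 4, "CG"): 50}
print(schedule_pair(T, "em", "pagerank", n=4))
```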

Page 26: Runtime System and Scheduling Support  for High-End CPU-GPU Architectures Vignesh  Ravi

Cluster Hardware Setup

26

• Cluster of 16 CPU-GPU nodes
• Each CPU is an 8-core Intel Xeon E5520 (2.27 GHz)
• Each GPU is an Nvidia Tesla C2050 (1.15 GHz)
• CPU main memory – 48 GB
• GPU device memory – 3 GB
• Machines are connected through InfiniBand

Page 27: Runtime System and Scheduling Support  for High-End CPU-GPU Architectures Vignesh  Ravi

Benchmarks

27

Single-Node Jobs
• We use 10 benchmarks
  – Scientific, financial, data mining, and image processing applications
• Run each benchmark with 3 different execution configurations
• Overall, a pool of 30 jobs

Multi-Node Jobs
• We use 3 applications
  – Gridding kernel, Expectation-Maximization, PageRank
• Applications run with 2 different datasets and on 3 different node counts
• Overall, a pool of 18 jobs

Page 28: Runtime System and Scheduling Support  for High-End CPU-GPU Architectures Vignesh  Ravi

Baselines & Metrics

28

Baselines for Single-Node Jobs
• Blind Round Robin (BRR)
• Manual Optimal (exhaustive search, upper bound)

Baselines for Multi-Node Jobs
• TORQUE, a widely used resource manager for heterogeneous clusters
• Minimum Completion Time (MCT) [Maheswaran et al., HCW'99]

Metrics
• Completion time (Comp. Time)
• Application latency:
  – Non-optimal assignment (Ave. NOA Lat.)
  – Queuing delay (Ave. QD Lat.)
• Maximum idle time (Max. Idle Time)

Page 29: Runtime System and Scheduling Support  for High-End CPU-GPU Architectures Vignesh  Ravi

Single-Node Job Results

29

(Charts: BRR, RSA, RSC, ASJF, and Manual Optimal on four metrics – Comp. Time, Ave. NOA Lat., Ave. QD Lat., and Max. Idle Time – normalized over the best case, for a uniform CPU-GPU job mix and a CPU-biased job mix)

• 24 jobs on 2 nodes; proposed schemes evaluated on 4 different metrics
• Proposed schemes are up to 108% better than BRR and within 12% of Manual Optimal
• Tradeoff between the non-optimal penalty vs. the wait time for a resource
• BRR has the highest latency; RSA incurs non-optimal penalty; RSC incurs high queuing delay; ASJF is as good as Manual Optimal
• BRR has very high idle times; RSC can be very high too; RSA has the best utilization among the proposed schemes

Page 30: Runtime System and Scheduling Support  for High-End CPU-GPU Architectures Vignesh  Ravi

Multi-Node Job Results

30

(Charts: normalized completion time for Torque, MCT, Molding ResType Only, Molding NumNodes Only, and Molding ResType+NumNodes (FMS), under varying job execution lengths – short-job (SJ) / long-job (LJ) mixes – and varying resource request sizes – small-request (SR) / large-request (LR) mixes)

• Proposed schemes: 32 jobs on 16 nodes
• Varying execution lengths: FMS is 42% better than the best of Torque or MCT
• Each type of molding alone gives a reasonable improvement
• Our schemes utilize the resources better, giving higher throughput
• Intelligent in deciding whether to wait for a resource or mold the job for a smaller one
• Varying request sizes: FMS is 32% better than the best of Torque or MCT
• The benefit from ResType molding is larger than from NumNodes molding

Page 31: Runtime System and Scheduling Support  for High-End CPU-GPU Architectures Vignesh  Ravi

Summary

31

• Revisited scheduling problems on CPU-GPU clusters
  – Goal: improve aggregate throughput
  – Single-node, single-resource scheduling problem
  – Multi-node, multi-resource scheduling problem
• Developed novel scheduling schemes
  – Exploit the portability offered by OpenCL
  – Automatic mapping of jobs to heterogeneous resources
  – RSA, RSC, and ASJF for single-node jobs
  – Flexible Moldable Scheduling (FMS) for multi-node jobs
• Significant improvement over the state-of-the-art

Page 32: Runtime System and Scheduling Support  for High-End CPU-GPU Architectures Vignesh  Ravi

32

Outline of Presentation

• Recap of Pre-Candidacy Work
  – Runtime System and Work Distribution
  – GPU Sharing Through Runtime Consolidation Framework
• Post-Candidacy Work
  – Concurrent Job Scheduling to Improve Global Throughput
  – Value-based Job Scheduling
• Future Work
• Thesis Conclusions

Page 33: Runtime System and Scheduling Support  for High-End CPU-GPU Architectures Vignesh  Ravi

Motivation

• Previously, the goal was to improve overall global throughput & latency
• Other desirable goals exist for supercomputer and cloud environments
  – Market-based scheduling goals (provider's profit and user satisfaction)
  – E.g., MOAB (with SLAs) for supercomputers and large clusters
  – E.g., Amazon classifies users as Free, Spot, On-Demand, Reserved
  – Each user has different levels of importance and satisfaction
• Supercomputers and clouds engage massively parallel resources
  – Multi-core CPUs with 16 cores, GPUs with 512 cores
  – Recent announcement of MIC (about 50-60 cores) in Stampede
  – Efficient resource utilization is important
• Today's schedulers (like TORQUE) for heterogeneous clusters:
  – Have no notion of market-based scheduling
  – Require user-guided mapping of jobs to heterogeneous resources
  – Lack the ability/schemes to share massively parallel resources

33

Revisit scheduling problems for CPU-GPU clusters:
1) Exploit the portability offered by models like OpenCL
2) Automatic mapping of jobs to resources
3) Market-based scheduling considerations
4) Schemes to enable automatic sharing of resources

Page 34: Runtime System and Scheduling Support  for High-End CPU-GPU Architectures Vignesh  Ravi

Value Function

34

• Each job is attached with a value function
• Linear-decay value function [Irwin et al., HPDC'04]
  – Maximum value → importance/priority
  – Decay rate → urgency
• Value functions with different shapes
  – Can represent different SLAs, e.g., a step function
• Yield is obtained after job completion, defined as:

  Yield = maxValue – decay * delay

• Delay can be a sum of any of four components
  – Queuing, non-optimal penalty, sharing 1-core penalty, sharing CPU/GPU penalty
• Yield represents both the "provider's profit" and "user satisfaction"

We believe that the value function provides a rich, yet simple, formulation for market-based scheduling.
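A minimal sketch of how the yield would be computed from the linear-decay value function above; the step-function variant is included only to illustrate a different SLA shape, and all numbers are made up.

```python
def linear_yield(max_value, decay, delay):
    # Yield = maxValue - decay * delay (the linear-decay value function).
    return max_value - decay * delay

def step_yield(max_value, deadline, delay):
    # A step-shaped value function: full value before the deadline, none after.
    return max_value if delay <= deadline else 0.0

# Delay as a sum of its components: queuing + non-optimal penalty
# + sharing 1-core penalty + sharing CPU/GPU penalty (numbers are made up).
delay = 12.0 + 4.5 + 0.0 + 1.5
print(linear_yield(max_value=100.0, decay=2.0, delay=delay))    # 64.0
print(step_yield(max_value=100.0, deadline=20.0, delay=delay))  # 100.0
```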

Page 35: Runtime System and Scheduling Support  for High-End CPU-GPU Architectures Vignesh  Ravi

Scheduling Problem Formulation

• Given a heterogeneous cluster with each node containing:
  – 1 multi-core CPU and 1 GPU
• Schedule a set of jobs on the cluster
  – To maximize the aggregate yield
• Allocates a multi-core CPU or a GPU from a node in the cluster
  – Does not allocate both the multi-core CPU and the GPU to one job
  – Does not allocate multiple nodes to a job
  – These are considerations for future work
• Exploits the portability offered by the OpenCL programming model
  – Flexibly maps a job onto either the CPU or the GPU
• Allows sharing of a multi-core CPU or a GPU
  – Up to two jobs per resource
  – Limited to space-sharing

Page 36: Runtime System and Scheduling Support  for High-End CPU-GPU Architectures Vignesh  Ravi

Overall Scheduling Approach

36

(Flow diagram) Jobs arrive in batches → initial mapping and ordering: push each job into its optimal resource queue and sort → enqueue into the CPU queue / GPU queue → sort jobs in each queue to improve yield → execute on the CPU / GPU

• This flow applies when both job queues are non-empty
• When a resource (e.g., the CPU) is free but its job queue is empty, the resource will be idle
  – We propose various schemes for dynamic re-mapping in this case

Page 37: Runtime System and Scheduling Support  for High-End CPU-GPU Architectures Vignesh  Ravi

Heuristics for Different Stages

• Initial mapping & ordering of queues
  – Initial assignment of jobs to a queue: based on optimal walltime
  – Sorting of jobs in the queue: adapt Reward [earlier work: HPDC'04] to our formulation
• Dynamic re-mapping of jobs to a non-optimal resource
  – Uncoordinated schemes (three new heuristics): Last Optimal Reward (LOR), First Non-Optimal Reward (FNOR), Last Non-Optimal Reward Penalty (LNORP)
  – Coordinated scheme (one new heuristic): Coordinated Least Penalty (CORLP)
• Sharing jobs on a single type of resource (one new heuristic)
  – Scalability-Decay factor, top K fraction [K is tunable]

37

Page 38: Runtime System and Scheduling Support  for High-End CPU-GPU Architectures Vignesh  Ravi

Sorting Jobs in the Queues

• The Reward heuristic is based on two market-based terms
  – Present (discounted gain) value
  – Opportunity cost
• Present Value (PV)
  – Value gained after time 't', after discounting the risk of running the job
  – Receiving $1,000 now is worth more than $1,000 five years from now
  – The shorter the job, the lower the risk
• Opportunity Cost (Cost)
  – Degradation cost of an alternative to pursuing a certain action
  – Prefer high-decay jobs over low-decay jobs
  – In our case, the cost of choosing a job 'i' over a job 'j'
• Reward
  – Choose the job with the highest reward to schedule on the corresponding resource

  PV_i / OptimalWT_i = yield_i / (1 + dis_rate * OptimalWT_i)
  Cost_i / OptimalWT_i = (Σ_{j=0..n} decay_j) – decay_i
  Reward_i = (PV_i – Cost_i) / OptimalWT_i
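The Reward ordering can be sketched directly from the formulas above. The discount rate, the job field names, and the yield estimates are assumptions; only the structure of the computation follows the slide.

```python
def reward(job, queued, dis_rate=0.05):
    """job: dict with 'yield_est', 'decay', 'optimal_wt'; queued: all jobs on the
    same resource queue. Returns Reward_i = (PV_i - Cost_i) / OptimalWT_i,
    computed via the normalized forms given on the slide."""
    pv_norm = job["yield_est"] / (1.0 + dis_rate * job["optimal_wt"])
    cost_norm = sum(j["decay"] for j in queued) - job["decay"]
    return pv_norm - cost_norm

def sort_by_reward(queued, dis_rate=0.05):
    # Highest-reward job first, i.e., the next job to schedule on this resource.
    return sorted(queued, key=lambda j: reward(j, queued, dis_rate), reverse=True)

q = [{"name": "kmeans", "yield_est": 80.0, "decay": 2.0, "optimal_wt": 130.0},
     {"name": "pde",    "yield_est": 30.0, "decay": 0.5, "optimal_wt": 1.6}]
print([j["name"] for j in sort_by_reward(q)])  # short, low-risk job first
```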

Page 39: Runtime System and Scheduling Support  for High-End CPU-GPU Architectures Vignesh  Ravi

Dynamic Remapping – Uncoordinated Schemes

• Apply only when the resource is idle and its job queue is empty
  – Idle resources reduce utilization, and hence the overall yield (considering waiting jobs in the other queue)
  – Dynamically assign a job to the non-optimal resource from that job's optimal queue
• Three schemes based on two key aspects
  – Which job will have the best reward on the non-optimal resource?
  – Which job will suffer the least reward penalty?

1. Last Optimal Reward (LOR)
  – Exploits the "reward score" already computed on each queue for each job
  – Simply chooses the job with the least reward from the optimal resource queue
  – It already has the least reward on its optimal resource, so there is the least risk in moving it
  – O(N) to seek the last job in the queue

39

Page 40: Runtime System and Scheduling Support  for High-End CPU-GPU Architectures Vignesh  Ravi

Dynamic Remapping – Uncoordinated Schemes

2. First Non-Optimal Reward (FNOR)
  – Computes the reward each job could produce on the non-optimal resource
  – Explicitly considers the non-optimal penalty
  – Moves the job with the highest reward on the non-optimal resource
  – O(N log N) to sort by the newly computed reward

3. Last Non-Optimal Reward Penalty (LNORP)
  – FNOR fails to consider reward degradation
  – LNORP computes the reward degradation on the non-optimal resource
  – Moves the job with the least reward degradation

  Suff_factor_i = Non-OptimalWT_i / OptimalWT_i
  Non-OptimalReward_i = OptimalReward_i / Suff_factor_i
  Non-OptimalRewardPenalty_i = OptimalReward_i – Non-OptimalReward_i
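A compact sketch of the three uncoordinated picks, using the suffering-factor formulas above; the job fields (optimal_reward, optimal_wt, nonopt_wt) are assumed for illustration.

```python
def suffering_factor(job):
    # Suff_factor_i = Non-OptimalWT_i / OptimalWT_i
    return job["nonopt_wt"] / job["optimal_wt"]

def pick_lor(queue):
    # LOR: job with the least reward on its own (optimal) resource.
    return min(queue, key=lambda j: j["optimal_reward"])

def pick_fnor(queue):
    # FNOR: job with the highest reward on the non-optimal resource.
    return max(queue, key=lambda j: j["optimal_reward"] / suffering_factor(j))

def pick_lnorp(queue):
    # LNORP: job with the least reward degradation when moved.
    return min(queue, key=lambda j: j["optimal_reward"]
                                    - j["optimal_reward"] / suffering_factor(j))

q = [{"name": "knn", "optimal_reward": 9.0, "optimal_wt": 29.0, "nonopt_wt": 36.0},
     {"name": "mc",  "optimal_reward": 4.0, "optimal_wt": 1.2,  "nonopt_wt": 5.7}]
print(pick_lor(q)["name"], pick_fnor(q)["name"], pick_lnorp(q)["name"])
```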

Page 41: Runtime System and Scheduling Support  for High-End CPU-GPU Architectures Vignesh  Ravi

Dynamic Remapping – Coordinated Scheme

• Applies even when the resource is not idle and its job queue is non-empty
  – It may be necessary to move a job from one queue to another due to imbalance
  – Gives a better global view of both queues
• Factors affecting imbalance
  – Decay rates of jobs across queues
  – Execution lengths (or queuing delays) of jobs across queues
• For coordination across queues
  – Determine when coordination is required
  – If coordination is required, a heuristic decides which job to move
• Detecting when coordination is required
  – Total Queuing-Delay Decay-Rate Product (TQDP), for each queue 'i':

    TQDP_i = Σ_{j=0..n} Queuing_delay_j * decay_j

• Heuristic for picking a job to move
  – Move the job with the least non-optimal penalty
• Coordinated Least Penalty (CORLP)
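A sketch of the coordination test based on TQDP. The imbalance threshold and the penalty measure are assumptions; the slide only specifies the TQDP quantity and moving the job with the least non-optimal penalty.

```python
def tqdp(queue):
    # TQDP_i = sum over queued jobs of (queuing delay * decay rate).
    return sum(j["queuing_delay"] * j["decay"] for j in queue)

def corlp_move(cpu_q, gpu_q, imbalance_threshold=2.0):
    heavy, light = (cpu_q, gpu_q) if tqdp(cpu_q) >= tqdp(gpu_q) else (gpu_q, cpu_q)
    if tqdp(heavy) < imbalance_threshold * max(tqdp(light), 1e-9):
        return None   # queues are balanced enough; no coordinated move
    # Move the job from the heavy queue with the least non-optimal penalty.
    job = min(heavy, key=lambda j: j["nonopt_wt"] - j["optimal_wt"])
    heavy.remove(job)
    light.append(job)
    return job

cpu_q = [{"name": "pca", "queuing_delay": 40.0, "decay": 1.5,
          "optimal_wt": 25.0, "nonopt_wt": 44.0},
         {"name": "bs",  "queuing_delay": 30.0, "decay": 1.0,
          "optimal_wt": 0.4, "nonopt_wt": 1.2}]
gpu_q = [{"name": "mc", "queuing_delay": 5.0, "decay": 0.5,
          "optimal_wt": 1.2, "nonopt_wt": 5.7}]
print(corlp_move(cpu_q, gpu_q)["name"])   # "bs": smallest non-optimal penalty
```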

Page 42: Runtime System and Scheduling Support  for High-End CPU-GPU Architectures Vignesh  Ravi

Heuristic for Sharing

• Allow up to two jobs to space-share a resource
  – E.g., on a multi-core CPU with 8 cores, 2 jobs each use 4 cores
  – Penalties from time-sharing can be high due to more resource contention
• Factors affecting sharing
  – Jobs will use half the resources and so will incur a slowdown
  – On the other hand, more resources become available overall
• Jobs/applications
  – Can be categorized as low-, medium-, or high-scaling (based on models/profiling)
  – Some jobs are less urgent than others
• "When" to enable sharing?
  – When a large fraction of jobs in the pending queues have negative yield
• "Who" are the candidates to share? (Scalability-DecayRate factor)
  – Jobs are grouped in order of low to high scalability
  – Within each group, jobs are ordered by decay rate
  – Pick the top K fraction of jobs, where 'K' is tunable (low scalability, low decay)
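The "who shares" selection can be sketched as a simple grouping and top-K pick; the scalability classes and the K value are illustrative assumptions.

```python
def sharing_candidates(pending_jobs, k_fraction=0.3):
    """pending_jobs: dicts with 'scalability' in {'low','medium','high'} and
    'decay'. Returns the top K fraction: lowest-scaling, lowest-decay jobs,
    which lose the least by space-sharing a resource."""
    rank = {"low": 0, "medium": 1, "high": 2}
    ordered = sorted(pending_jobs,
                     key=lambda j: (rank[j["scalability"]], j["decay"]))
    take = max(1, int(len(ordered) * k_fraction))
    return ordered[:take]

jobs = [{"name": "pca", "scalability": "low",  "decay": 0.4},
        {"name": "mc",  "scalability": "high", "decay": 0.1},
        {"name": "knn", "scalability": "low",  "decay": 1.2}]
print([j["name"] for j in sharing_candidates(jobs, k_fraction=0.5)])  # ['pca']
```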

Page 43: Runtime System and Scheduling Support  for High-End CPU-GPU Architectures Vignesh  Ravi

43

High-Level Scheduler Framework Design (architecture diagram)
• Master node: cluster-level scheduler holding the scheduling schemes & policies, a TCP communicator, and the submission, pending, execution, and finished queues
• Compute nodes (each with a multi-core CPU and a GPU): a node-level scheduler that registers with the master, a TCP communicator, CPU-job and GPU-job execution thread(s), and the GPU consolidation framework

Page 44: Runtime System and Scheduling Support  for High-End CPU-GPU Architectures Vignesh  Ravi

44

GPU Sharing Framework (architecture diagram)
• Front end: CUDA applications (CUDA App1, CUDA App2), each linked against an interception library
• Front end – back end communication channel: workloads arrive from the front end
• Back end: the GPU consolidation framework – a back-end server with a dispatcher, virtual contexts, and workload consolidators that queue workloads through a virtual-context ready queue, on top of the CUDA runtime, the CUDA driver, and the GPUs (GPU1 … GPUn)

Page 45: Runtime System and Scheduling Support  for High-End CPU-GPU Architectures Vignesh  Ravi

Cluster Hardware Setup

45

• Cluster of 16 CPU-GPU nodes
• Each CPU is an 8-core Intel Xeon E5520 (2.27 GHz), with 48 GB main memory
• Each GPU is an Nvidia Tesla C2050 (1.15 GHz), with 3 GB device memory

Benchmarks
• We use 10 benchmarks
  – Scientific, financial, data mining, and image processing applications
• Run each benchmark with 3 different execution configurations
• Overall, a pool of 30 jobs

Baselines
• TORQUE, a widely used resource manager for heterogeneous clusters
• Minimum Completion Time (MCT) [Maheswaran et al., HCW'99]

Metrics
• Completion time
• Application latency
• Average yield

Page 46: Runtime System and Scheduling Support  for High-End CPU-GPU Architectures Vignesh  Ravi

Comparison with Torque-based Metrics

46

(Chart: completion time and average latency, for a uniform mix (UM) and a biased mix (BM), normalized over the best case, for TORQUE, MCT, LOR, FNOR, LNORP, and CORLP)

• The baselines and our schemes target two different sets of metrics; here we examine how our schemes perform on Torque-based metrics
• In all cases, we run 256 jobs on a 16-node cluster
• Completion time: 10% to 22% better
  – Efficient use of resources (no idle time)
  – Idle time outweighs the non-optimal penalty
  – Worse with the biased mix (BM)
• Average latency: 20% better
  – Our schemes may prefer short jobs, reducing latency
  – Also minimizes the non-optimal penalty
  – Also reduces queuing delay

Page 47: Runtime System and Scheduling Support  for High-End CPU-GPU Architectures Vignesh  Ravi

Results with Average Yield Metric

47

(Charts: relative average yield for TORQUE, MCT, LOR, FNOR, LNORP, and CORLP: (a) varying the CPU/GPU job-mix ratio – 25C/75G, 50C/50G, 75C/25G; (b) impact of the value-decay function – linear decay vs. step decay – with 25% CPU jobs and 75% GPU jobs)

Varying CPU-GPU job mix:
• Up to 8.8x better; biased cases show very high improvement
  – More room for idle times and dynamic mapping
• 2.3x better even for the uniform mix
• Torque has no notion of value; our schemes order the jobs for yield and eliminate idle time for the resources

Impact of value-decay functions:
• Up to 3.8x better (linear decay) and up to 6.9x better (step decay)
• Shows the adaptability of the proposed schemes to different shapes of value functions
• Step decay is more coarse-grained, hence the improvement is larger

Page 48: Runtime System and Scheduling Support  for High-End CPU-GPU Architectures Vignesh  Ravi

Results with Average Yield Metric

48

(Charts: relative average yield: (a) impact of varying load – 128, 256, 384, and 512 total jobs – for TORQUE, MCT, LOR, FNOR, LNORP, and CORLP; (b) coordinated vs. uncoordinated schemes – LOR, FNOR, LNORP, CORLP – for varying CPU/GPU queue-parameter ratios, from CPU(75LE/75HD) & GPU(75LE/75HD) to CPU(25LE/25HD) & GPU(75LE/75HD))

Impact of varying load:
• As the load increases, the yield from the baselines decreases linearly
• The proposed schemes initially achieve increased yield and then sustain it, since they try to maximize the yield
• Up to 8.2x better

Coordinated vs. uncoordinated schemes:
• Why do we need coordination? Imbalance in decay rates or queuing delays across the queues
• As the imbalance increases, the improvement from CORLP increases
• Up to 78% better

Page 49: Runtime System and Scheduling Support  for High-End CPU-GPU Architectures Vignesh  Ravi

Yield Improvements from Sharing

49

(Chart: effect of the sharing K factor – yield improvement (%) vs. K from 0.1 to 0.6, the fraction of jobs to share – for CPU-only sharing, GPU-only sharing, and CPU & GPU sharing)

• The benefit from freeing a resource is always offset by the slowdown incurred by the sharing jobs
• The benefit increases up to a point and then decreases (K = 0.5 in this case)
• This emphasizes the need for careful selection of the K fraction
• Up to 23% improvement due to sharing

Page 50: Runtime System and Scheduling Support  for High-End CPU-GPU Architectures Vignesh  Ravi

Overhead of Sharing a CPU Core

50

(Chart: overhead (%) for job mixes 1–6 and their geometric mean)

• A CPU core is shared between a CPU job and a GPU job scheduled on the same node
• The overhead is within 10%
• The variation depends on the amount or frequency of data transfer/communication between the CPU and the GPU

Page 51: Runtime System and Scheduling Support  for High-End CPU-GPU Architectures Vignesh  Ravi

Summary

51

• Value-based scheduling on CPU-GPU clusters
  – Goal: improve aggregate yield
• Developed novel scheduling schemes for dynamic mapping
  – Three uncoordinated schemes
  – One coordinated scheme
• Enable automatic sharing of resources, including the GPU
  – One novel heuristic for sharing
• Framework for evaluating the proposed schemes
• Significant improvement over the state-of-the-art
  – Based on completion time & latency
  – Based on average yield

Page 52: Runtime System and Scheduling Support  for High-End CPU-GPU Architectures Vignesh  Ravi

52

Outline of Presentation

• Recap of Pre-Candidacy Work
  – Runtime System and Work Distribution
  – GPU Sharing Through Runtime Consolidation Framework
• Post-Candidacy Work
  – Concurrent Job Scheduling to Improve Global Throughput
  – Value-based Job Scheduling
• Future Work
• Thesis Conclusions

Page 53: Runtime System and Scheduling Support  for High-End CPU-GPU Architectures Vignesh  Ravi

Future Work

• Industry is moving towards integrated CPU-GPU architectures
  – Intel recently announced Sandy Bridge for servers
  – AMD opened up its HSA roadmap for APUs
• In the HPC segment, discrete CPU-GPU will continue
• Machines with an integrated GPU as well as a discrete GPU
  – For instance, the announcement of the Stampede supercomputer
  – Important to understand the benefits of one architecture over the other

53

Page 54: Runtime System and Scheduling Support  for High-End CPU-GPU Architectures Vignesh  Ravi

Future Work (contd.)

• OpenCL, an open standard for heterogeneous computing
  – Gaining momentum owing to its maturity (Spafford et al., ORNL's Scalable Heterogeneous Computing Benchmark Suite (SHOC))
  – "Write the kernel once, execute on many devices" is very attractive
  – Work distribution and communication across devices are explicit
• Build library and runtime support for OpenCL
  – Overarching goal: enable deployment of application(s) on a large cluster of heterogeneous nodes
  – A task/work-size driven approach for work distribution and scheduling
  – Tasks transparently "map to" and "scale" on multi-cores, integrated and discrete GPUs

54

Page 55: Runtime System and Scheduling Support  for High-End CPU-GPU Architectures Vignesh  Ravi

55

Outline of Presentation

• Recap of Pre-Candidacy Work
  – Runtime System and Work Distribution
  – GPU Sharing Through Runtime Consolidation Framework
• Post-Candidacy Work
  – Concurrent Job Scheduling to Improve Global Throughput
  – Value-based Job Scheduling
• Future Work
• Thesis Conclusions

Page 56: Runtime System and Scheduling Support  for High-End CPU-GPU Architectures Vignesh  Ravi

Thesis Conclusions

• Heterogeneity is the order of today's computing
• New challenges arise at the node, cluster, and cloud levels
  – Increased architectural complexity for developers
  – Lack of desired software features and mechanisms
• Runtime library support to enable various computation patterns
  – Less application-developer burden, improved performance
• Runtime Consolidation Framework to enable GPU sharing
  – Improved global throughput in heavily shared environments
• Revisited job scheduling problems
  – Novel schemes to improve global throughput
  – Novel schemes to improve market-based metrics

56

Page 57: Runtime System and Scheduling Support  for High-End CPU-GPU Architectures Vignesh  Ravi

57

Thank You! Questions?

Page 58: Runtime System and Scheduling Support  for High-End CPU-GPU Architectures Vignesh  Ravi

Benchmarks – Large Dataset

58

Benchmarks            Seq. CPU Exec. (sec)   GPU Speedup (GP)   Multicore Speedup (MP)   Dataset Characteristics
PDE Solver            7.3                    4.7                6.8                      14336*14336
Image Processing      33.8                   5.1                7.8                      14336*14336
FDTD                  8.4                    2.2                7.6                      14336*14336
BlackScholes          2.6                    2.1                7.2                      10 mil options
Binomial Options      11.8                   5.6                4.2                      1024 options
MonteCarlo            45.4                   38.4               7.9                      1024 options
Kmeans                330.0                  12.1               7.8                      1.6 * 10^9 points
KNN                   67.3                   7.8                6.2                      67108864 points
PCA                   142.0                  9.7                5.6                      262144*80
Molecular Dynamics    46.6                   12.9               7.9                      256000 nodes, 31744000 edges

Page 59: Runtime System and Scheduling Support  for High-End CPU-GPU Architectures Vignesh  Ravi

Benchmarks – Small Dataset

59

Benchmarks            Seq. CPU Exec. (sec)   GPU Speedup (GP)   Multicore Speedup (MP)   Dataset Characteristics
PDE Solver            1.8                    3.8                7.1                      7168*7168
Image Processing      8.4                    5.6                7.5                      7168*7168
FDTD                  2.1                    1.3                7.7                      7168*7168
BlackScholes          0.7                    0.6                6.8                      2.5 mil options
Binomial Options      3.0                    2.3                4.2                      128 options
MonteCarlo            11.0                   9.4                7.9                      256 options
Kmeans                74.2                   6.3                7.7                      0.4 * 10^9 points
KNN                   16.8                   2.9                6.2                      16777216 points
PCA                   33.8                   9.1                5.6                      65536*80
Molecular Dynamics    6.7                    12.8               7.3                      32000 nodes, 3968000 edges

Page 60: Runtime System and Scheduling Support  for High-End CPU-GPU Architectures Vignesh  Ravi

Benchmarks – Large No. of Iterations

60

Benchmarks            Seq. CPU Exec. (sec)   GPU Speedup (GP)   Multicore Speedup (MP)   Dataset Characteristics
PDE Solver            722.1                  4.3                8.1                      14336*14336
Image Processing      3385.5                 4.8                8.0                      14336*14336
FDTD                  423.3                  1.8                7.9                      14336*14336
BlackScholes          269.1                  92.8               7.8                      10 mil options
Binomial Options      1213.6                 12.2               4.3                      1024 options
MonteCarlo            453.3                  368.5              7.8                      1024 options
Kmeans                1593.8                 12.6               7.9                      1.6 * 10^9 points
KNN                   1691.1                 58.4               6.9                      67108864 points
PCA                   2835.7                 11.8               6.2                      262144*80
Molecular Dynamics    593.8                  20.8               7.8                      256000 nodes, 31744000 edges

Page 61: Runtime System and Scheduling Support  for High-End CPU-GPU Architectures Vignesh  Ravi

61

Frequency (%)
No. of Jobs   LOR    FNOR   LNORP   CORLP
64            15.6   12.5   15.6    18.8
128           10.9   11.7   11.7    14.8
256           9.4    12.1   10.2    15.6
512           9.6    9.6    9.0     13.1

Improvement in User Satisfaction (%)
Decay Ratio      Job Type     MCT    LOR     FNOR    LNORP   CORLP
25% H & 75% L    High Decay   5.3    78.6    84.6    83.8    104.2
                 Low Decay    6.9    35.7    45.5    47.8    54.1
50% H & 50% L    High Decay   11.7   114.3   118.8   124.7   144.4
                 Low Decay    11.8   58.8    58.9    66.1    72.2
75% H & 25% L    High Decay   14.9   69.7    73.1    86.8    107.4
                 Low Decay    16.2   20.5    22.4    31.3    33.8

(Chart: overhead (%) for job mixes 1–6 and their geometric mean)

No. of Jobs   Yield Improvement (%)
128           18.2
256           20.1
384           22.3
512           22.9