
Page 1: Application-aware Memory System for Fair and Efficient Execution of Concurrent GPGPU Applications

Application-aware Memory System for Fair and Efficient Execution of Concurrent GPGPU Applications

Adwait Jog1, Evgeny Bolotin2, Zvika Guz2,a, Mike Parker2,b, Steve Keckler2,3, Mahmut Kandemir1, Chita Das1

Penn State1, NVIDIA2, UT Austin3, now at (Samsunga, Intelb)

GPGPU Workshop @ ASPLOS 2014

Page 2: Application-aware Memory System for Fair and Efficient Execution of Concurrent GPGPU Applications

Era of Throughput Architectures

GPUs are scaling: number of CUDA cores, DRAM bandwidth

GTX 275 (Tesla): 240 cores (127 GB/sec)
GTX 480 (Fermi): 448 cores (139 GB/sec)
GTX 780 Ti (Kepler): 2880 cores (336 GB/sec)

Page 3: Application-aware Memory System for Fair and Efficient Execution of Concurrent GPGPU Applications

Prior Approach (Looking Back)

Execute one kernel at a time. Works great, if the kernel has enough parallelism.

[Diagram: a single application occupying all SMs (SM-1 through SM-X), sharing the interconnect, cache, and memory]

Page 4: Application-aware Memory System for Fair and Efficient Execution of Concurrent GPGPU Applications

Current Trend

What happens when kernels do not have enough threads? Execute multiple kernels (from the same application/context) concurrently.

CURRENT ARCHITECTURES (Fermi, Kepler) SUPPORT THIS FEATURE

Page 5: Application-aware Memory System for Fair and Efficient Execution of Concurrent GPGPU Applications

Future Trend (Looking Forward)

[Diagram: SMs partitioned among applications (Application-1 on SM-1 … SM-A, Application-2 on SM-A+1 … SM-B, …, Application-N up to SM-X), all sharing the interconnect, cache, and memory]

We study execution of multiple kernels from multiple applications (contexts).

Page 6: Application-aware Memory System for Fair and Efficient Execution of Concurrent GPGPU Applications

Why Multiple Applications (Contexts)?

Improves overall GPU throughput

Improves portability of multiple old apps (with limited thread-scalability) on newer scaled GPUs

Supports consolidation of requests from multiple users onto the same GPU

Page 7: Application-aware Memory System for Fair and Efficient Execution of Concurrent GPGPU Applications

We study two application scenarios:

1. One application runs alone on a 60-SM GPU (Alone_60)

[Diagram: a single application occupying SM-1 through SM-60, sharing the interconnect, cache, and memory]

2. Co-scheduling two apps, assuming equal partitioning: 30 SM + 30 SM

[Diagram: Application-1 on SM-1 through SM-30 and Application-2 on SM-31 through SM-60, sharing the interconnect, cache, and memory]

Page 8: Application-aware Memory System for Fair and Efficient Execution of Concurrent GPGPU Applications

Metrics

Instruction Throughput (sum of IPCs):
IPC(App1) + IPC(App2) + … + IPC(AppN)

Weighted Speedup (with co-scheduling):
Speedup(App-N) = Co-scheduled IPC(App-N) / Alone IPC(App-N)
Weighted Speedup = sum of the speedups of ALL apps

Best case: Weighted Speedup = N (number of apps)
With destructive interference: Weighted Speedup can be between 0 and N
Time-slicing (running alone): Weighted Speedup = 1 (baseline)
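To make these metrics concrete, a minimal Python sketch (the helper names are ours, not from the talk), assuming per-application IPCs are measured both alone on the full GPU and when co-scheduled:

```python
# Illustrative sketch of the two metrics used in the talk.

def instruction_throughput(ipc_coscheduled):
    # Sum of IPCs of all co-scheduled applications.
    return sum(ipc_coscheduled)

def weighted_speedup(ipc_coscheduled, ipc_alone):
    # Sum of per-application speedups relative to running alone.
    # N apps with no interference -> N; pure time-slicing -> 1.
    return sum(co / alone for co, alone in zip(ipc_coscheduled, ipc_alone))

# Example: two apps that each retain 70% of their alone IPC -> 1.4,
# matching the HIST+DGEMM example later in the talk.
print(weighted_speedup([280.0, 175.0], [400.0, 250.0]))  # 1.4
```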

Page 9: Application-aware Memory System for Fair and Efficient Execution of Concurrent GPGPU Applications

Outline

Introduction and motivation

Positives and negatives of co-scheduling multiple applications

Understanding inefficiencies in memory-subsystem

Proposed DRAM scheduler for better performance and fairness

Evaluation

Conclusions

Page 10: Application-aware Memory System for Fair and Efficient Execution of Concurrent GPGPU Applications

Positives of co-scheduling multiple apps

Weighted Speedup = 1.4, when HIST is concurrently executed with DGEMM

40% improvement over running alone (time-slicing)

Gain in weighted speedup (application throughput)

[Charts: weighted speedup of the co-scheduled HIST+DGEMM workload against the baseline of 1; individual speedups of HIST (Alone_60 vs. with DGEMM) and DGEMM (Alone_60 vs. with HIST)]

Page 11: Application-aware Memory System for Fair and Efficient Execution of Concurrent GPGPU Applications

Negatives of co-scheduling multiple apps (1): (A) Fairness

Unequal performance degradation indicates unfairness in the system.

[Chart: HIST speedup when co-scheduled with DGEMM vs. with GUPS]

Page 12: Application-aware Memory System for Fair and Efficient Execution of Concurrent GPGPU Applications

Negatives of co-scheduling multiple apps (2): (B) Weighted speedup (application throughput)

GAUSS+GUPS: only a 2% improvement in weighted speedup over running alone.

With destructive interference, weighted speedup can be anywhere between 0 and 2 (it can also drop below the baseline of 1).

[Chart: weighted speedup of HIST+DGEMM, HIST+GUPS, and GAUSS+GUPS against the baseline of 1]

Page 13: Application-aware Memory System for Fair and Efficient Execution of Concurrent GPGPU Applications

Summary: Positives and Negatives

[Chart: weighted speedup of the 14 two-application workloads (hist_gauss, hist_gups, hist_bfs, hist_3ds, hist_dgemm, gauss_gups, gauss_bfs, gauss_3ds, gauss_dgemm, gups_bfs, gups_3ds, gups_dgemm, bfs_3ds, bfs_dgemm), split into 1st-app and 2nd-app contributions, against the baseline of 1]

Highlighted workloads exhibit unfairness (imbalance between the red and green portions) and low throughput. Naïve coupling of two apps is probably not a good idea.

Page 14: Application-aware Memory System for Fair and Efficient Execution of Concurrent GPGPU Applications

Outline

Introduction and motivation

Positives and negatives of co-scheduling multiple applications

Understanding inefficiencies in memory-subsystem

Proposed DRAM scheduler for better performance and fairness

Evaluation

Conclusions

Page 15: Application-aware Memory System for Fair and Efficient Execution of Concurrent GPGPU Applications

Primary Sources of Inefficiencies

Application interference at many levels: L2 caches, interconnect, and DRAM (the primary focus of this work).

[Diagram: SMs partitioned among Application-1 through Application-N, all sharing the interconnect, cache, and memory]

Page 16: Application-aware Memory System for Fair and Efficient Execution of Concurrent GPGPU Applications

Bandwidth Distribution

Bandwidth-intensive applications (e.g., GUPS) take the majority of the memory bandwidth.

[Chart: percentage of peak DRAM bandwidth consumed by the 1st app, 2nd app, wasted BW, and idle BW for each pairing, with HIST, GAUSS, GUPS, BFS, 3DS, and DGEMM each appearing as the 1st app, compared against alone_30 and alone_60]

The red portion is the fraction of wasted DRAM cycles during which no data is transferred over the bus.

Page 17: Application-aware Memory System for Fair and Efficient Execution of Concurrent GPGPU Applications

Revisiting Fairness and Throughput

[Chart: weighted speedup of the 14 two-application workloads, split into 1st-app and 2nd-app contributions, against the baseline of 1]

Imbalance between the green and red portions indicates unfairness.

Page 18: Application-aware Memory System for Fair and Efficient Execution of Concurrent GPGPU Applications

Current Memory Scheduling Schemes

Agnostic to the different requirements of memory requests coming from different applications. This leads to:
– Unfairness
– Sub-optimal performance

They primarily focus on improving DRAM efficiency.

Page 19: Application-aware Memory System for Fair and Efficient Execution of Concurrent GPGPU Applications

Commonly Employed Memory Scheduling Schemes

[Timeline diagrams: requests R1, R2, R3 (to Row-1, Row-2, Row-3) from App-1 and App-2, served by a single DRAM bank]

Simple FCFS: requests are issued in arrival order, forcing frequent row switches and a low DRAM page hit rate.

Out-of-order (FR-FCFS): requests to the currently open row are issued first, so fewer row switches and a high DRAM page hit rate.

Both schedulers are application agnostic! (App-2 suffers)
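As a rough sketch of these two baseline policies (our own simplified Python pseudocode, not the simulator's implementation), assuming each DRAM bank keeps a queue of (application, row) requests in arrival order:

```python
from collections import namedtuple

# Simplified per-bank request: which app issued it and which DRAM row it targets.
Request = namedtuple("Request", ["app", "row"])

def fcfs_pick(queue, open_row):
    # Simple FCFS: always issue the oldest request, ignoring the open row,
    # so interleaved apps cause frequent row switches (low page hit rate).
    return queue[0] if queue else None

def fr_fcfs_pick(queue, open_row):
    # FR-FCFS: first-ready wins, i.e. the oldest request hitting the open row;
    # otherwise fall back to the oldest request overall.
    # High page hit rate, but still application agnostic.
    if not queue:
        return None
    for req in queue:
        if req.row == open_row:
            return req
    return queue[0]
```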

Page 20: Application-aware Memory System for Fair and Efficient Execution of Concurrent GPGPU Applications

Outline

Introduction and motivation

Positives and negatives of co-scheduling multiple applications

Understanding inefficiencies in memory-subsystem

Proposed DRAM scheduler for better performance and fairness

Evaluation

Conclusions

Page 21: Application-aware Memory System for Fair and Efficient Execution of Concurrent GPGPU Applications

Proposed Application-Aware Scheduler

Proposal: FR-FCFS (baseline) → FR-(RR)-FCFS (proposed)

As an example of adding application-awareness: instead of FCFS order, schedule requests across applications in round-robin fashion, while preserving the page hit rates. This improves both fairness and performance.

Page 22: Application-aware Memory System for Fair and Efficient Execution of Concurrent GPGPU Applications

Proposed Application-Aware FR-(RR)-FCFS Scheduler

[Timeline diagrams: baseline FR-FCFS vs. proposed FR-(RR)-FCFS serving requests R1, R2, R3 (to Row-1, Row-2, Row-3) from App-1 and App-2 on a single bank]

App-2 is scheduled after App-1 in round-robin order, with the same number of row switches as the baseline.
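A minimal sketch of the proposed idea under the same simplified model (our interpretation, not the authors' code): row hits are still served first, so the page hit rate is preserved, but row conflicts are broken by rotating across applications instead of taking the globally oldest request.

```python
def fr_rr_fcfs_pick(queue, open_row, rr_state, num_apps):
    # FR-(RR)-FCFS sketch: open-row hits are still served first (same page
    # hit rate as FR-FCFS); row conflicts are resolved round-robin across
    # applications so one bandwidth-heavy app cannot monopolize the bank.
    # rr_state is a one-element list holding the app whose turn is next.
    if not queue:
        return None
    for req in queue:                      # first-ready: open-row hit
        if req.row == open_row:
            return req
    for offset in range(num_apps):         # round-robin over applications
        app = (rr_state[0] + offset) % num_apps
        for req in queue:                  # oldest request from that app
            if req.app == app:
                rr_state[0] = (app + 1) % num_apps  # advance the RR pointer
                return req
    return queue[0]
```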

Page 23: Application-aware Memory System for Fair and Efficient Execution of Concurrent GPGPU Applications

DRAM Page Hit-Rates

[Chart: DRAM page hit rates (30%–90%) of FR-FCFS vs. FR-RR-FCFS across the 14 two-application workloads]

Same Page Hit-Rates as Baseline (FR-FCFS)

Page 24: Application-aware Memory System for Fair and Efficient Execution of Concurrent GPGPU Applications

Outline

Introduction and motivation

Positives and negatives of co-scheduling multiple applications

Understanding inefficiencies in memory-subsystem

Proposed DRAM scheduler for better performance and fairness

Evaluation

Conclusions

Page 25: Application-aware Memory System for Fair and Efficient Execution of Concurrent GPGPU Applications

Simulation Environment

GPGPU-Sim (v3.2.1)

Kernels from multiple applications are issued to different concurrent CUDA Streams

14 two-application workloads considered with varying memory demands

Baseline configuration similar to a scaled-up GTX 480: 60 SMs, 32 SIMT lanes, 32 threads/warp; 16KB L1 (4-way, 128B cache blocks) + 48KB shared memory per SM; 6 memory partitions/channels (total bandwidth: 177.6 GB/sec)

Page 26: Application-aware Memory System for Fair and Efficient Execution of Concurrent GPGPU Applications

Improvement in Fairness

Fairness Index = max(r1, r2), where r1 = Speedup(App1) / Speedup(App2) and r2 = Speedup(App2) / Speedup(App1). Lower is better.
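For clarity, a minimal Python sketch of this fairness index (1.0 means both applications slow down equally):

```python
def fairness_index(speedup_app1, speedup_app2):
    # Max of the two speedup ratios; 1.0 = perfectly balanced slowdowns.
    return max(speedup_app1 / speedup_app2, speedup_app2 / speedup_app1)

# Example: alone-normalized speedups of 0.8 and 0.4 give an index of 2.0.
print(fairness_index(0.8, 0.4))  # 2.0
```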

[Chart: fairness index (lower is better) of FR-FCFS vs. FR-RR-FCFS across the 14 two-application workloads]

On average, a 7% improvement (up to 49%) in fairness.

Significantly reduces the negative impact of bandwidth-sensitive applications (e.g., GUPS) on the overall fairness of the GPU system.

Page 27: Application-aware Memory System for Fair and Efficient Execution of Concurrent GPGPU Applications

Improvement in Performance (Normalized to FR-FCFS)

[Charts: normalized weighted speedup and normalized instruction throughput of FR-RR-FCFS relative to FR-FCFS across the 14 two-application workloads]

On average, a 10% improvement (up to 64%) in instruction throughput and up to a 7% improvement in weighted speedup.

Significantly reduces the negative impact of bandwidth-sensitive applications (e.g., GUPS) on the overall performance of the GPU system.

Page 28: Application-aware Memory System for Fair and Efficient Execution of Concurrent GPGPU Applications

Bandwidth Distribution with Proposed Scheduler

[Chart: percentage of peak DRAM bandwidth (1st app, 2nd app, wasted BW, idle BW) for HIST, GAUSS, and 3DS each paired with GUPS, comparing alone_30, alone_60, FR-FCFS, and FR-RR-FCFS]

Lighter applications get a larger share of the DRAM bandwidth.

Page 29: Application-aware Memory System for Fair and Efficient Execution of Concurrent GPGPU Applications

Conclusions

Naïve coupling of applications is probably not a good idea: co-scheduled applications interfere in the memory subsystem, resulting in sub-optimal performance and fairness.

Current DRAM schedulers are agnostic to applications and treat all memory requests equally.

An application-aware memory system is required for enhanced performance and superior fairness.

Page 30: Application-aware Memory System for Fair and Efficient Execution of Concurrent GPGPU Applications


Thank You!

Questions?