Utility-Based Partitioning of Shared Caches
Moinuddin K. Qureshi, Yale N. Patt
International Symposium on Microarchitecture (MICRO) 2006


Page 1: Utility-Based Partitioning of Shared Caches

Utility-Based Partitioning of Shared Caches

Moinuddin K. Qureshi    Yale N. Patt

International Symposium on Microarchitecture (MICRO) 2006

Page 2: Utility-Based Partitioning of Shared Caches


Introduction

CMP and shared caches are common

Applications compete for the shared cache

Partitioning policies critical for high performance

Traditional policies:

o Equal (half-and-half): performance isolation, but no adaptation

o LRU: demand based, but demand ≠ benefit (e.g. streaming)

Page 3: Utility-Based Partitioning of Shared Caches


Background

Utility U(a, b) = Misses with a ways – Misses with b ways

[Figure: misses per 1000 instructions vs. num ways from a 16-way 1MB L2, for three application types: Low Utility, High Utility, and Saturating Utility]
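The utility definition above can be sketched in a few lines. The miss curves below are hypothetical illustrative numbers, not data from the paper:

```python
def utility(misses, a, b):
    """U(a, b) = misses with a ways - misses with b ways (b > a)."""
    return misses[a] - misses[b]

# misses[k] = misses per 1000 instructions with k ways allocated (0..16);
# hypothetical curves shaped like the slide's high- and low-utility apps
high_utility = [100, 60, 35, 20, 12, 8, 6, 5, 4, 4, 3, 3, 3, 3, 3, 3, 3]
low_utility  = [30, 29, 29, 28, 28, 28, 27, 27, 27,
                27, 27, 27, 27, 27, 27, 27, 27]

print(utility(high_utility, 1, 8))  # large gain from extra ways: 56
print(utility(low_utility, 1, 8))   # little gain from extra ways: 2
```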

Page 4: Utility-Based Partitioning of Shared Caches


Motivation

[Figure: misses per 1000 instructions (MPKI) vs. num ways from a 16-way 1MB L2 for equake and vpr, with the LRU and UTIL allocation points marked]

Improve performance by giving more cache to the application that benefits more from cache

Page 5: Utility-Based Partitioning of Shared Caches


Outline

Introduction and Motivation
Utility-Based Cache Partitioning
Evaluation
Scalable Partitioning Algorithm
Related Work and Summary

Page 6: Utility-Based Partitioning of Shared Caches


Framework for UCP

Three components:

Utility Monitors (UMON) per core

Partitioning Algorithm (PA)

Replacement support to enforce partitions

[Diagram: Core1 and Core2, each with private I$ and D$ and a per-core utility monitor (UMON1, UMON2), share an L2 cache backed by main memory; the partitioning algorithm (PA) reads both UMONs and sets the partition]

Page 7: Utility-Based Partitioning of Shared Caches


Utility Monitors (UMON)

For each core, simulate the LRU policy using an Auxiliary Tag Directory (ATD)

Hit counters in ATD to count hits per recency position

LRU is a stack algorithm: hit counts give utility directly, e.g. hits(2 ways) = H0 + H1

[Diagram: the Main Tag Directory (MTD) and the Auxiliary Tag Directory (ATD) each hold the same sets; the ATD feeds hit counters H0 (MRU) through H15 (LRU), one per recency position]
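The stack property above means the hits an app would see with k ways is just the sum of its first k hit counters. A minimal sketch, with hypothetical counter values:

```python
def hits_with_ways(hit_counters, k):
    """Hits the app would get with k ways = sum of counters for
    recency positions 0..k-1 (valid because LRU is a stack algorithm)."""
    return sum(hit_counters[:k])

# H0 (MRU) .. H15 (LRU) hit counters from one UMON (hypothetical values)
H = [50, 30, 12, 6, 3, 2, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

print(hits_with_ways(H, 2))   # H0 + H1 = 80
print(hits_with_ways(H, 16))  # hits with the whole cache
```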

Page 8: Utility-Based Partitioning of Shared Caches


Dynamic Set Sampling (DSS)

Extra tags incur hardware and power overhead

DSS reduces overhead [Qureshi+ ISCA’06]

32 sets sufficient (analytical bounds)

Storage < 2kB per UMON

[Diagram: the MTD tracks every set, while the sampled ATD in the UMON tracks only a few sets (e.g. B, E, G), feeding hit counters H0 (MRU) through H15 (LRU)]

Page 9: Utility-Based Partitioning of Shared Caches


Partitioning algorithm

Evaluate all possible partitions and select the best

With a ways to core1 and (16 – a) ways to core2:

  Hits_core1 = H0 + H1 + … + H(a–1)        (from UMON1)
  Hits_core2 = H0 + H1 + … + H(16–a–1)     (from UMON2)

Select the a that maximizes (Hits_core1 + Hits_core2)

Partitioning done once every 5 million cycles
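The two-core evaluate-all-partitions step can be sketched as below; the hit-counter lists are hypothetical H0..H15 values from each core's UMON:

```python
def best_partition(h1, h2, ways=16):
    """Try every split of the ways and keep the one with most combined hits."""
    best_a, best_hits = 1, -1
    for a in range(1, ways):            # give each core at least one way
        hits = sum(h1[:a]) + sum(h2[:ways - a])
        if hits > best_hits:
            best_a, best_hits = a, hits
    return best_a, best_hits

core1 = [40, 25, 15, 10, 5, 2, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]  # cache-friendly
core2 = [20, 5, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]     # streaming-like

a, hits = best_partition(core1, core2)
print(a, hits)  # the cache-friendly app gets most of the ways
```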

Page 10: Utility-Based Partitioning of Shared Caches


Way Partitioning

Way partitioning support: [Suh+ HPCA’02, Iyer ICS’04]

1. Each line has core-id bits

2. On a miss, count ways_occupied in the set by the miss-causing app

If ways_occupied < ways_given: victim is the LRU line from the other app
Otherwise: victim is the LRU line from the miss-causing app
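The victim-selection rule can be sketched as follows. The set layout here is a hypothetical list of (core_id, age) pairs, where a larger age means closer to LRU:

```python
def pick_victim(cache_set, missing_core, ways_given):
    """Enforce the partition: evict from the other app while the
    miss-causing app is under its quota, else evict its own LRU line."""
    ways_occupied = sum(1 for core, _ in cache_set if core == missing_core)
    if ways_occupied < ways_given:
        candidates = [w for w in cache_set if w[0] != missing_core]
    else:
        candidates = [w for w in cache_set if w[0] == missing_core]
    return max(candidates, key=lambda w: w[1])  # oldest candidate = LRU

# 4-way set: core 0 holds 3 ways, core 1 holds 1; core 1's quota is 2 ways
cache_set = [(0, 3), (0, 1), (1, 2), (0, 0)]
print(pick_victim(cache_set, missing_core=1, ways_given=2))  # core 0's LRU line
```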

Page 11: Utility-Based Partitioning of Shared Caches


Outline

Introduction and Motivation
Utility-Based Cache Partitioning
Evaluation
Scalable Partitioning Algorithm
Related Work and Summary

Page 12: Utility-Based Partitioning of Shared Caches


Methodology

Configuration:
o Two cores: 8-wide, 128-entry window, private L1s
o L2: shared, unified, 1MB, 16-way, LRU-based
o Memory: 400 cycles, 32 banks

Benchmarks: two-threaded workloads divided into 5 categories by weighted speedup for the baseline; 20 workloads used (four from each type)

[Figure: baseline weighted speedup (1.0 to 2.0) for the workload categories]

Page 13: Utility-Based Partitioning of Shared Caches


Metrics

Three metrics for performance:

1. Weighted Speedup (default metric)
   perf = IPC1/SingleIPC1 + IPC2/SingleIPC2
   Correlates with reduction in execution time

2. Throughput
   perf = IPC1 + IPC2
   Can be unfair to a low-IPC application

3. Hmean-fairness
   perf = hmean(IPC1/SingleIPC1, IPC2/SingleIPC2)
   Balances fairness and performance
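The three metrics can be computed directly from per-app IPCs. The numbers below are hypothetical; SingleIPC is each app's IPC when it runs alone with the whole cache:

```python
def weighted_speedup(ipc, single):
    return sum(i / s for i, s in zip(ipc, single))

def throughput(ipc):
    return sum(ipc)

def hmean_fairness(ipc, single):
    """Harmonic mean of the per-app speedups."""
    speedups = [i / s for i, s in zip(ipc, single)]
    return len(speedups) / sum(1 / s for s in speedups)

ipc, single = [0.9, 0.3], [1.2, 0.5]  # hypothetical shared vs. alone IPCs
print(weighted_speedup(ipc, single))  # 0.75 + 0.6 = 1.35
print(throughput(ipc))                # 0.9 + 0.3 = 1.2
print(hmean_fairness(ipc, single))    # hmean(0.75, 0.6)
```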

Page 14: Utility-Based Partitioning of Shared Caches


Results for weighted speedup

UCP improves average weighted speedup by 11%

Page 15: Utility-Based Partitioning of Shared Caches


Results for throughput

UCP improves average throughput by 17%

Page 16: Utility-Based Partitioning of Shared Caches


Results for hmean-fairness

UCP improves average hmean-fairness by 11%

Page 17: Utility-Based Partitioning of Shared Caches


Effect of Number of Sampled Sets

Dynamic Set Sampling (DSS) reduces overhead, not benefits

[Figure: performance with 8 sets, 16 sets, 32 sets, and all sets sampled]

Page 18: Utility-Based Partitioning of Shared Caches


Outline

Introduction and Motivation
Utility-Based Cache Partitioning
Evaluation
Scalable Partitioning Algorithm
Related Work and Summary

Page 19: Utility-Based Partitioning of Shared Caches


Scalability issues

Time complexity of partitioning is low for two cores (number of possible partitions ≈ number of ways)

Possible partitions increase exponentially with cores

For a 32-way cache, possible partitions:
o 4 cores: 6545
o 8 cores: 15.4 million

The general problem is NP-hard, so a scalable partitioning algorithm is needed
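The partition counts above can be checked with the standard stars-and-bars formula: splitting W ways among N cores (allowing a core zero ways) gives C(W + N − 1, N − 1) possibilities:

```python
from math import comb

def num_partitions(ways, cores):
    """Number of ways to split `ways` cache ways among `cores` cores,
    allowing zero ways per core: C(ways + cores - 1, cores - 1)."""
    return comb(ways + cores - 1, cores - 1)

print(num_partitions(32, 2))  # ≈ number of ways for two cores
print(num_partitions(32, 4))  # 6545
print(num_partitions(32, 8))  # 15,380,937 ≈ 15.4 million
```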

Page 20: Utility-Based Partitioning of Shared Caches


Greedy Algorithm [Stone+ ToC ’92]

GA allocates 1 block to the app that has the max utility for one block, repeating until all blocks are allocated

Optimal partitioning when utility curves are convex

Pathological behavior for non-convex curves

[Figure: misses per 100 instructions vs. num ways from a 32-way 2MB L2]

Page 21: Utility-Based Partitioning of Shared Caches


Problem with Greedy Algorithm

[Figure: misses (0 to 100) vs. blocks assigned (0 to 8) for apps A and B]

In each iteration, the utility for 1 block: U(A) = 10 misses, U(B) = 0 misses

Problem: GA considers the benefit only of the immediate block, so it fails to exploit large gains that lie further ahead. All blocks get assigned to A, even though B achieves the same miss reduction with fewer blocks.
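The pathology can be reproduced with a small sketch of the greedy algorithm. The miss curves are hypothetical, shaped like the slide's example: A saves 10 misses per block, while B saves nothing until its third block, which saves 80 at once:

```python
def greedy(miss_curves, total_blocks):
    """Each iteration: give one block to the app with the best
    one-block utility (the slide's GA)."""
    alloc = [0] * len(miss_curves)
    for _ in range(total_blocks):
        gains = [curve[a] - curve[a + 1]          # utility of one more block
                 for curve, a in zip(miss_curves, alloc)]
        alloc[gains.index(max(gains))] += 1
    return alloc

A = [100, 90, 80, 70, 60, 50, 40, 30, 20]    # 10 misses saved per block
B = [100, 100, 100, 20, 20, 20, 20, 20, 20]  # big drop only at the 3rd block

print(greedy([A, B], 8))  # greedy gives every block to A: [8, 0]
```

B's one-block gain is always 0 from where greedy stands, so greedy never discovers the 80-miss drop three blocks ahead.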

Page 22: Utility-Based Partitioning of Shared Caches


Lookahead Algorithm

Marginal Utility (MU) = utility per unit of cache resource: MU(a, b) = U(a, b) / (b – a)

GA considers MU for 1 block only; LA considers MU for all possible allocations

Select the app that has the max MU and allocate it as many blocks as required to reach that max MU

Repeat until all blocks are assigned
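The lookahead steps can be sketched as below, reusing the hypothetical A/B miss curves from the greedy example:

```python
def lookahead(miss_curves, total_blocks):
    """Each round: for every app find the best MU(a, b) = U(a, b)/(b - a)
    over all feasible targets b, then give the winning app b - a blocks."""
    alloc = [0] * len(miss_curves)
    remaining = total_blocks
    while remaining > 0:
        best = None  # (mu, app, extra_blocks)
        for app, curve in enumerate(miss_curves):
            a = alloc[app]
            for b in range(a + 1, a + remaining + 1):
                mu = (curve[a] - curve[b]) / (b - a)
                if best is None or mu > best[0]:
                    best = (mu, app, b - a)
        _, app, extra = best
        alloc[app] += extra
        remaining -= extra
    return alloc

A = [100, 90, 80, 70, 60, 50, 40, 30, 20]    # 10 misses saved per block
B = [100, 100, 100, 20, 20, 20, 20, 20, 20]  # big drop only at the 3rd block

print(lookahead([A, B], 8))  # B gets its 3 blocks, A the other 5: [5, 3]
```

Unlike greedy, the first round sees MU(B) = 80/3 > MU(A) = 10/1 and hands B all three blocks at once, reaching the optimal split.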

Page 23: Utility-Based Partitioning of Shared Caches


Lookahead Algorithm (example)

Time complexity ≈ ways²/2 (512 ops for 32 ways)

[Figure: misses (0 to 100) vs. blocks assigned (0 to 8) for apps A and B]

Iteration 1: MU(A) = 10/1 block, MU(B) = 80/3 blocks. B gets 3 blocks.

Next five iterations: MU(A) = 10/1 block, MU(B) = 0. A gets 1 block each time.

Result: A gets 5 blocks and B gets 3 blocks (optimal)

Page 24: Utility-Based Partitioning of Shared Caches


Results for partitioning algorithms

Four cores sharing a 2MB 32-way L2

[Figure: weighted speedup of LRU, UCP(Greedy), UCP(Lookahead), and UCP(EvalAll) on four 4-core mixes: Mix1 (gap-applu-apsi-gzp), Mix2 (swm-glg-mesa-prl), Mix3 (mcf-applu-art-vrtx), Mix4 (mcf-art-eqk-wupw)]

LA performs similar to EvalAll, with low time-complexity

Page 25: Utility-Based Partitioning of Shared Caches

25

Outline

Introduction and Motivation
Utility-Based Cache Partitioning
Evaluation
Scalable Partitioning Algorithm
Related Work and Summary

Page 26: Utility-Based Partitioning of Shared Caches


Related work

Zhou+ [ASPLOS’04]: Perf += 11%, Storage += 64kB/core
Suh+ [HPCA’02]: Perf += 4%, Storage += 32B/core
UCP: Perf += 11%, Storage += 2kB/core

[Chart: performance (low to high) vs. overhead (low to high) for the three schemes]

UCP is both high-performance and low-overhead

Page 27: Utility-Based Partitioning of Shared Caches


Summary

CMP and shared caches are common

Partition shared caches based on utility, not demand

UMON estimates utility at runtime with low overhead

UCP improves performance:

o Weighted speedup by 11%
o Throughput by 17%
o Hmean-fairness by 11%

Lookahead algorithm is scalable to many cores sharing a highly associative cache

Page 28: Utility-Based Partitioning of Shared Caches


Questions

Page 29: Utility-Based Partitioning of Shared Caches


DSS Bounds with Analytical Model

Us = sampled mean (num ways allocated by DSS)
Ug = global mean (num ways allocated by Global)

P = P(Us within 1 way of Ug)

By Chebyshev's inequality: P ≥ 1 – variance/n, where n = number of sampled sets

In general, variance ≤ 3
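Plugging the slide's numbers into the bound shows why 32 sampled sets suffice:

```python
def dss_bound(n, variance=3):
    """Chebyshev-style bound from the slide: P(sampled mean within
    1 way of global mean) >= 1 - variance/n, with variance <= 3."""
    return 1 - variance / n

for n in (8, 16, 32):
    print(n, dss_bound(n))  # 32 sets give P >= 0.90625
```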


Page 30: Utility-Based Partitioning of Shared Caches

Phase-Based Adaptation of UCP

Page 31: Utility-Based Partitioning of Shared Caches


Galgel – concave utility

[Figure: miss curves for galgel, twolf, and parser]

Page 32: Utility-Based Partitioning of Shared Caches


LRU as a stack algorithm