
Cache Replacement Policy Using Map-based Adaptive Insertion


Yasuo Ishii1,2, Mary Inaba1, and Kei Hiraki1

1 The University of Tokyo, 2 NEC Corporation

Introduction

Modern computers have a multi-level cache system

Performance improvement of the last-level cache (LLC) is the key to achieving high performance

The LLC stores many dead blocks; eliminating dead blocks in the LLC improves system performance

[Figure: memory hierarchy — CORE, L1, L2, LLC (L3), Memory]

Introduction

Many multi-core systems adopt a shared LLC

A shared LLC raises issues: thrashing caused by other threads and fairness of the shared resource

Dead-block elimination is even more effective for multi-core systems

[Figure: multi-core memory hierarchy — CORE1 ... COREN, each with private L1 and L2 caches, sharing the LLC (L3) and Memory]

Trade-offs of Prior Works

Policy               Replacement Algorithm            Dead-block Elimination   Additional HW Cost
LRU                  Insert to MRU                    None                     None
DIP [2007 Qureshi+]  Partially random insertion       Some                     Several counters (light)
LRF [2009 Xiang+]    Predicts from reference pattern  Strong                   Shadow tag, PHT (heavy)

Problem of dead-block prediction: inefficient use of the data structure (cf. shadow tag)

Map-based Data Structure

[Figure: shadow tag vs. map-based history for tracking accesses over a memory zone]

Shadow tag: 40 bits per tag, i.e., a cost of 40 bits per tracked cache line
Map-based history: one 40-bit zone tag plus 1 bit per line; e.g., 15.3 bits/line (= (40 b + 6 b) / 3 accessed lines)

A map-based data structure improves cost-efficiency when there is spatial locality
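
As a quick check of the arithmetic above, here is a small C snippet. It only reproduces the slide's example (a 6-line zone with 3 accessed lines and 40-bit tags); it is an illustration, not part of the hardware design.

    #include <stdio.h>

    int main(void) {
        /* Shadow tag: one 40-bit tag per tracked cache line */
        double shadow_tag_cost = 40.0;

        /* Map-based history: one 40-bit zone tag plus 1 bit per line,
         * amortized over the 3 accessed lines of the 6-line zone */
        double map_cost = (40.0 + 6.0) / 3.0;

        printf("shadow tag : %.1f bit/line\n", shadow_tag_cost); /* 40.0 */
        printf("map-based  : %.1f bit/line\n", map_cost);        /* 15.3 */
        return 0;
    }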

Map-based Adaptive Insertion (MAIP)

Modifies the insertion position, ordered from low to high reuse possibility: (1) cache bypass, (2) LRU position, (3) middle of MRU/LRU, (4) MRU position

Adopts a map-based data structure to track many memory accesses

Exploits two localities for reuse possibility estimation

Hardware Implementation

Memory access map: collects memory access history & memory reuse history
Bypass filter table: collects the data reuse frequency of memory access instructions
Reuse possibility estimation: estimates reuse possibility from the information of the other components

[Figure: block diagram — the Memory Access Map and the Bypass Filter Table feed memory access information to the Estimation Logic, which supplies the insertion position to the Last Level Cache]

Memory Access Map (1)

[Figure: state diagram — each line of a zone starts in the Init (I) state; the first touch moves it to the Access (A) state, and an access to a line already in the Access state is detected as data reuse. Map fields: MapTag, AccessCount, ReuseCount]

Detects one piece of information: (1) data reuse — was the accessed line previously touched?

Memory Access Map (2)

[Figure: access map entry with its AccessCount and ReuseCount counters]

Detects one statistic: (2) spatial locality — how often are the neighboring lines reused?
Attaches counters to the map to detect spatial locality

Data reuse metric = Reuse Count / Access Count
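
A minimal C sketch of one memory access map entry as described on the last two slides: a 1-bit Init/Access state per line of the zone plus the AccessCount and ReuseCount counters. The counter widths and the 64-bit word layout of the state bits are assumptions for illustration.

    #include <stdbool.h>
    #include <stdint.h>

    #define LINES_PER_MAP 256                  /* one entry tracks 16KB = 64B x 256 */

    typedef struct {
        uint64_t tag;                          /* MapTag: identifies the tracked zone   */
        uint64_t accessed[LINES_PER_MAP / 64]; /* 1 bit per line: Init(0) / Access(1)   */
        uint16_t access_count;                 /* AccessCount                            */
        uint16_t reuse_count;                  /* ReuseCount                             */
    } access_map_entry;

    /* Update on an access to line `idx` of the zone; returns true on data reuse
     * (the line was already in the Access state). */
    static bool touch_line(access_map_entry *e, unsigned idx) {
        uint64_t bit = 1ULL << (idx % 64);
        bool reused = (e->accessed[idx / 64] & bit) != 0;
        e->accessed[idx / 64] |= bit;          /* Init -> Access on first touch */
        e->access_count++;
        if (reused) e->reuse_count++;
        return reused;
    }

    /* Data reuse metric = Reuse Count / Access Count (spatial locality of reuse). */
    static double reuse_metric(const access_map_entry *e) {
        return e->access_count ? (double)e->reuse_count / e->access_count : 0.0;
    }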

Memory Access Map (3)

Implementation: maps are stored in a cache-like structure
Cost-efficiency: each entry has 256 line states and tracks 16KB of memory (16KB = 64B x 256 states)
Requires ~1.2 bits to track one cache line in the best case

[Figure: the memory address is split into MapTag, MapIndex, MapOffset, and CacheOffset; the tag comparison selects an access map entry with its Access Count and Reuse Count]
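
A minimal C sketch of the address decomposition implied by this slide: 64B cache lines (CacheOffset) and 256 lines per map entry (MapOffset), so one entry covers 16KB. The 4-bit index is an assumption derived from the 192-entry, 12-way configuration used in the evaluation (16 sets); it is illustrative, not a stated parameter.

    #include <stdint.h>

    #define CACHE_OFFSET_BITS 6   /* 64B cache line                               */
    #define MAP_OFFSET_BITS   8   /* 256 lines per zone (16KB)                    */
    #define MAP_INDEX_BITS    4   /* 16 sets x 12 ways = 192 entries (assumption) */

    static inline unsigned map_offset(uint64_t addr) {   /* which line in the zone   */
        return (addr >> CACHE_OFFSET_BITS) & ((1u << MAP_OFFSET_BITS) - 1);
    }

    static inline unsigned map_index(uint64_t addr) {    /* which set of the map table */
        return (addr >> (CACHE_OFFSET_BITS + MAP_OFFSET_BITS))
               & ((1u << MAP_INDEX_BITS) - 1);
    }

    static inline uint64_t map_tag(uint64_t addr) {      /* compared within the set  */
        return addr >> (CACHE_OFFSET_BITS + MAP_OFFSET_BITS + MAP_INDEX_BITS);
    }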

Bypass Filter Table

Each entry is a saturating counter: count up on data reuse, count down on first touch
The table is indexed by the program counter (8-bit counters x 512 entries)
The counter value classifies the instruction, from rarely reused to frequently reused: BYPASS, USELESS, NORMAL, USEFUL, REUSE

Detects one statistic: (3) temporal locality — how often does the instruction reuse data?
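
A minimal C sketch of the bypass filter table: one 8-bit saturating counter per memory access instruction, 512 entries, PC-indexed. The simple modulo indexing and the thresholds that split the counter range into the five classes are assumptions for illustration; the slide only names the classes.

    #include <stdint.h>

    typedef enum { BF_BYPASS, BF_USELESS, BF_NORMAL, BF_USEFUL, BF_REUSE } bf_class;

    static uint8_t bypass_filter[512];               /* 8-bit x 512 entries        */

    static void bf_update(uint64_t pc, int reused) { /* reused = 1 on data reuse   */
        uint8_t *c = &bypass_filter[pc % 512];
        if (reused) { if (*c < 255) (*c)++; }        /* count up on data reuse     */
        else        { if (*c > 0)   (*c)--; }        /* count down on first touch  */
    }

    static bf_class bf_classify(uint64_t pc) {
        uint8_t c = bypass_filter[pc % 512];
        if (c <  32) return BF_BYPASS;               /* rarely reused              */
        if (c <  96) return BF_USELESS;
        if (c < 160) return BF_NORMAL;
        if (c < 224) return BF_USEFUL;
        return BF_REUSE;                             /* frequently reused          */
    }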

Reuse Possibility Estimation Logic

Uses the two localities & data reuse information:
Data reuse: hit/miss of the corresponding LLC lookup and the corresponding state of the memory access map
Spatial locality of data reuse: reuse frequency of the neighboring lines
Temporal locality of the memory access instruction: reuse frequency of the corresponding instruction

Combines this information to decide the insertion policy (see the pseudocode in the Q & A appendix)

Additional Optimization: Adaptive Dedicated Set Reduction (ADSR)

Enhancement of set dueling [2007 Qureshi+]
Reduces the number of dedicated sets when PSEL is strongly biased; the released dedicated sets become additional follower sets

[Figure: sets 0-7 partitioned into LRU dedicated sets, MAIP dedicated sets, and follower sets; under ADSR some dedicated sets become additional followers]
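
A minimal C sketch of set dueling with ADSR. The slides give only the idea (a 10-bit PSEL and fewer dedicated sets when PSEL is strongly biased), so everything else here — the sampling of dedicated sets, the bias margin, the release scheme — is an illustrative assumption, not the paper's mechanism.

    #include <stdbool.h>
    #include <stdint.h>

    #define PSEL_BITS   10
    #define PSEL_MAX    ((1 << PSEL_BITS) - 1)
    #define BIAS_MARGIN 64   /* hypothetical band that counts as "strongly biased" */

    static uint16_t psel = (PSEL_MAX + 1) / 2;

    typedef enum { FOLLOW_PSEL, DEDICATED_LRU, DEDICATED_MAIP } set_role;

    static bool psel_strongly_biased(void) {
        return psel < BIAS_MARGIN || psel > PSEL_MAX - BIAS_MARGIN;
    }

    /* Pick dedicated sample sets from the low index bits (an illustrative mapping).
     * With ADSR, half of the dedicated sets are released to act as extra followers
     * whenever PSEL is strongly biased. */
    static set_role role_of_set(uint32_t set_index) {
        bool released  = psel_strongly_biased() && (set_index & 0x40);
        uint32_t group = set_index & 0x3F;
        if (group == 0 && !released) return DEDICATED_LRU;
        if (group == 1 && !released) return DEDICATED_MAIP;
        return FOLLOW_PSEL;
    }

    /* A miss in a dedicated set votes against its policy. */
    static void on_llc_miss(uint32_t set_index) {
        switch (role_of_set(set_index)) {
        case DEDICATED_LRU:  if (psel < PSEL_MAX) psel++; break;
        case DEDICATED_MAIP: if (psel > 0)        psel--; break;
        default: break;
        }
    }

    /* Follower sets (and released dedicated sets) use MAIP insertion when the
     * LRU sample sets are missing more often. */
    static bool follower_uses_maip(void) {
        return psel > (PSEL_MAX + 1) / 2;
    }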

Evaluation

Benchmarks: SPEC CPU2006, compiled with GCC 4.2; evaluates 100M instructions (skips 40G instructions)

MAIP configuration (per-core resources): memory access map: 192 entries, 12-way; bypass filter: 512 entries, 8-bit counters; policy selection counter: 10 bits

Evaluates DIP & TADIP-F for comparison

Cache Miss Count (1-core)

MAIP reduces MPKI by 8.3% from LRU; OPT reduces MPKI by 18.2% from LRU

[Figure: misses per 1000 instructions under LRU, DIP, MAIP, and OPT for each SPEC CPU2006 benchmark and the average]

Speedup (1-core & 4-core)

[Figure: 1-core speedup over LRU (DIP vs. MAIP) per SPEC CPU2006 benchmark with geometric mean, and 4-core weighted speedup over LRU (TADIP vs. MAIP) for 4-benchmark mixes with geometric mean]

Cost Efficiency of Memory Access Map

Requires 1.9 bits per line on average, about 20 times better than a shadow tag

Covers >1.00MB (the LLC size) in 9 of 18 benchmarks

Covers >0.25MB (the MLC size) in 14 of 18 benchmarks

[Figure: covered area (MB) of the memory access map per benchmark and on average]

Related Work

Uses spatial / temporal locality: using spatial locality [1997, Johnson+]; using different types of locality [1995, González+]

Prediction-based dead-block elimination: dead-block prediction [2001, Lai+]; Less Reused Filter [2009, Xiang+]

Modified insertion policy: Dynamic Insertion Policy [2007, Qureshi+]; Thread-Aware DIP [2008, Jaleel+]

Conclusion

Map-based Adaptive Insertion Policy (MAIP)
Map-based data structure: about 20x more cost-effective than a shadow tag
Reuse possibility estimation exploiting spatial locality & temporal locality
Improves performance over LRU/DIP

Evaluated MAIP with a simulation study
Reduces the cache miss count by 8.3% from LRU
Improves IPC by 2.1% in 1-core and by 9.1% in 4-core

Comparison

Policy               Replacement Algorithm             Dead-block Elimination   Additional HW Cost
LRU                  Insert to MRU                     None                     None
DIP [2007 Qureshi+]  Partially random insertion        Some                     Several counters (light)
LRF [2009 Xiang+]    Predicts from reference pattern   Strong                   Shadow tag, PHT (heavy)
MAIP                 Predicts based on two localities  Strong                   Memory access map (medium)

MAIP improves cost-efficiency with the map-based data structure
MAIP improves prediction accuracy with the two localities

Q & A

How to Detect Insertion Position

function is_bypass()
    // Sb: state of the bypass filter entry for the memory access instruction
    // Ca, Cr: Access Count and Reuse Count of the corresponding memory access map
    if (Sb = BYPASS) return true
    if (Ca > 16 x Cr) return true    // reuse is rare around this address
    return false
endfunction

function get_insert_position()
    // Position 15 is the LRU end, position 0 is the MRU end
    // Hm: hit of the corresponding lookup
    integer ins_pos = 15
    if (Hm)           ins_pos = ins_pos / 2
    if (Cr > Ca)      ins_pos = ins_pos / 2
    if (Sb = REUSE)   ins_pos = 0
    if (Sb = USEFUL)  ins_pos = ins_pos / 2
    if (Sb = USELESS) ins_pos = 15
    return ins_pos
endfunction