Feedback Directed Prefetching

Santhosh Srinath, Onur Mutlu, Hyesoon Kim, Yale N. Patt


Problem

Prefetching can significantly improve performance when prefetches are accurate and timely.

However, prefetching can also significantly degrade performance due to:
- Memory bandwidth impact
- Pollution of the cache

HPCA-13 Feedback Directed Prefetching 2

Solution

Feedback Directed Prefetching is a comprehensive mechanism that reduces the negative effects of prefetching and improves its positive effects.

Outline

- Background and Motivation
- Feedback Directed Prefetching (FDP)
  - Metrics and how to collect them
  - How to adapt
    - Prefetcher aggressiveness
    - Cache insertion policy for prefetches
- Results

Background (Prefetcher Aggressiveness)

[Figure: an access stream at block X and the predicted stream ahead of it, annotated with Prefetch Distance and Prefetch Degree (labels X, X+1, P, Pmax), shown for the Very Conservative, Middle of the Road, and Very Aggressive configurations.]

Background (Prefetcher Aggressiveness) (2)

Very Aggressive
- Runs well ahead of the load access stream
- Hides memory access latency better
- More speculative

Very Conservative
- Stays closer to the load access stream
- Might not hide memory access latency completely
- Reduces the potential for cache pollution and bandwidth contention
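The interplay of the two knobs can be sketched in a few lines. This is only an illustration under simplifying assumptions: a single ascending stream measured in cache blocks, with the hypothetical function `prefetch_targets` and its parameter names invented for this example rather than taken from the talk.

```python
def prefetch_targets(demand_block, last_prefetched, distance, degree):
    """Return the block addresses to prefetch on one trigger (simplified model).

    demand_block:    block address of the triggering demand access (X)
    last_prefetched: highest block address already prefetched (P)
    distance:        how far ahead of X the prefetcher is allowed to run
    degree:          maximum number of prefetches issued per trigger
    """
    limit = demand_block + distance                  # furthest block allowed
    start = max(last_prefetched, demand_block) + 1   # next block not yet requested
    stop = min(start + degree, limit + 1)            # respect both degree and distance
    return list(range(start, stop))


# Very Conservative (distance 4, degree 1) issues one block per trigger.
conservative = prefetch_targets(100, 100, distance=4, degree=1)
# Very Aggressive (distance 64, degree 4) issues four blocks per trigger.
aggressive = prefetch_targets(100, 100, distance=64, degree=4)
```

A larger distance lets the prefetcher run further ahead of the demand stream, and a larger degree lets it get there in fewer triggers; both make the requests more speculative.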


Motivation

[Figure: Instructions per Cycle for the No Prefetching, Very Conservative, Middle-of-the-Road, and Very Aggressive configurations, with callouts of 48% and 29%.]

- Very Aggressive prefetching improves average performance by 84%
- However, it can also significantly reduce performance on some benchmarks


Feedback Directed Prefetching

A comprehensive mechanism that takes into account:
- Prefetcher accuracy
- Prefetcher lateness
- Prefetcher-caused cache pollution

Adapts:
- Prefetcher aggressiveness
- Cache insertion policy for prefetches

Metrics

Prefetch Accuracy

Prefetch Lateness

Prefetcher-caused Cache Pollution


Prefetch Accuracy

Useful prefetches are those referenced by demand requests while resident in the L2 cache.

$$\text{Prefetcher Accuracy} = \frac{\text{Number of Useful Prefetches}}{\text{Number of Prefetches Sent to Memory}}$$

Low accuracy makes it more likely that prefetching reduces performance.

[Figure: percentage IPC change over No Prefetching (y-axis, -100% to 400%) versus Prefetcher Accuracy (x-axis, 0 to 1).]

Prefetch Accuracy

Implementation
- A pref-bit is added to each L2 tag-store entry
- Accuracy is tracked using two counters: pref_total and used_total

$$\text{Prefetcher Accuracy} = \frac{\text{used\_total}}{\text{pref\_total}}$$
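The bookkeeping can be sketched in software as follows. This is a minimal model, not the hardware design: a Python dictionary stands in for the pref-bits of the L2 tag store, and the class and method names (`AccuracyTracker`, `prefetch`, `demand_hit`, `evict`) are hypothetical. Only the counter names `pref_total` and `used_total` come from the slide.

```python
class AccuracyTracker:
    """Software model of accuracy tracking with a pref-bit per L2 block."""

    def __init__(self):
        self.pref_total = 0   # prefetches sent to memory
        self.used_total = 0   # prefetched blocks later referenced by a demand request
        self.pref_bit = {}    # block address -> pref-bit (stand-in for the L2 tag store)

    def prefetch(self, block):
        """A prefetch for `block` is sent to memory and later fills the L2."""
        self.pref_total += 1
        self.pref_bit[block] = True

    def demand_hit(self, block):
        """A demand request hits `block` in the L2."""
        if self.pref_bit.get(block):
            self.used_total += 1
            self.pref_bit[block] = False   # count each prefetched block at most once

    def evict(self, block):
        """`block` leaves the L2; its pref-bit goes with it."""
        self.pref_bit.pop(block, None)

    def accuracy(self):
        return self.used_total / self.pref_total if self.pref_total else 0.0


tracker = AccuracyTracker()
for block in (1, 2, 3, 4):
    tracker.prefetch(block)
for block in (1, 2, 3, 3):    # block 3 is hit twice but counted once
    tracker.demand_hit(block)
tracker.evict(4)              # block 4 is evicted without ever being used
```

Clearing the pref-bit on the first demand hit is what keeps a heavily reused prefetched block from inflating `used_total`.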


Prefetch Lateness

- A measure of how timely prefetches are
- Used to determine whether increasing the aggressiveness helps

$$\text{Prefetch Lateness} = \frac{\text{Number of Late Prefetches}}{\text{Number of Useful Prefetches}}$$

Implementation
- A pref-bit is added to each L2 MSHR entry
- New counter: late_total

$$\text{Prefetch Lateness} = \frac{\text{late\_total}}{\text{used\_total}}$$
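A small software model can make the MSHR pref-bit concrete. It rests on two assumptions that the slide does not state: a prefetch counts as late when a demand request finds it still in flight in the MSHR, and a late prefetch is also counted as useful. The class and method names are hypothetical; only `late_total` and `used_total` come from the slide.

```python
class LatenessTracker:
    """Software model of lateness tracking with pref-bits in the MSHR and the L2."""

    def __init__(self):
        self.used_total = 0
        self.late_total = 0
        self.mshr_pref_bit = {}   # in-flight block -> pref-bit
        self.l2_pref_bit = {}     # resident block  -> pref-bit

    def issue_prefetch(self, block):
        """Allocate an MSHR entry for a prefetch request."""
        self.mshr_pref_bit[block] = True

    def fill(self, block):
        """Data returns from memory; the block moves from the MSHR into the L2."""
        self.l2_pref_bit[block] = self.mshr_pref_bit.pop(block, False)

    def demand_access(self, block):
        if self.mshr_pref_bit.get(block):
            # The demand arrived before the prefetched data: late (assumed also useful).
            self.late_total += 1
            self.used_total += 1
            self.mshr_pref_bit[block] = False
        elif self.l2_pref_bit.get(block):
            # The prefetched block was already in the L2: timely and useful.
            self.used_total += 1
            self.l2_pref_bit[block] = False

    def lateness(self):
        return self.late_total / self.used_total if self.used_total else 0.0


late = LatenessTracker()
for block in (1, 2, 3, 4):
    late.issue_prefetch(block)
for block in (1, 2, 3):
    late.fill(block)          # blocks 1-3 arrive before they are needed
for block in (1, 2, 3, 4):
    late.demand_access(block) # block 4 is still in flight when demanded
```

A high lateness ratio suggests the prefetcher is predicting correctly but not running far enough ahead, which is the case where more aggressiveness helps.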


Prefetcher-caused Cache Pollution

- A measure of the disturbance caused by prefetched data in the cache
- Used to determine whether the prefetcher is evicting useful data from the cache

$$\text{Prefetcher-caused Cache Pollution} = \frac{\text{Number of Demand Misses caused by the Prefetcher}}{\text{Number of Demand Misses}}$$

Prefetcher-caused Cache Pollution (2)

Hardware implementation
- Insight: the measurement does not need to be exact
- Track pollution using a pollution filter
  - Based on the Bloom filter concept
  - A bit is set when a prefetch evicts a block that was brought in by a demand miss
  - A bit is reset when a prefetch is serviced
- Two counters: pollution_total and demand_total

$$\text{Prefetcher-caused Cache Pollution} = \frac{\text{pollution\_total}}{\text{demand\_total}}$$
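The filter can be sketched as a one-bit-per-entry table indexed by a hash of the block address. The hash in `_index` below is a placeholder assumption, as are the class and method names; the 4096-entry size matches the figure given on the Hardware Cost slide. Because different addresses can share an entry, the result is approximate, which is consistent with the insight that it does not need to be exact.

```python
class PollutionFilter:
    """Approximate tracking of demand misses caused by prefetch evictions."""

    def __init__(self, entries=4096):
        self.bits = [False] * entries
        self.pollution_total = 0   # demand misses attributed to the prefetcher
        self.demand_total = 0      # all demand misses

    def _index(self, block):
        # Placeholder hash (an assumption): fold upper address bits onto lower ones.
        n = len(self.bits)
        return (block ^ (block // n)) % n

    def prefetch_evicts(self, victim_block):
        """A prefetch fill evicts a block that was brought in by a demand miss."""
        self.bits[self._index(victim_block)] = True

    def prefetch_serviced(self, block):
        """A prefetch for `block` is serviced, so a later miss on it is not pollution."""
        self.bits[self._index(block)] = False

    def demand_miss(self, block):
        self.demand_total += 1
        if self.bits[self._index(block)]:
            self.pollution_total += 1

    def pollution(self):
        return self.pollution_total / self.demand_total if self.demand_total else 0.0


pf = PollutionFilter()
pf.prefetch_evicts(10)         # a prefetch pushes demand-fetched block 10 out
for block in (10, 20, 30, 40):
    pf.demand_miss(block)      # only the miss on block 10 is blamed on the prefetcher
```

Aliasing can both overcount and undercount pollution, which is the price of using 4096 bits instead of remembering every evicted address.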



How to adapt? Prefetcher Aggressiveness

A Dynamic Configuration Counter holds the current aggressiveness level:

Counter value   Aggressiveness        Distance   Degree
1               Very Conservative     4          1
2               Conservative          8          1
3               Middle-of-the-Road    16         2
4               Aggressive            32         4
5               Very Aggressive       64         4
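The table maps directly onto a lookup keyed by the counter value. In the sketch below the distance and degree values are copied from the table; the names `AGGRESSIVENESS` and `step` are hypothetical, and the assumption that the counter saturates at 1 and 5 rather than wrapping around is ours.

```python
# Counter value -> (aggressiveness name, prefetch distance, prefetch degree)
AGGRESSIVENESS = {
    1: ("Very Conservative", 4, 1),
    2: ("Conservative", 8, 1),
    3: ("Middle-of-the-Road", 16, 2),
    4: ("Aggressive", 32, 4),
    5: ("Very Aggressive", 64, 4),
}


def step(counter, delta):
    """Move the counter by delta (+1, 0, or -1), saturating at the table bounds."""
    return max(1, min(5, counter + delta))


counter = 3                                    # start at Middle-of-the-Road
counter = step(counter, +1)                    # an "increase" decision
name, distance, degree = AGGRESSIVENESS[counter]
```

Each adaptation decision only nudges the counter by one level, so the prefetcher moves gradually between configurations.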


How to adapt? Prefetcher Aggressiveness (2)

For the current phase, classify the following based on static thresholds:
- Accuracy
- Lateness
- Cache pollution caused by prefetches

Then adjust the aggressiveness:
- High accuracy
  - Late: Increase (improve timeliness)
  - Not-Late and Polluting: Decrease (reduce cache pollution)
- Medium accuracy
  - Not-Polluting and Late: Increase
  - Polluting: Decrease
- Low accuracy
  - Not-Polluting and Not-Late: No Change
  - Otherwise: Decrease (reduce memory bandwidth usage and cache pollution)
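The decision cases on this slide can be written as a small function. The sketch takes the three classifications as already computed, since the slide does not give the threshold values. Two cases are not spelled out on the slide (high or medium accuracy with prefetches that are neither late nor polluting); treating them as "no change" is an assumption, and the function name `adjust` is hypothetical.

```python
def adjust(accuracy, late, polluting):
    """Return +1 (increase), 0 (no change) or -1 (decrease) aggressiveness.

    accuracy:  "high", "medium" or "low"
    late:      True if prefetches are classified as late
    polluting: True if prefetches are classified as polluting
    """
    if accuracy == "high":
        if late:
            return +1                       # improve timeliness
        return -1 if polluting else 0       # 0 here is an assumption
    if accuracy == "medium":
        if polluting:
            return -1                       # reduce cache pollution
        return +1 if late else 0            # 0 here is an assumption
    if accuracy == "low":
        if not late and not polluting:
            return 0
        return -1                           # reduce bandwidth usage and pollution
    raise ValueError(f"unknown accuracy class: {accuracy!r}")


decision = adjust("medium", late=True, polluting=False)
```

The asymmetry is deliberate: lateness only argues for more aggressiveness when accuracy is good enough to make the extra requests worthwhile.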


How to adapt? Cache Insertion Policy for Prefetches

Why adapt?
- To reduce the potential for cache pollution

Classify cache pollution based on static thresholds:
- Low: insert at the MID (n/2) position. E.g., for a 16-way cache, MID = 8 in the LRU stack
- Medium: insert at the LRU-4 (n/4) position. E.g., for a 16-way cache, LRU-4 = 4 in the LRU stack
- High: insert at the LRU position
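The mapping from pollution class to insertion position is simple enough to state as code. The sketch assumes positions are counted from the LRU end of the stack (0 = LRU), which matches the 16-way examples on the slide; the function name `insertion_position` is hypothetical.

```python
def insertion_position(pollution, ways=16):
    """Return the LRU-stack position (0 = LRU end) for a prefetched block."""
    if pollution == "low":
        return ways // 2      # MID position
    if pollution == "medium":
        return ways // 4      # LRU-4 position
    if pollution == "high":
        return 0              # LRU position: first candidate for eviction
    raise ValueError(f"unknown pollution class: {pollution!r}")


positions = {level: insertion_position(level) for level in ("low", "medium", "high")}
```

The more the prefetcher pollutes, the closer to the LRU end its blocks are placed, so useless prefetches are evicted sooner and displace less demand-fetched data.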


Evaluation Methodology

- Execution-driven Alpha simulator
- Aggressive out-of-order superscalar processor
- 1 MB, 16-way, 10-cycle unified L2 cache
- 500-cycle minimum main memory latency
- Detailed memory model

Prefetchers modeled:
- Stream prefetcher tracking 64 different streams
- Global History Buffer prefetcher (in paper)
- PC-based stride prefetcher (in paper)

Results: Adjusting Only Aggressiveness

- 4.7% higher average IPC than the Very Aggressive configuration
- Most of the performance losses have been eliminated

Results: Adjusting Only Cache Insertion Policy

Dynamic insertion is:
- 5.1% better than inserting prefetches at the MRU position
- 1.9% better than inserting prefetches at the LRU-4 position

[Figure: Instructions per Cycle with the Very Aggressive prefetcher for No Prefetching, LRU, LRU-4, MID, MRU, and Dynamic Insertion.]

Results: Putting it all together (FDP)

- 6.5% IPC improvement over the Very Aggressive configuration
- Performance losses converted to performance gains (chart callouts: 11% and 13%)

Bandwidth Impact

BPKI: memory bus accesses per 1000 retired instructions. It includes the effects of L2 demand misses as well as pollution-induced misses and prefetches.

        No Pref.   Very Cons.   Mid     Very Aggr.   FDP
IPC     0.85       1.21         1.47    1.57         1.67
BPKI    8.56       9.34         10.60   13.38        10.88

- FDP significantly improves bandwidth efficiency
- Compared to Very Aggressive: 6.5% higher performance and 18.7% less bandwidth
- Compared to Middle-of-the-Road: 13.6% higher performance with similar bandwidth usage
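The headline percentages can be checked against the table with a few lines of arithmetic. The dictionary keys and the helper `pct_change` are names chosen for this example. Because the table entries are rounded to two decimals, the recomputed IPC gain over Very Aggressive comes out near 6.4% rather than the quoted 6.5%.

```python
ipc = {"No Pref.": 0.85, "Very Cons.": 1.21, "Mid": 1.47, "Very Aggr.": 1.57, "FDP": 1.67}
bpki = {"No Pref.": 8.56, "Very Cons.": 9.34, "Mid": 10.60, "Very Aggr.": 13.38, "FDP": 10.88}


def pct_change(new, old):
    """Percentage change of `new` relative to `old`."""
    return (new - old) / old * 100


ipc_vs_aggr = pct_change(ipc["FDP"], ipc["Very Aggr."])     # roughly +6.4
bw_vs_aggr = pct_change(bpki["FDP"], bpki["Very Aggr."])    # roughly -18.7
ipc_vs_mid = pct_change(ipc["FDP"], ipc["Mid"])             # roughly +13.6
bw_vs_mid = pct_change(bpki["FDP"], bpki["Mid"])            # roughly +2.6
```

The last figure is what "similar bandwidth usage" refers to: FDP uses only a few percent more bus accesses than Middle-of-the-Road while delivering noticeably higher IPC.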


Hardware Cost

Component                Count                  Cost
pref-bits for L2 cache   16384 blocks           16384 bits
Pollution Filter         4096 entries * 1 bit   4096 bits
16-bit counters          11 counters            176 bits
pref-bits for MSHR       128 entries            128 bits

- Total hardware cost: 20784 bits = 2.54 KB
- Area overhead relative to the baseline 1 MB L2 cache: 2.5 KB / 1024 KB = 0.24%
- NOT on the critical path
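The total follows from adding up the four components, as the short check below shows. The variable names are chosen for this example; the bit counts are the ones listed on the slide.

```python
cost_bits = {
    "pref-bits for L2 cache": 16384 * 1,   # one bit per L2 block
    "pollution filter": 4096 * 1,          # 4096 one-bit entries
    "16-bit counters": 11 * 16,            # eleven counters
    "pref-bits for MSHR": 128 * 1,         # one bit per MSHR entry
}

total_bits = sum(cost_bits.values())
total_kb = total_bits / 8 / 1024           # bits -> bytes -> KB
overhead_pct = total_kb / 1024 * 100       # relative to a 1 MB (1024 KB) L2 cache
```

Almost 80% of the cost is the pref-bit per L2 block; the pollution filter and counters are small by comparison. Computed without rounding the size down to 2.5 KB first, the overhead is just under 0.25%.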


Contributions

- Comprehensive and low-cost feedback mechanism for hardware prefetchers
  - Uses: prefetcher accuracy, prefetcher lateness, prefetcher-caused cache pollution
  - Adapts: aggressiveness, cache insertion policy for prefetches
- 6.5% higher performance and 18.7% less bandwidth compared to Very Aggressive prefetching
  - Eliminates the negative impact of prefetching
- Applicable to any data prefetch algorithm

Questions?


Backups


FDP vs Prefetch Cache

- Prefetch caches eliminate prefetcher-induced cache pollution
- However, prefetches are then limited to the size of the prefetch cache
- FDP delivers 5.3% higher performance than Very Aggressive with a 32KB prefetch cache
- FDP is within 2% of Very Aggressive with a 64KB prefetch cache
- Memory bandwidth of FDP is 16% less than with the 32KB prefetch cache and 9% less than with the 64KB prefetch cache

Performance on Other Prefetch Algorithms

Global History Buffer prefetcher
- 20.8% less memory bandwidth than Very Aggressive with similar performance
- 9.9% better performance than Middle-of-the-Road with similar bandwidth usage

PC-based stride prefetcher
- 4% better performance than Very Aggressive
- 24% reduction in bandwidth usage

IPC Performance

Dynamic Prefetcher Accuracy

Prefetch Lateness

Pollution Filter

Thresholds

Prefetches Sent

Distribution of dynamic aggressiveness level

Distribution of insertion position of prefetched blocks

Effect of FDP on memory bandwidth consumption

Performance of prefetch cache vs. FDP

Bandwidth consumption of prefetch cache vs. FDP

Effect of FDP on GHB

Effect of FDP on GHB (Bandwidth)

Effect of varying L2 size and memory latency

IPC on other benchmarks

BPKI on other benchmarks
