Scalable Load and Store Processing in Latency Tolerant Processors

Amit Gandhi 1,2, Haitham Akkary 1, Ravi Rajwar 1, Srikanth T. Srinivasan 1, Konrad Lai 1

1 Intel, 2 Portland State University


Problem: tolerating miss latencies

• Increasing miss latencies to memory
  – large instruction windows tolerate latencies
  – naïve window scaling impractical

• Resource efficient large instruction windows
  – sustain 1000s of instructions in-flight
  – need small register files and schedulers
  – do not address memory buffer efficiency

Must track all memory operations
Memory consistency, ordering, and forwarding


Why is this a problem?

• Memory operations tracked in load & store buffers
  – buffers require CAMs for scanning and matching
  – CAMs have high area and power requirements

• Don’t always need large memory buffers
  – L2 cache hit → small buffers sufficient
  – L2 cache miss → large buffers necessary

• Scaling CAM is difficult
• Why pay the price when not necessary?

Must eliminate CAMs from large buffers


Loads: Unordered buffer

• Hierarchical load buffers
• Conventional level one load buffer
  – effective in the absence of a miss
• Un-ordered level two load buffer
  – used only when long latency miss occurs
  – set-associative cache structure
    • no scan, only indexed lookup necessary
  – does not track precise order of loads
    • sufficient to know if violation occurred (not where)
    • checkpoint rollback
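The indexed lookup described above can be sketched as follows. This is only an illustrative model, not the authors' design: the class name `UnorderedLoadBuffer`, the index function, the set count, the associativity, and the behavior when a set is full are all assumptions.

```python
class UnorderedLoadBuffer:
    """Hypothetical model of an unordered, set-associative level-two load buffer."""

    def __init__(self, num_sets=4, ways=2):
        self.num_sets = num_sets
        self.ways = ways
        # Each set holds load addresses with no ordering information.
        self.sets = [set() for _ in range(num_sets)]

    def _index(self, addr):
        # Assumed index function: low-order bits of the word address.
        return (addr >> 2) % self.num_sets

    def record_load(self, addr):
        """Record an executed load. Returns False if the set is full
        (what the hardware does in that case is not modeled here)."""
        entries = self.sets[self._index(addr)]
        if addr in entries:
            return True
        if len(entries) >= self.ways:
            return False
        entries.add(addr)
        return True

    def store_conflicts(self, addr):
        """Indexed lookup only: reports whether some recorded load used this
        address, not which load or where it sits in program order."""
        return addr in self.sets[self._index(addr)]


buf = UnorderedLoadBuffer()
buf.record_load(0x40)
buf.record_load(0x44)
violation = buf.store_conflicts(0x40)     # a load to 0x40 was recorded
no_violation = buf.store_conflicts(0x48)  # no load to 0x48 was recorded
```

Because the buffer only answers "did a conflict occur", a reported conflict would be handled by rolling back to a checkpoint rather than by repairing an individual load.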


Stores: CAM-free buffers

• Hierarchical store queue
• Conventional level one store queue
  – effective in the absence of a miss
• CAM-free level two store queue
  – used only when long latency miss occurs
  – used only for ordering

no scanning or matching necessary in queue

Decouple ordering from forwarding

1. Redo stores to enforce order
2. Forward from cache instead of queue


Outline

• Motivation
• Resource efficient processors
  – Continual Flow Pipelines
  – memory buffer demands
• Store processing
• Results
• Summary


Implications of a miss

• Long latency misses to memory
  – place pressure on critical resources
  – pipeline quickly stalls due to blocked resources

• Large instruction window processors
  – execute useful instructions in shadow of miss
  – tolerate latency by overlapping miss with useful work
  – naïve scaling impractical

• Resource-efficient instruction windows
  – scale window to thousands
  – do not require scaled cycle-critical structures


Resource-efficient latency tolerance

Significant fraction of instructions in the shadow of a miss are independent of the miss

Exploit above program property

Treat and process miss-dependent and miss-independent instructions differently


Continual Flow Pipeline processor

• Miss dependent instructions
  – release critical resources
  – leave pipeline, and wait outside pipeline in slice buffer

• Miss independent instructions
  – execute
  – release critical resources and retire

• When miss returns
  – miss-dependent instructions re-acquire resources
  – execute and retire

• After miss-dependent instructions execute
  – results automatically integrated
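The flow above can be illustrated with a toy model. It is a sketch under simplifying assumptions, not the actual pipeline: the function name `run_cfp`, the three-operation instruction format, and the rule that every register is written exactly once are all hypothetical.

```python
def run_cfp(program, miss_value):
    """Toy model: miss-dependent instructions wait in a slice buffer while
    miss-independent ones execute. Assumes each register is written once."""
    regs = {}
    poisoned = set()      # registers whose value depends on the miss
    slice_buffer = []     # deferred (dest, op, srcs, captured ready values)
    order = []            # order in which results are produced

    for dest, op, args in program:
        srcs = args if op == "add" else ()
        if op == "miss" or any(s in poisoned for s in srcs):
            # Capture source values that are already available, then defer.
            ready = {s: regs[s] for s in srcs if s not in poisoned}
            poisoned.add(dest)
            slice_buffer.append((dest, op, srcs, ready))
        elif op == "const":
            regs[dest] = args[0]
            order.append(dest)
        else:  # independent "add"
            regs[dest] = sum(regs[s] for s in srcs)
            order.append(dest)

    # The miss returns: deferred instructions execute in program order.
    for dest, op, srcs, ready in slice_buffer:
        if op == "miss":
            regs[dest] = miss_value
        else:
            regs[dest] = sum(ready[s] if s in ready else regs[s] for s in srcs)
        order.append(dest)
    return regs, order


program = [
    ("r1", "miss", ()),            # load that misses to memory
    ("r2", "const", (5,)),         # independent
    ("r3", "add", ("r1", "r2")),   # dependent on the miss
    ("r4", "add", ("r2", "r2")),   # independent
    ("r5", "add", ("r3", "r4")),   # dependent on the miss
]
regs, order = run_cfp(program, miss_value=7)
```

In this run the independent instructions (r2, r4) produce results first, and the deferred slice (r1, r3, r5) completes once the miss value is available.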


Continual Flow Pipeline processor

• Critical resource efficient
  – don’t require large register files, large schedulers

• Need to track all memory operations
  – large load buffer → large CAM footprint and power
  – hierarchical store queue
    • small, fast L1 store queue (32 entries)
    • large, slow L2 store queue (~512 entries)
      – large CAM footprint
      – high leakage power
    • good performance


Why track all memory operations?

• Stores must update in program order
• Load/store dependence speculation
• Multiprocessor memory consistency

• Continual Flow Pipeline processors
  – execute independents ahead of dependents
  – aggressively reorder memory operations execution


Outline

• Motivation
• Resource efficient processors
• Store processing
  – store queue overview
  – SRL key idea
  – SRL workings
• Results
• Summary


Functions of a store queue

• Ordering
  – ensure memory updates are in program order
  – correctness

• Forwarding
  – provide data to subsequent loads
  – performance
  – CAM

[Figure: store queue (STQ) with address (A) and data (D) fields. A load (LD) to address Z matches a store queue entry, which forwards its data.]
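The two functions of a store queue can be sketched in software as below. This is a minimal model, not a hardware description: the class name `StoreQueue` and its methods are hypothetical, and the loop in `forward` stands in for the parallel address compare a CAM would perform.

```python
from collections import deque


class StoreQueue:
    """Hypothetical single-structure store queue: one FIFO serves both
    ordering and forwarding."""

    def __init__(self):
        self.entries = deque()  # (addr, data), oldest first

    def push(self, addr, data):
        self.entries.append((addr, data))

    def forward(self, addr):
        """Forwarding: return data from the youngest store to addr, else None.
        The search over all entries models the CAM match."""
        for a, d in reversed(self.entries):
            if a == addr:
                return d
        return None

    def drain(self, memory):
        """Ordering: update memory strictly in program order."""
        while self.entries:
            a, d = self.entries.popleft()
            memory[a] = d


stq = StoreQueue()
stq.push("X", 1)
stq.push("Z", 2)
stq.push("Z", 3)
hit = stq.forward("Z")    # youngest matching store supplies the data
miss = stq.forward("Y")   # no match: the load would read the cache instead
memory = {}
stq.drain(memory)
```

The cost of `forward` grows with queue size, which is the reason large store queues make the CAM expensive.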


Conventional store queue

• Single structure for ordering, forwarding
• Large sizes increase CAM area & leakage
  – CAM contribution to area and power dominates

Efficiency → Eliminate CAMs


Decoupling ordering from forwarding

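[Figure: the CAM-based L2 STQ, with address (A) and data (D) fields, shown next to an SRAM-based Store Redo Log and the data cache.]

Store Redo Log (SRL)
• FIFO
• Program order
• No CAM

Data Cache
• Forwarding
• No CAM

No CAMs for ordering/forwarding!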


Store Redo Log workings (1)

In shadow of a miss
• Allocate FIFO L2 store queue (SRL) entry for all stores
  – records program order for stores
• Dependent stores
  – not ready, release L1 store queue entry, and enter SRL
• Independent stores
  – update cache temporarily, and enter SRL
• Loads
  – independent loads forward from cache & retire
  – dependent loads go to slice buffer
  – do not scan L2 store queue for forwarding


Store Redo Log workings (2)

When miss returns
• Discard all independent store updates to cache
  – these stores don’t re-execute
  – their dependents don’t re-execute
• Drain the SRL in program order
  – reconstruct memory live-outs
  – program order maintained
  – no re-execution, only re-update
    • no extra cache ports required
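The two phases of Store Redo Log operation can be modeled roughly as follows. This is a sketch under stated assumptions, not the authors' implementation: the class `StoreRedoLog`, the separate `temp` dictionary standing in for temporary cache updates, and the `resolved` argument carrying data for miss-dependent stores are all hypothetical. Store addresses are assumed to be known even for dependent stores.

```python
class StoreRedoLog:
    """Hypothetical model of SRL store processing around one miss."""

    def __init__(self, cache):
        self.cache = cache  # committed cache contents: addr -> data
        self.temp = {}      # temporary updates made by independent stores
        self.log = []       # FIFO of [addr, data or None], in program order

    def store(self, addr, data=None):
        """Record a store in the shadow of a miss. data=None marks a
        miss-dependent store whose data is not ready yet."""
        self.log.append([addr, data])
        if data is not None:
            self.temp[addr] = data   # independent store updates the cache temporarily
        return len(self.log) - 1     # position in the log

    def load(self, addr):
        """Independent load: reads the cache only, never scans the log."""
        return self.temp.get(addr, self.cache.get(addr))

    def miss_returns(self, resolved):
        """Discard temporary updates, then redo every store in program order.
        resolved maps log position -> data for miss-dependent stores."""
        self.temp.clear()
        for i, (addr, data) in enumerate(self.log):
            self.cache[addr] = resolved[i] if data is None else data
        self.log.clear()


srl = StoreRedoLog({"A": 1, "B": 2})
pos = srl.store("A")          # miss-dependent store, data unknown
srl.store("B", 20)            # independent store
seen = srl.load("B")          # independent load sees the temporary update
srl.miss_returns({pos: 10})   # dependent store's data is now available
final_cache = srl.cache
```

Draining redoes the cache update for every logged store, but only the miss-dependent instructions had to execute again to produce their data.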


Hazards

• Write after Write (WAW)
• Write after Read (WAR)
• Read after Write (RAW)


Handling hazards: WAW

[Figure: WAW example. Program order: ST X, ST Y, ST X. Panels show the L1 STQ, the SRL, and the cache (X and Y) before and after the miss returns.]
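A small worked example may help show why redoing stores in program order resolves the WAW hazard. The values are illustrative, and treating the first ST X as the miss-dependent store is an assumption; both function names are hypothetical.

```python
def late_dependent_update(cache, stores):
    """Without redo: independent stores update the cache first and the
    miss-dependent store writes only after the miss returns."""
    result = dict(cache)
    for addr, data, dependent in stores:
        if not dependent:
            result[addr] = data
    for addr, data, dependent in stores:
        if dependent:
            result[addr] = data
    return result


def srl_drain(cache, stores):
    """With redo: temporary updates are discarded and every store is
    re-applied to the cache in program order."""
    result = dict(cache)
    for addr, data, _ in stores:
        result[addr] = data
    return result


# Program order: ST X (miss-dependent), ST Y, ST X
stores = [("X", 12, True), ("Y", 17, False), ("X", 5, False)]
cache = {"X": 38, "Y": 2}

wrong = late_dependent_update(cache, stores)  # older store overwrites younger one
right = srl_drain(cache, stores)              # youngest store to X wins
```

Here the late update leaves X holding the older store's value, while the program-order drain leaves X holding the value of the youngest store.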


Handling hazards: WAR

[Figure: WAR example. Program order: LD X, ST X, ST Y. Panels show the L1 STQ, L1 LDQ, slice buffer, SRL, and cache (X and Y) before and after the miss returns.]
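The WAR case can be traced with a short sketch. It assumes that LD X is miss-dependent and that, once the miss returns, the deferred load and the logged stores are replayed in program order; the function name, sequence numbers, and data structures are hypothetical.

```python
def replay_war_example():
    cache = {"X": 38, "Y": 2}
    temp = {}          # temporary cache updates from independent stores
    srl = []           # (seq, addr, data) in program order
    slice_buffer = []  # (seq, addr) for miss-dependent loads

    # In the shadow of the miss: LD X is deferred, the stores run ahead.
    slice_buffer.append((0, "X"))
    for seq, addr, data in [(1, "X", 5), (2, "Y", 17)]:
        temp[addr] = data
        srl.append((seq, addr, data))
    speculative_x = temp.get("X", cache["X"])  # younger store's value is visible

    # Miss returns: discard temporary updates, then replay in program order.
    temp.clear()
    replay = [(s, "LD", a, None) for s, a in slice_buffer]
    replay += [(s, "ST", a, d) for s, a, d in srl]
    replay.sort(key=lambda item: item[0])

    loaded = {}
    for seq, kind, addr, data in replay:
        if kind == "LD":
            loaded[seq] = cache[addr]  # reads the value from before the younger store
        else:
            cache[addr] = data         # store redone from the SRL
    return speculative_x, loaded, cache


speculative_x, loaded, cache_after = replay_war_example()
```

Discarding the temporary update before replay is what lets the older load read the pre-store value even though the younger store had already written the cache.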


Handling hazards: RAW

• Detect by snooping completed stores
• Restart execution in case of violations
  – restore to checkpoint
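RAW detection with checkpoint recovery might be sketched like this. It is a simplified illustration: the function name, the use of a plain set for addresses read by loads that executed early, and the dictionary standing in for checkpointed state are all assumptions.

```python
def complete_stores(checkpoint, early_load_addrs, completed_stores):
    """Snoop each completing store against addresses already read by loads
    that executed ahead of it. Returns (state, violated)."""
    state = dict(checkpoint)
    for addr, data in completed_stores:
        if addr in early_load_addrs:
            # A younger load read this address too early: restore the
            # checkpoint so execution can restart from there.
            return dict(checkpoint), True
        state[addr] = data
    return state, False


checkpoint = {"X": 38, "Y": 2}

# A load to X ran ahead of an older store to X: violation, roll back.
state_a, violated_a = complete_stores(checkpoint, {"X"}, [("Y", 17), ("X", 12)])

# No early load touched the stored addresses: stores complete normally.
state_b, violated_b = complete_stores(checkpoint, {"Z"}, [("Y", 17), ("X", 12)])
```

The check only needs to know that a violation happened, since recovery restarts from the checkpoint rather than from the offending load.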


Outline

• Motivation
• Latency tolerant processor background
• Store processing
• Results
• Summary


Evaluation

• Ideal store queue
  – large L1 STQ (Latency = 3 cycles)
  – gives upper-bound (impractical to build)

• Hierarchical store queue
  – L1 STQ (Latency = 3 cycles)
  – L2 STQ (with CAMs) (Latency = 8 cycles)

• SRL store processing
  – L1 STQ (Latency = 3 cycles)
  – FIFO CAM-free Store Redo Log

• Baseline
  – L1 STQ (Latency = 3 cycles)


SRL performance

[Figure: bar chart of % speedup over baseline (y-axis 0 to 30) for SFP2K, SINT2K, WEB, MM, PROD, SERVER, and WS, comparing SRL store processing, Hierarchical STQ, and Ideal STQ.]

Performance within 6% of ideal store queue


Power and area comparison

• Hierarchical store queue
  – 90nm CMOS technology
  – SPICE simulations
  – circuit optimized to reduce leakage power
  – banked structure to reduce dynamic power

• SRL over Hierarchical STQ
  – more than 50% reduction in leakage power
  – more than 90% reduction in dynamic power
  – 75% reduction in the area


Summary

• CAM-free secondary structures
• Set-associative L2 Load buffer
• FIFO L2 Store queue
  – Don’t constantly enforce order
  – Ensure correct order by redoing the stores
• 75% area and 50% leakage power savings
• No CAM scalable design
