Using Virtual Load/Store Queues (VLSQs) to Reduce the Negative Effects of Reordered Memory Instructions
Aamer Jaleel and Bruce Jacob
Electrical and Computer Engineering, University of Maryland, College Park
{ajaleel, blj}@eng.umd.edu


A. Jaleel and B. Jacob. “Using Virtual Load/Store Queues to Reduce the Negative Effects of Reordered Memory Instructions”

Paper Motivation
• Maximizing Application ILP:
  – OoO performance depends on the size of the instruction window or reorder buffer (ROB)
  – Improve ILP with larger ROB sizes
• Before This Paper:
  – Many studies have shown large performance gains with large ROBs
  – Most have discounted real effects in the memory subsystem


Paper Contributions
• Uncovering A Problem:
  – Increasing OoO capability degrades memory system performance
    • Increase in replay traps
    • Increase in L1 cache misses
• The Reason:
  – OoO scheduler reordering memory instructions
• The Solution:
  – Restrict reordering of memory instructions
  – Virtual Load/Store Queue (VLSQ)


Background – Replay Traps
• Hardware events to ensure correct execution order of memory instructions
• Types of Replay Traps
  – Load-Store Replay Trap
  – Wrong-Size Replay Trap
  – Load-Load Replay Trap
  – Load-Miss Load Replay Trap

[Figure: one example instruction sequence per trap type, listed here by the leading number on the slide, with the slide's parenthesized number kept as shown; the diagrams also carry processor labels P1 and P2.]
Load-Store Replay: 1. LD BYTE A (1); 2. ST BYTE A (3); 3. LD BYTE A (2); 4. LD BYTE B (4)
Wrong-Size Replay: 1. LD BYTE A (1); 2. ST BYTE A (2); 3. LD HALF A (3); 4. LD BYTE B (4)
Load-Miss Load Replay: 1. +LD BYTE A (1); 2. ST BYTE A (2); 3. LD BYTE A (3); 4. LD BYTE B (4)
Load-Load Replay: 1. LD BYTE A (4); 2. ST BYTE A (2); 3. LD BYTE A (1); 4. LD BYTE B (3)
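The load-store case can be illustrated with a minimal Python sketch. It assumes the leading number in the example is program order and the parenthesized number is issue order; that reading, along with the `MemOp` type and the `load_store_violations` helper, is an assumption made for illustration and is not taken from the slides.

```python
from typing import List, NamedTuple, Tuple


class MemOp(NamedTuple):
    program_order: int  # position in the instruction stream (assumed meaning)
    kind: str           # "LD" or "ST"
    address: str
    issue_order: int    # order in which the scheduler issued it (assumed meaning)


def load_store_violations(ops: List[MemOp]) -> List[Tuple[int, int]]:
    """Return (store, load) program-order pairs where a younger load to the
    same address was issued before an older store."""
    violations = []
    for st in ops:
        if st.kind != "ST":
            continue
        for ld in ops:
            if (ld.kind == "LD"
                    and ld.address == st.address
                    and ld.program_order > st.program_order
                    and ld.issue_order < st.issue_order):
                violations.append((st.program_order, ld.program_order))
    return violations


# The Load-Store Replay example sequence from the slide.
example = [
    MemOp(1, "LD", "A", 1),
    MemOp(2, "ST", "A", 3),
    MemOp(3, "LD", "A", 2),
    MemOp(4, "LD", "B", 4),
]
print(load_store_violations(example))  # [(2, 3)]
```

Under these assumptions, the load at position 3 reads address A before the store at position 2 has written it, which is the ordering a load-store replay trap has to repair.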


Experimental Framework
• Simulator:
  – Sim-Alpha
  – 64K 2-Way IL1/DL1, 2MB 4-Way L2, 8 MSHRs per cache
  – Branch predictor: 4K BTB and 2K hybrid g-share/bimodal
  – 1024-entry store-wait predictor
  – Hardware data prefetcher: 2-Way 256-entry stride table and eight 8-entry stream buffers
  – Detailed DDR2 DRAM model with queuing delays
• Benchmarks:
  – SPEC2000


The Problem w/ ↑ OoO Capability
• Replay Traps:
  – Trap frequency increases by a factor of 5
  – Trap overhead increases by 10-60%
• L1 Cache Misses:
  – Number of cache misses increases by 15% (average)
  – fma3d, mesa, wupwise, eon, vpr, twolf, swim (20%–40%)
[Figure: Traps / 1000 Instructions for ROB-80, ROB-128, ROB-256, ROB-512]
[Figure: % Increase in L1 Cache Misses (compared to ROB-80) for ROB-80, ROB-128, ROB-256, ROB-512]


Why The Problem?
• OoO execution reorders both ALU and memory instructions
• Replay traps and cache misses are problems associated with memory instructions
• Hypothesis:
  – Reordering of ALU instructions poses little or no threat
  BUT
  – Reordering of memory instructions causes the problem


How many are issued in-order?
• 10 to 20% of memory instructions are issued in order with increased OoO capability
• Need to reduce reordering of memory instructions
[Figure: % Memory Instructions vs. Distance From Being Issued In Program Order (-W to W, with regions Issued Late, In-order Issue, Issued Early) for ROB 80, ROB 128, ROB 256, ROB 512; labeled values: 55%, 10%, 15%, 21%]
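A metric of this kind can be computed as in the rough sketch below. The slides do not define the distance precisely, so this assumes it is the issue slot minus the program-order index of each memory instruction (0 means in order, positive means late, negative means early); `issue_distance_histogram` is a hypothetical helper name.

```python
from collections import Counter
from typing import Dict, List


def issue_distance_histogram(issue_sequence: List[int]) -> Dict[int, float]:
    """issue_sequence[k] is the program-order index of the k-th memory
    instruction to issue. Returns {distance: percent of instructions}."""
    if not issue_sequence:
        return {}
    # Assumed definition: distance = issue slot - program-order index.
    counts = Counter(slot - prog for slot, prog in enumerate(issue_sequence))
    total = len(issue_sequence)
    return {d: 100.0 * n / total for d, n in sorted(counts.items())}


# Instructions 1 and 2 swap places; 0 and 3 issue in program order.
print(issue_distance_histogram([0, 2, 1, 3]))  # {-1: 25.0, 0: 50.0, 1: 25.0}
```

The entry at distance 0 is the in-order fraction, which is the quantity the slide reports shrinking as the ROB grows.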


Virtual Load/Store Queue (VLSQ)
• Traditional LSQ: Any ready instruction is issued
• Virtual LSQ: Only issue instructions residing in a virtual window
• Virtual window slides down only when instruction at virtual head is issued
[Figure: Traditional Load/Store Queue (entries MEM 0 to MEM N between LSQ HEAD and LSQ TAIL; virtual window size = Inf) beside a Virtual Load/Store Queue (virtual window size = 4, bounded by VIRTUAL HEAD and VIRTUAL TAIL); entries are marked ISSUED, READY, or NOT READY]
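The windowing rule can be sketched in Python as follows. This is an illustrative model under stated assumptions, not the simulator's implementation: the names `MemOp` and `VirtualLSQ` are hypothetical, issue width per cycle is unlimited, the virtual head is taken to be the oldest unissued entry, and `window_size=None` stands in for an infinite window (the traditional LSQ).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MemOp:
    name: str
    ready: bool = False
    issued: bool = False


class VirtualLSQ:
    def __init__(self, window_size: Optional[int] = None) -> None:
        self.window_size = window_size   # None models an infinite window
        self.entries: List[MemOp] = []   # kept in program order
        self.virtual_head = 0            # assumed: oldest unissued entry

    def insert(self, op: MemOp) -> None:
        self.entries.append(op)

    def issue_cycle(self) -> List[str]:
        """Issue every ready entry inside the virtual window, then slide."""
        if self.window_size is None:
            virtual_tail = len(self.entries)
        else:
            virtual_tail = min(len(self.entries),
                               self.virtual_head + self.window_size)
        issued_now = []
        for op in self.entries[self.virtual_head:virtual_tail]:
            if op.ready and not op.issued:
                op.issued = True
                issued_now.append(op.name)
        # The window slides only once the entry at the virtual head has issued.
        while (self.virtual_head < len(self.entries)
               and self.entries[self.virtual_head].issued):
            self.virtual_head += 1
        return issued_now


def build(window_size: Optional[int]) -> VirtualLSQ:
    q = VirtualLSQ(window_size)
    for i in range(6):
        q.insert(MemOp(f"MEM {i}", ready=i in (1, 5)))  # only MEM 1, MEM 5 ready
    return q


traditional = build(None)
virtual = build(4)
print(traditional.issue_cycle())  # ['MEM 1', 'MEM 5']
print(virtual.issue_cycle())      # ['MEM 1']
```

With a window of 4, MEM 5 is ready but sits outside the window, so it waits until MEM 0 issues and the window slides. A window of 1 degenerates to strict in-order issue of memory instructions.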


VLSQs: Replay Trap Stats
• ↑ OoO Aggressiveness (ROB from 80 to 512 entries)
  – 5X increase in trap frequency
    • VLSQs reduce trap frequency by factors of 2-30
  – 25-60% of total execution time spent in traps
    • VLSQs reduce total time handling traps by 10-40%
Direct correlation between memory ordering and replay traps
[Figure: Replay Traps / 1000 Instructions and Replay Trap Penalty for ROB-80, ROB-128, ROB-256, ROB-512; VLSQ sizes Inf, 64, 32, 16, 8, 4, 1]


VLSQs: DL1 Cache Stats
• ↑ OoO Aggressiveness (ROB from 80 to 512 entries)
  – 55% increase in L1 cache accesses
    • VLSQs reduce cache accesses by up to 55%
  – 15% increase in L1 cache misses
    • VLSQs reduce cache misses by up to 10%
Direct correlation between memory ordering and cache accesses
[Figure: Normalized Accesses and Normalized Misses for ROB-80, ROB-128, ROB-256, ROB-512; VLSQ sizes Inf, 64, 32, 16, 8, 4, 1]


VLSQ Performance
• Applications show three different behaviors
  – Group I: Performance same – non-memory intensive apps
  – Group II: Performance loss – memory intensive apps
  – Group III: Performance benefit – alleviating negative effects
• VLSQ of size 16 or 32 is ideal across all apps
[Figure: CPI breakdown (MEMORY, ALU, OTHER) at ROB-512 for Group I, Group II, and Group III; VLSQ sizes Inf, 64, 32, 16, 8, 4, 1]


Power Savings with VLSQs
• Reducing Replay Traps
  – 5-60% power savings in fetch/map/exec hardware
• Reducing Cache Accesses and Misses
  – 5-65% savings in L1 data cache
• Savings of 25-30% using VLSQs of 16 or 32
[Figure: Execution Units (Normalized to Inf) and L1 Cache (Normalized to Inf) for VLSQ 1, 4, 8, 16, 32, 64 at ROB 80, ROB 128, ROB 256, ROB 512]


Windowing of Load/Store Queue
• Static Mechanism (This Study):
  – Statically set the size of the virtual window
  – Drawback: Memory ILP lost during execution phases where negative effects do not exist
• Dynamic Mechanism (Future Work):
  – Intuition that negative effects do not always exist
  – Dynamically vary virtual window size based on application execution behavior
    • Virtual window initially infinite
    • Vary window size based on certain thresholds
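One way the dynamic mechanism could look is sketched below. The slides leave the policy open, so everything here is hypothetical: the `DynamicWindowController` name, the choice of replay traps per 1000 instructions as the monitored signal, the size ladder, and both threshold values are assumptions made only for illustration.

```python
from typing import Optional, Sequence


class DynamicWindowController:
    """Hypothetical policy: start with an infinite window and step through a
    ladder of sizes as the observed trap rate crosses two thresholds."""

    def __init__(self,
                 sizes: Sequence[Optional[int]] = (None, 64, 32, 16),
                 shrink_above: float = 5.0,   # assumed threshold, traps/1000 instr
                 grow_below: float = 1.0) -> None:  # assumed threshold
        self.sizes = list(sizes)   # None models an infinite window
        self.level = 0             # virtual window initially infinite
        self.shrink_above = shrink_above
        self.grow_below = grow_below

    @property
    def window_size(self) -> Optional[int]:
        return self.sizes[self.level]

    def update(self, traps_per_1000_instr: float) -> Optional[int]:
        """Call once per sampling interval; returns the new window size."""
        if traps_per_1000_instr > self.shrink_above and self.level < len(self.sizes) - 1:
            self.level += 1        # negative effects present: restrict reordering
        elif traps_per_1000_instr < self.grow_below and self.level > 0:
            self.level -= 1        # negative effects absent: recover memory ILP
        return self.window_size


ctrl = DynamicWindowController()
print(ctrl.window_size)    # None
print(ctrl.update(10.0))   # 64
print(ctrl.update(0.5))    # None
```

The gap between the two thresholds acts as hysteresis, so the window does not oscillate when the trap rate hovers near a single cut-off.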


Summary
• In This Paper:
  – Problem: Increase in replay traps and cache misses
  – Reason: Reordering of memory instructions
  – Solution: Virtual Load/Store Queues (VLSQs)
• Points To Take Home:
  – Mechanism to improve performance causes degradation in the memory subsystem
  – OoO cores shouldn’t always be on full throttle
  – Because… at times we’ll NEED to tug on the reins


BACKUP SLIDES

THANK YOU!!!!


Agenda
• Motivation: Why is this study important?
• Paper Contributions
  – The Problem
  – The Reason
• Background
• Virtual Load Store Queues (VLSQs)
• A Limit Study Using VLSQs
• Summary


Background – Replay Traps
• Replay traps are hardware enforced to
  – Force accesses to a particular memory location in order
    • Ensure CORRECT execution
    • Ensure multi-processor memory consistency
  – Handle different sized accesses to same address
• Replay traps are NOT related to OS trap events, i.e. no handler support is needed
• Recovering from a replay trap
  – Similar to handling branch mispredicts
  – Pipeline is flushed and execution restarts from the replay trap causing instruction


OoO Hardware – Background
• Reorder Buffer (ROB), Issue Queues (Integer or Floating Point), and Load/Store Queues
[Figure: pipeline block diagram with FETCH (BP, LP, IC), RENAME UNIT (RN), SCH, and the ROB, IQ, FQ, LQ, and SQ structures, each drawn with HD and TL pointers]
BP = Branch Predictor, LP = Line Predictor, IC = Instruction Cache, RN = Register Rename


The Problem – ↑ L1 Cache Misses
• Increasing ROB size from 80 to 512
  – 5–40% increase in L1 cache misses when compared to ROB-80
[Figure: per-benchmark results for ROB 128, ROB 256, ROB 512]


The Problem – ↑ Replay Traps
• Increasing ROB size from 80 to 512
  – 10–60% increase in replay trap overhead
[Figure: per-benchmark results for ROB 80, ROB 128, ROB 256, ROB 512]


VLSQ Performance
[Figure: three panels, each covering ROB-80, ROB-128, ROB-256, ROB-512]


Replay Trap Distribution
[Figure legend: Load-Store, Wrong-Size, Load-Load, Load-Miss Load]