ReSlice: Selective Re-execution of Long-retired Misspeculated Instructions Using Forward Slicing

Smruti R. Sarangi, Wei Liu, Josep Torrellas, Yuanyuan Zhou

University of Illinois at Urbana-Champaign

http://iacoma.cs.uiuc.edu

Presented at CARG, March 8th, 2006, by Jason Zebchuk


ReSlice: Selective Re-Execution of Long-Retired Misspeculated Instructions Using Forward Slicing Wei Liu, University of Illinois 2

Data Value Speculation

Predict the value and proceed speculatively

When the correct value is known, squash and re-execute if the prediction was wrong

Initial proposals

• L1 data value

• Store/Load Data dependences

Aggressive novel proposals: retire speculative instructions

• Values of L2 misses

• Thread independence in Thread-Level Speculation (TLS)

• Speculative synchronization


Long-latency Speculation

Misprediction recovery is very wasteful

Most discarded instructions are still useful

[Figure: timeline from Checkpoint through Prediction to Resolution, marking the speculatively retired instructions. Labels: 210 instructions, 6.6 instructions.]


Contributions (I)

ReSlice: Architecture to buffer forward slice and re-execute it on misprediction

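[Figure: timeline with Checkpoint, Prediction, and Resolution. The buffered forward slice is re-executed at resolution, with one path labeled "If succeeds" and another labeled "If fails".]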


Contributions (II)

A sufficient condition for correct slice re-execution and merge

• Guarantees that the program state is repaired correctly

Application to Thread-Level Speculation (TLS) on SpecInt

• Prediction: task live-ins

• Removes 61% of task squashes

• Speedup: up to 33% over TLS (avg 12%)

• E×D² reduction: avg 20%


Outline

Motivation and Contributions

Idea of ReSlice

Architecture Design

Experimental Results

Conclusions


Idea of ReSlice

Initial execution of the task

• Predict the value of a “risky” load and continue

• e.g., loads that miss in L2, exposed loads in TLS, …

Buffer in HW the forward slice of the load

If a misprediction happens, re-execute the slice with the correct value

• If it succeeds: merge the register and memory state and continue

• If it fails: roll back and re-execute (conventional recovery)


Why is this Challenging?

Initial execution is speculative, so buffered slices may not be correct

…

#1: LD R1, mem_A (“risky” load)

…

#2: ST R2, (R1)

…

#3: LD R5, mem[0x20]

…

Buffered slice: #1 and #2

Initial execution: R1 = 0x10, so #2 is ST R2, mem[0x10]

Correct execution: R1 = 0x20, so #2 is ST R2, mem[0x20], and LD R5, mem[0x20] depends on it

Problem: Instruction #3 is not buffered!
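The problem on this slide can be reproduced with a toy model. The sketch below is only an illustration under stated assumptions: the helper names (`run`, `naive_slice_repair`), the initial memory contents, and the value held in R2 are all made up, and the "repair" shown is a deliberately naive one, not the ReSlice mechanism.

```python
# Toy model of the three-instruction example. Everything here is illustrative.

def run(r1_value, memory, r2=99):
    """Run #2 and #3, given the value that load #1 placed in R1."""
    mem = dict(memory)
    regs = {"R1": r1_value, "R2": r2}
    mem[regs["R1"]] = regs["R2"]      # #2: ST R2, (R1)
    regs["R5"] = mem[0x20]            # #3: LD R5, mem[0x20]
    return regs, mem


def naive_slice_repair(spec_regs, spec_mem, initial_memory, correct_r1):
    """Re-execute only the buffered slice (#1, #2) on top of the speculative state."""
    regs, mem = dict(spec_regs), dict(spec_mem)
    old_addr = spec_regs["R1"]
    mem[old_addr] = initial_memory[old_addr]   # undo the mispredicted store
    regs["R1"] = correct_r1                    # #1 with the correct value
    mem[correct_r1] = regs["R2"]               # #2 re-executed
    return regs, mem                           # R5 is untouched: #3 was never buffered


initial_memory = {0x10: 1, 0x20: 2}                  # assumed contents
spec_regs, spec_mem = run(0x10, initial_memory)      # prediction: R1 = 0x10
good_regs, good_mem = run(0x20, initial_memory)      # correct:    R1 = 0x20
fixed_regs, fixed_mem = naive_slice_repair(spec_regs, spec_mem, initial_memory, 0x20)

print(fixed_mem == good_mem)                 # True: memory is repaired
print(fixed_regs["R5"], good_regs["R5"])     # 2 99: R5 is stale
```

Memory ends up correct, but R5 keeps the value it loaded before the store moved to mem[0x20], which is exactly why re-executing the buffered slice alone is not enough here.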


Solution: Run-time Address Check

…

#1: LD R1, mem_A (“risky” load)

…

#2: ST R2, (R1) (“inhibiting” store)

…

#3: LD R5, mem[0x20]

…

Buffered slice: #1 and #2

Initial execution: R1 = 0x10, so #2 is ST R2, mem[0x10]

Slice re-execution: R1 = 0x20, so #2 is ST R2, mem[0x20]

Check: the store addresses are different, and mem[0x20] was loaded in the initial execution


Problems

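Re-executing a slice that was buffered during the initial speculative execution can go wrong in three ways:

• Missing Insts (example just shown)

• Extra Insts

• Data Corruption

[Figure labels: Buffered Slice, Initial Speculative Execution, Slice Re-execution, “Dangling” Load, “Inhibiting” Load]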


A Sufficient Condition

Guarantees correct

• Slice re-execution

• State merge with program

Two sub-conditions

• Branches take the same direction in both the initial execution and the slice re-execution

• None of the three problems above is present

Easy for HW to check at run-time

For details, please see the paper

• (Briefly: compare addresses accessed in old and new run, also look for references to same address in first run)
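The two sub-conditions can be sketched as a single check. This is a conservative simplification written for illustration, not the condition as defined in the paper: the `SliceTrace` record, the `reexecution_is_safe` function, and the rule "any slice access whose address changed must not touch a location used outside the slice in the first run" are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class SliceTrace:
    """Hypothetical record of what the slice did in one run."""
    branch_outcomes: list                       # taken / not taken, in program order
    store_addrs: list                           # one address per slice store
    load_addrs: list = field(default_factory=list)


def reexecution_is_safe(initial, rerun, addrs_used_outside_slice):
    """Return True if the re-executed slice may be merged (simplified check)."""
    # Sub-condition 1: every slice branch goes the same way in both runs.
    if initial.branch_outcomes != rerun.branch_outcomes:
        return False
    # Sub-condition 2 (simplified, conservative): compare addresses between
    # the old and new run; if one changed, neither the old nor the new
    # address may have been referenced by non-slice code in the first run.
    old_addrs = initial.store_addrs + initial.load_addrs
    new_addrs = rerun.store_addrs + rerun.load_addrs
    for old, new in zip(old_addrs, new_addrs):
        if old != new and (old in addrs_used_outside_slice
                           or new in addrs_used_outside_slice):
            return False
    return True


first = SliceTrace(branch_outcomes=[True], store_addrs=[0x10])
second = SliceTrace(branch_outcomes=[True], store_addrs=[0x20])
print(reexecution_is_safe(first, second, {0x20}))   # False: mem[0x20] was loaded outside the slice
print(reexecution_is_safe(first, second, set()))    # True: nothing else touched either address
```

When the check returns False, the fallback is the conventional recovery described earlier: squash and restart from the checkpoint.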


How does ReSlice Work?

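Timeline: Checkpoint, Prediction, Resolution

• Between Prediction and Resolution: Slice Collection

• At Resolution: compare the correct and predicted value

• Correct prediction: no recovery is needed

• Misprediction: Slice Re-Execution

• Re-execution correct: State Merging

• Re-execution incorrect: conventional recovery from the checkpoint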
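The decision flow on this slide can be written down as a small function. This is a minimal sketch: `on_resolution` and its three callback parameters are hypothetical names standing in for the hardware actions, and the returned strings are only labels.

```python
def on_resolution(predicted, actual, reexecute_slice, merge_state, squash_and_restart):
    """Decide what happens once the real value of the predicted load is known."""
    if predicted == actual:
        return "continue"            # correct prediction: nothing to repair
    if reexecute_slice(actual):      # True if the sufficient condition held
        merge_state()                # fold slice results into the program state
        return "merged"
    squash_and_restart()             # fall back to the checkpoint
    return "squashed"


# Example: a misprediction whose slice re-executes successfully.
log = []
outcome = on_resolution(
    predicted=0x10,
    actual=0x20,
    reexecute_slice=lambda value: True,
    merge_state=lambda: log.append("merge"),
    squash_and_restart=lambda: log.append("squash"),
)
print(outcome, log)   # merged ['merge']
```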


Architecture Design

[Figure: architecture block diagram. Components shown: CPU, Register State, Cache, Memory State, Re-Execution Unit (REU), and the Slice Buffer with its Slice Descriptors, Slice Info, Live Ins, and Buffered Instr.]


Other Architectural Requirements

Tag physical registers with “SliceTag”

Tag Cache for storing addresses accessed by slice instructions

Undo Log records first store to each memory location in each Slice

High-coverage Predictor to predict which instructions to use as seeds.

• Reuses the Dependence and Value Predictor already used by TLS

Uses TLS information to identify Speculatively Read/Written locations in cache
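As an illustration of the Undo Log item above, here is a minimal software sketch. It assumes that "records first store" means saving the value overwritten by the first slice store to each location, so that the slice's stores can be rolled back; the `UndoLog` class and its method names are hypothetical, and a dictionary stands in for memory.

```python
_ABSENT = object()   # marks a location that held no value before the slice wrote it


class UndoLog:
    """Keeps, per address, the value overwritten by the FIRST store to it."""

    def __init__(self):
        self._saved = {}

    def store(self, memory, addr, value):
        if addr not in self._saved:                  # only the first store is logged
            self._saved[addr] = memory.get(addr, _ABSENT)
        memory[addr] = value

    def undo(self, memory):
        for addr, old in self._saved.items():
            if old is _ABSENT:
                memory.pop(addr, None)
            else:
                memory[addr] = old
        self._saved.clear()


memory = {0x10: 1}
log = UndoLog()
log.store(memory, 0x10, 5)
log.store(memory, 0x10, 7)     # second store to 0x10: not logged again
log.store(memory, 0x20, 9)
log.undo(memory)
print(memory)                  # {16: 1}
```

Logging only the first store per location is enough because later stores overwrite values the slice itself produced, which never need to be restored.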


Step 1: Slice Collection

• Fill up the Slice Buffer when a prediction is made

• Track both register and memory dependences

• Save live-in operands and slice instructions

• Slices are buffered when instructions are retired



An Example

Slice Desc.

Buffered Insts: ADD R3, R1, R2

Buffered Insts: ST R3, mem[R4]

Live Ins: R4 = 0x10
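Slice collection amounts to following register and memory dependences forward from the predicted load. The sketch below is a software analogy under assumptions, not the hardware design: `Inst` and `collect_slice` are hypothetical names, memory locations are modeled as plain strings such as "mem[0x10]", and the instruction stream is loosely modeled on the example above with R1 assumed to be the predicted value.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Inst:
    text: str
    dest: Optional[str]        # register or memory location written, if any
    srcs: Tuple[str, ...]      # registers / memory locations read


def collect_slice(retired, seed):
    """Walk retired instructions in order and gather the forward slice of `seed`."""
    tainted = {seed}           # locations whose value depends on the prediction
    slice_insts, live_ins = [], set()
    for inst in retired:
        if any(src in tainted for src in inst.srcs):
            slice_insts.append(inst)
            # Operands that do not depend on the seed must be saved as live-ins.
            live_ins.update(src for src in inst.srcs if src not in tainted)
            if inst.dest is not None:
                tainted.add(inst.dest)
        elif inst.dest in tainted:
            tainted.discard(inst.dest)   # overwritten with an independent value
    return slice_insts, live_ins


retired = [
    Inst("ADD R3, R1, R2", dest="R3", srcs=("R1", "R2")),
    Inst("ST R3, mem[R4]", dest="mem[0x10]", srcs=("R3", "R4")),
    Inst("SUB R6, R7, R8", dest="R6", srcs=("R7", "R8")),       # independent
    Inst("LD R5, mem[0x10]", dest="R5", srcs=("mem[0x10]",)),   # memory dependence
]
insts, live_ins = collect_slice(retired, seed="R1")
print([i.text for i in insts])   # ['ADD R3, R1, R2', 'ST R3, mem[R4]', 'LD R5, mem[0x10]']
print(sorted(live_ins))          # ['R2', 'R4']
```

Treating a stored-to address as just another tainted location is what lets the same loop follow both register and memory dependences.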


Step 2: Slice Re-Execution

• REU takes over after a misprediction

• In-order execution

• Sufficient condition is checked during slice re-execution

• On success, merge the register and memory state; otherwise, squash the task and restart



Step 3: State Merging

• Copy live registers back to CPU register file

• Merge memory state (details in paper)



Multiple Overlapping Slices

One slice may change live-ins of the other slice

Detect overlapping slices during slice collection

Combine the instructions of the overlapping slices and re-execute them in program order

[Figure: two mispredictions whose forward slices overlap.]
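Combining overlapping slices in program order is essentially a sorted merge with duplicates removed. In this sketch each buffered instruction carries a hypothetical sequence number giving its position in program order; the `combine_slices` function and the tuple layout are assumptions made for illustration.

```python
import heapq


def combine_slices(*slices):
    """Merge slices of (sequence_number, instruction) pairs into program order.

    Each input slice must already be sorted by sequence number. An instruction
    that belongs to more than one slice appears once in the result.
    """
    combined = []
    for seq, inst in heapq.merge(*slices):
        if not combined or combined[-1][0] != seq:
            combined.append((seq, inst))
    return combined


slice_a = [(1, "LD R1, mem_A"), (3, "ADD R3, R1, R2"), (7, "ST R3, mem[R4]")]
slice_b = [(2, "LD R2, mem_B"), (3, "ADD R3, R1, R2"), (9, "SUB R6, R3, R5")]
for seq, inst in combine_slices(slice_a, slice_b):
    print(seq, inst)
```

Here the ADD at position 3 consumes values from both predicted loads, so it sits in both slices and is re-executed only once in the combined sequence.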


Outline

Motivation and Contributions

Idea of ReSlice

Architecture Design

Experimental Results

Conclusions


Methodology

Simulated CMP architecture

• 5 GHz

• Private 16k L1 per core; 1 MB Shared L2

SpecInt 2000 applications

Configurations compared

• TLS+ReSlice: four 3-issue cores with TLS and ReSlice

• Baseline (TLS): four 3-issue cores with TLS

• Serial: one 3-issue core


Slice Characterization (Averages)

10.4 instructions

4.5 Reg live-ins

1.0 Mem live-ins

2.2 Reg live-outs

1.9 Mem live-outs

1.6 slices per task

15% of tasks have overlapping slices


Number of Task Squashes

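[Figure: bar chart of the number of task squashes (y-axis "# of squashes (%)", scale 0 to 1) for TLS and TLS+ReSlice on bzip2, crafty, gap, gzip, mcf, parser, twolf, vortex, and vpr, plus the arithmetic mean (A.Mean).]

61% of squashes are avoided thanks to ReSlice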


Performance

[Figure: bar chart of speedup (y-axis scale 0 to 1.4) for Serial, TLS, and TLS+ReSlice on bzip2, crafty, gap, gzip, mcf, parser, twolf, vortex, and vpr, plus the geometric mean (G.Mean).]

Speedup of TLS+ReSlice

• Over TLS: up to 33% (avg 12%)

• Over Serial: avg 45%


Energy × Delay²

[Figure: bar chart of normalized E×D² (y-axis scale 0 to 1.2) for TLS and TLS+ReSlice on bzip2, crafty, gap, gzip, mcf, parser, twolf, vortex, and vpr, plus the geometric mean (G.Mean).]

E×D² reduction: avg 20% over TLS
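For readers less familiar with the metric, the relation below spells out what a normalized E×D² figure means. The subscripts and the symbol s (speedup over TLS) are notation introduced here, not taken from the slides, and the last line is plain arithmetic on the average speedup rather than a measured energy result; per-benchmark averages do not generally compose this way.

```latex
\[
\frac{(E \times D^{2})_{\mathrm{TLS+ReSlice}}}{(E \times D^{2})_{\mathrm{TLS}}}
  = \frac{E_{\mathrm{TLS+ReSlice}}}{E_{\mathrm{TLS}}}
    \cdot \left( \frac{D_{\mathrm{TLS+ReSlice}}}{D_{\mathrm{TLS}}} \right)^{2}
  = \frac{E_{\mathrm{TLS+ReSlice}}}{E_{\mathrm{TLS}}} \cdot \frac{1}{s^{2}},
\qquad
s = \frac{D_{\mathrm{TLS}}}{D_{\mathrm{TLS+ReSlice}}}
\]
\[
s = 1.12 \;\Rightarrow\; \frac{1}{s^{2}} = \frac{1}{1.2544} \approx 0.797
\]
```

Because delay enters squared, the metric rewards speedups more heavily than energy savings; a 20% reduction corresponds to a normalized value of about 0.80.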


Conclusions

Generic architecture for forward slice re-execution

Sufficient condition for correct re-execution and merge

Improves TLS on SpecInt

• Speedups: up to 33% over TLS (avg 12%)

• E×D² reduction: avg 20% over TLS

Recovering wasted work is a promising approach

• Boost performance

• Energy efficiency
