45
Advanced Microarchitecture Lecture 8: Data-Capture Instruction Schedulers

Advanced Microarchitecture

  • Upload
    dillan

  • View
    64

  • Download
    0

Embed Size (px)

DESCRIPTION

Advanced Microarchitecture. Lecture 8: Data-Capture Instruction Schedulers. Out-of-Order Execution. The goal is to execute instructions in dataflow order as opposed to the sequential order specified by the ISA - PowerPoint PPT Presentation

Citation preview

Page 1: Advanced  Microarchitecture

Advanced MicroarchitectureLecture 8: Data-Capture Instruction Schedulers

Page 2: Advanced  Microarchitecture

2

Out-of-Order Execution• The goal is to execute instructions in

dataflow order as opposed to the sequential order specified by the ISA

• The renamer plays a critical role by removing all of the false register dependencies

• The scheduler is responsible for:– for each instruction, detecting when all

dependencies have been satisifed (and therefore ready to execute)

– propagating readiness information between instructions

Lecture 8: Data-Capture Instruction Schedulers

Page 3: Advanced  Microarchitecture

3

Out-of-Order Execution (2)

Lecture 8: Data-Capture Instruction Schedulers

StaticProgram

Fetc

hDynamic

InstructionStream

Rena

me

RenamedInstruction

Stream

Sche

dule

DynamicallyScheduled

Instructions

Out-of-order =out of the originalsequential order

Page 4: Advanced  Microarchitecture

4

Superscalar != Out-of-Order

Lecture 8: Data-Capture Instruction Schedulers

A: R1 = Load 16[R2]B: R3 = R1 + R4C: R6 = Load 8[R9]D: R5 = R2 – 4E: R7 = Load 20[R5]F: R4 = R4 – 1G: BEQ R4, #0

CDE

cache miss

BCDEFG

10 cycles

BFG

7 cyclesA

B

C D

E

F

G

CD EFG

B5 cycles

B CDE F

G8 cycles

Acache miss

1-wideIn-Order

Acache miss

2-wideIn-Order

A

1-wideOut-of-Order

Acache miss

2-wideOut-of-Order

Page 5: Advanced  Microarchitecture

5

Data-Capture Scheduler• At dispatch, instruction

read all available operands from the register files and store a copy in the scheduler

• Unavailable operands will be “captured” from the functional unit outputs

• When ready, instructions can issue directly from the scheduler without reading additional operands from any other register files

Lecture 8: Data-Capture Instruction Schedulers

Fetch &Dispatch

ARF PRF/ROB

Data-CaptureScheduler

FunctionalUnits

Physical register update

Bypass

Page 6: Advanced  Microarchitecture

6

Non-Data-Capture Scheduler

Lecture 8: Data-Capture Instruction Schedulers

Fetch &Dispatch

ARF PRF

Scheduler

FunctionalUnits

Physical registerupdate

More on thisnext lecture!

Page 7: Advanced  Microarchitecture

7

Components of a Scheduler

Lecture 8: Data-Capture Instruction Schedulers

A B

C

D

F

E

G

Buffer for unexecutedinstructions

Method for trackingstate of dependencies(resolved or not)

Arbiter B Method for choosing between multiple ready instructions competing for the same resource

Method for notificationof dependency resolution

“Scheduler Entries” or“Issue Queue” (IQ) or“Reservation Stations” (RS)

Page 8: Advanced  Microarchitecture

8

Lather, Rinse, Repeat…• Scheduling Loop or Wakeup-Select Loop

– Wake-Up Part:• Instructions selected for execution notify dependents

(children) that the dependency has been resolved• For each instruction, check whether all input

dependencies have been resolved– if so, the instruction is “woken up”

– Select Part:• Choose which instructions get to execute

– If 3 add instructions are ready at the same time, but there are only two adders, someone must wait to try again in the next cycle (and again, and again until selected)

Lecture 8: Data-Capture Instruction Schedulers

Page 9: Advanced  Microarchitecture

9

Scalar Scheduler (Issue Width = 1)

Lecture 8: Data-Capture Instruction Schedulers

T14T16

T39T6

T17T39

T15T39

==

==

==

==

T39

T8

T17

T42

Select Logic

To Execute Logic

Tag Broadcast Bus

Page 10: Advanced  Microarchitecture

10

Tags,ReadyLogic

SelectLogic

Superscalar Scheduler (detail of one entry)

Lecture 8: Data-Capture Instruction Schedulers

TagBroadcast

Buses

====

====

bid

grants

SrcL RdyLValL IssuedSrcL RdyLValL Dst

Page 11: Advanced  Microarchitecture

11

Interaction with Execution

Lecture 8: Data-Capture Instruction Schedulers

A Select Logic

SRD SL opcode ValL ValR

ValL ValR

ValL ValRValL ValR

Payload RAM

Page 12: Advanced  Microarchitecture

12

Again, But Superscalar

Lecture 8: Data-Capture Instruction Schedulers

A

B

Select Logic

SR

SR

D

D

SL

SL

opcode ValL ValR

ValL ValR

ValL ValRValL ValR

opcode ValL ValRValL ValR

ValL ValR

The schedulercaptures thedata, hence “Data-Capture”

Page 13: Advanced  Microarchitecture

13

Issue Width• Maximum number of instructions selected

for execution each cycle is the issue width– Previous slide showed an issue width of two– The slide before that showed the details of a

scheduler entry for an issue width of four• Hardware requirements:

– Typically, an Issue Width of N requires N tag broadcast buses

– Not always true: can specialize such that, for example, one “issue slot” can only handle branches

Lecture 8: Data-Capture Instruction Schedulers

Page 14: Advanced  Microarchitecture

14

Pipeline Timing

Lecture 8: Data-Capture Instruction Schedulers

Select Payload

Wakeup

A: Execute

CaptureB:

tag broadcastresultbroadcast

enablecapture

on tag matchSelect Payload Execute

WakeupC: enablecapture

tag broadcast

Cycle i Cycle i+1

A

B

C

Page 15: Advanced  Microarchitecture

15

Pipelined Timing

Lecture 8: Data-Capture Instruction Schedulers

Select PayloadA: Execute

CaptureB:

tag broadcastresultbroadcast

enablecaptureSelect Payload Execute

CaptureC:enablecapture

tag broadcast

Cycle i Cycle i+1

Select Payload Execute

Cycle i+2 Cycle i+3

Wakeup

Wakeup

A

B

CCan’t read and writepayload RAM at the

same time; may needto bypass the results

Page 16: Advanced  Microarchitecture

16

Pipelined Timing (2)

• Previous slide placed the pipeline boundary at the writing of the ready bits

• This slide shows a pipeline where latches are placed right before the tag broadcast

Lecture 8: Data-Capture Instruction Schedulers

Select PayloadA: Execute

CaptureB:

tag broadcastresultbroadcast

enablecapture

Select Payload Execute

Cycle i Cycle i+1 Cycle i+2

Wakeup

A

BWakeup

Page 17: Advanced  Microarchitecture

17

More Pipelined Timing

Lecture 8: Data-Capture Instruction Schedulers

Select PayloadA: Execute

CaptureB:tag broadcast

result broadcastand bypass

enablecapture

C:

Cycle i

WakeupSelect

Wakeup

Payload Execute

Select Payload ExecCapture

Cycle i+1 Cycle i+2 Cycle i+3

CaptureWakeuptag match

on firstoperand tag match

on secondoperand

(now C is ready)No simultaneous

read/write!

A

B

C

Need a secondlevel of bypassing

Page 18: Advanced  Microarchitecture

18

More-er Pipelined Timing

Lecture 8: Data-Capture Instruction Schedulers

Select PayloadA: Execute

CaptureC:

Cycle i

Wakeup

i+1 i+2 i+3

Select Payload Execute

Wakeup CaptureSelect Payload Ex

i+4 i+5

D:

A

C

B

D

Wakeup Capture

B: Select Select Payload Execute

A&B bothready, onlyA selected,B bids again

AC and CD mustbe bypassed, butno bypass for BD

Dependent instructionscannot execute in

back-to-back cycles!

Page 19: Advanced  Microarchitecture

19

Critical Loops• Wakeup-Select Loop cannot be trivially

pipelined while maintaining back-to-back execution of dependent instructions

Lecture 8: Data-Capture Instruction Schedulers

ABC

A

B

C

RegularScheduling

No Back-to-Back• Worst-case IPC reduction by ½

• Shouldn’t be that bad (previous slide had IPC of 4/3)

• Studies indicate 10-15% IPC penalty

• “Loose Loops Sink Chips”, Borch et al.

Page 20: Advanced  Microarchitecture

20

IPC vs. Frequency• 10-15% IPC drop doesn’t seem bad if we

can double the clock frequency

Lecture 8: Data-Capture Instruction Schedulers

1000ps 500ps500ps2.0 IPC, 1GHz 1.7 IPC, 2GHz

2 BIPS 3.4 BIPS• Frequency doesn’t double

– latch/pipeline overhead– unbalanced stages

• Other sources of IPC penalties– branches: ↑ pipe depth, ↓ predictor size, ↑ predict-to-

update latency– caches/memory: same time in seconds, ↑ frequency,

more cycles• Power limitations: more logic, higher frequency

P=½CV2f

900ps 450ps 450ps900ps 350 550

1.5GHz

Page 21: Advanced  Microarchitecture

21

Select Logic• Goal: minimize DFG height (execution time)• NP-Hard

– Precedence Constrained Scheduling Problem– Even harder because the entire DFG is not

known at scheduling time• Scheduling decisions made now may affect the

scheduling of instructions not even fetched yet• Heuristics

– For performance– For ease of implementation

Lecture 8: Data-Capture Instruction Schedulers

Page 22: Advanced  Microarchitecture

22

Simple Select Logic

Lecture 8: Data-Capture Instruction Schedulers

Scheduler Entries

1

S entriesyields O(S)gate delay

Grant0 = 1Grant1 = !Bid0Grant2 = !Bid0 & !Bid1Grant3 = !Bid0 & !Bid1 & !Bid2Grantn-1 = !Bid0 & … & !Bidn-21

x0x1x2x3x4x5x6x7x8

grant0xi = Bidi

granti

grant1grant2grant3grant4grant5grant6grant7grant8grant9

O(log S) gates

Page 23: Advanced  Microarchitecture

23

Simple Select Logic• Instructions may be located in scheduler

entries in no particular order– The first ready entry may be the oldest,

youngest, anywhere in between– Simple select results in a “random” schedule

• the schedule is still “correct” in that no dependencies are violated

• it just may be far from optimal

Lecture 8: Data-Capture Instruction Schedulers

Page 24: Advanced  Microarchitecture

24

Oldest First Select• Intuition:

– An instruction that has just entered the scheduler will likely have few, if any, dependents (only intra-group)

– Similarly, the instructions that have been in the scheduler the longest (the oldest) are likely to have the most dependents

– Selecting the oldest instructions has a higher chance of satisfying more dependencies, thus making more instructions ready to execute more parallelism

Lecture 8: Data-Capture Instruction Schedulers

Page 25: Advanced  Microarchitecture

25

Implementing Oldest First Select

Lecture 8: Data-Capture Instruction Schedulers

BA

CDEF

HWrite instructionsinto scheduler inprogram order

G

EF

HG

Compress Up

KL

EF

HG

IJ

Newlydispatched

IJ

BD

Page 26: Advanced  Microarchitecture

26

Implementing Oldest First Select (2)• Compressing buffers are very complex

– gates, wiring, area, power

Lecture 8: Data-Capture Instruction Schedulers

Ex. 4-wideNeed up toshift by 4

An entireinstruction’s

worth ofdata: tags,opcodes,

immediates,readiness, etc.

Page 27: Advanced  Microarchitecture

27

Implementing Oldest First Select (3)

Lecture 8: Data-Capture Instruction Schedulers

B

C

A

D

E

F

H

G

4

6053172

0

3

∞2

2

0

0

Age-Aware Select Logic

Grant

Page 28: Advanced  Microarchitecture

28

Handling Multi-Cycle Instructions

Lecture 8: Data-Capture Instruction Schedulers

Sched PayLd Exec

Sched PayLd Exec

Add R1 = R2 + R3

Xor R4 = R1 ^ R5

Sched PayLd Exec Add R4 = R1 + R5

Sched PayLd Exec Mul R1 = R2 × R3Exec Exec

Add attemps to execute too early!Result not ready for another two cycles.

Page 29: Advanced  Microarchitecture

29

Delayed Tag Broadcast

• It works, but…– Must make sure tag broadcast bus will be

available N cycles in the future when needed– Bypass, data-capture potentially get more

complex

Lecture 8: Data-Capture Instruction Schedulers

Sched PayLd Exec Add R4 = R1 + R5

Sched PayLd Exec Mul R1 = R2 × R3Exec Exec

Assume pipelined such that tagbroadcast occurs at cycle boundary

Page 30: Advanced  Microarchitecture

30

Delayed Tag Broadcast (2)

Lecture 8: Data-Capture Instruction Schedulers

Sched PayLd Exec Add R4 = R1 + R5

Sched PayLd Exec Mul R1 = R2 × R3Exec Exec

Sched PayLd Exec Sub R7 = R8 – #1

Sched PayLd Exec Xor R9 = R9 ^ R6

Assumeissue width

equals 2

In this cycle, three instructionsneed to broadcast their tags!

Page 31: Advanced  Microarchitecture

31

Delayed Tag Broadcast (3)• Possible solutions

1. Have one select for issuing, another select for tag broadcast• messes up timing of data-capture

2. Pre-reserve the bus• select logic more complicated, must track usage in

future cycles in addition to the current cycle3. Hold the issue slot from initial launch until tag

broadcast

Lecture 8: Data-Capture Instruction Schedulers

sch payl exec exec exec

Issue width effectively reduced by one for three cycles

Page 32: Advanced  Microarchitecture

32

Delayed Wakeup• Push the delay to the consumer

Lecture 8: Data-Capture Instruction Schedulers

=

Tag Broadcast forR1 = R2 × R3

R1

=R4

R5 = R1 + R4ready!

Tag arrives, but we waitthree cycles beforeacknowledging it

Also need to know parent’s latency

Page 33: Advanced  Microarchitecture

33

Non-Deterministic Latencies• Problem with previous approaches is that

they assume that all instruction latencies are known at the time of scheduling– Makes things uglier for the delayed broadcast– This pretty much kills the delayed wakeup

approach• Examples

– Load instructions• Latency {L1_lat, L2_lat, L3_lat, DRAM_lat}

– DRAM_lat is not a constant either, queuing delays– Some architecture specific cases

• PowerPC 603 has a “early out” for multiplication with a low-bit-width multiplicand

• Intel Core 2’s divider also has an early outLecture 8: Data-Capture Instruction Schedulers

Page 34: Advanced  Microarchitecture

34

The Wait-and-See Approach• Just wait and see whether a load hits or

misses in the cache

Lecture 8: Data-Capture Instruction Schedulers

Sched PayLd ExecR2 = R1 + #4

Sched PayLd ExecR1 = 16[$sp]

Exec Exec Cache hit known,can broadcast tag

Load-to-Use latencyincreases by 2 cycles(3 cycle load appears as 5)

Scheduler

DL1 TagsDL1Data

May be able todesign cache s.t.hit/miss knownbefore data

Exec Exec Exec

Sched PayLd ExecPenalty reduced to 1 cycle

Page 35: Advanced  Microarchitecture

35

Load-Hit Speculation• Caches work pretty well

– hit rates are high, otherwise caches wouldn’t be too useful

– Just assume all loads will hit in the cache

Lecture 8: Data-Capture Instruction Schedulers

Sched PayLd Exec R2 = R1 + #4

Sched PayLd Exec R1 = 16[$sp]Exec Exec Cache hit,data forwardedBroadcast delayed

by DL1 latency

• Um, ok, what happens when there’s a load miss?

Page 36: Advanced  Microarchitecture

36

Load Miss Scenario

Lecture 8: Data-Capture Instruction Schedulers

Sched PayLd Exec

Sched PayLd Exec Exec ExecBroadcast delayedby DL1 latency

Exec … ExecCache Miss Detected!

Value at cache output is bogusInvalidate the instruction

(ALU output ignored)Sched PayLd ExecRescheduled assuminga hit at the DL2 cache

There could be a miss at the L2, and again at the L3 cache. A single load can waste multiple issuing opportunities.

Each mis-scheduling wastes an issue slot:the tag broadcast bus, payload RAM read port, writeback/bypass bus, etc. could have been used for another instruction

Broadcast delayed by L2 latency

L2 hit

Page 37: Advanced  Microarchitecture

37

Scheduler Deallocation• Normally, as soon as an instruction issues,

it can vacate its scheduler entry– The sooner an entry is deallocated, the sooner

another instruction can reuse that entry leads to that instruction executing earlier

• In the case of a load, the load must hang around in the scheduler until it can be guaranteed that it will not have to rebroadcast its destination tag– Decreases the effective size of the scheduler

Lecture 8: Data-Capture Instruction Schedulers

Page 38: Advanced  Microarchitecture

38

“But wait, there’s more!”

Lecture 8: Data-Capture Instruction Schedulers

Sched PayLd Exec

Sched PayLd Exec Exec ExecDL1 Miss

Sched PayLd ExecSched PayLd ExecSched PayLd Exec

Squash

Not only children getsquashed, there may be grand-children to squash as well

All waste issue slotsAll must be rescheduledAll waste powerNone may leave scheduler until load hit known

Sched PayLd ExSched PayLd ExSched PayLd Ex

Sched PayLd Exec

Page 39: Advanced  Microarchitecture

39

Squashing• The number of cycles worth of dependents

that must be squashed is equal to Cache-Miss-Known latency minus one– previous example, miss-known latency = 3

cycles, there are two cycles worth of mis-scheduled dependents

– Early miss-detection helps reduce mis-scheduled instructions

– A load may have many children, but the issue width limits how many can possibly be mis-scheduled• Max = Issue-width × (Miss-known-lat – 1)

Lecture 8: Data-Capture Instruction Schedulers

Page 40: Advanced  Microarchitecture

40

Squashing (2)• Simple: squash anything “in-flight”

between schedule and execute

Lecture 8: Data-Capture Instruction Schedulers

SchedPayLd Exec Exec ExecSchedPayLd Exec

SchedPayLd Exec

SchedPayLd ExecSchedPayLd Exec

SchedPayLd ExecSchedPayLd Exec

SchedPayLd Exec

• This may include non-dependent instructions

• All instructions must stay in the scheduler for a few extra cycles to make sure they will not be rescheduled due to a squash

Page 41: Advanced  Microarchitecture

41

Squashing (3)• Selective squashing: use “load colors”

– each load is assigned a unique color– every dependent “inherits” its parents’ colors– on a load miss, the load broadcasts its color and

anyone in the same color group gets squashed• An instruction may end up with many

colors

Lecture 8: Data-Capture Instruction Schedulers

… …Explicitly tracking each color wouldrequire a huge number of comparisons

Page 42: Advanced  Microarchitecture

42

Squashing (4)• Can list “colors” in unary (bit-vector) form

Lecture 8: Data-Capture Instruction Schedulers

Load R1 = 16[R2]1 0 0 0 0 0 0 0

Add R3 = R1 + R41 0 0 0 0 0 0 0

Load R5 = 12[R7]1 0 0 0 0 0 00

Load R8 = 0[R1]1 0 1 0 0 0 0 0

Load R7 = 8[R4]100 0 0 0 00

Add R6 = R8 + R71 0 1 0 0 0 01

• Each instruction’s color vector is the bitwise OR of its parents’ vectors

• A load miss now only squashes the dependent instructions

• Hardware cost increases quadratically with number of load instructions

X

X

X

Page 43: Advanced  Microarchitecture

43

Allocation• Allocate in-order,

Deallocate in-order• Very simple!• Smaller effective

scheduler size– instructions may have

already executed out-of-order, but their RS entries cannot be reused

– Can be very bad if a load goes to main memory

Lecture 8: Data-Capture Instruction Schedulers

Circular Buffer

Head

Tail

Tail

Page 44: Advanced  Microarchitecture

44

Allocation (2)• With arbitrary

placement, entries much better utilized

• Allocator more complex– must scan availability and

find N free entries• Write logic more

complex– must route N instructions

to arbitrary entries of the scheduler

Lecture 8: Data-Capture Instruction Schedulers

RS Allocator

Entry availabilitybit-vector

0000100100000111

Page 45: Advanced  Microarchitecture

45

Allocation (3)• Segment the entries

– only one entry per segment may be allocated per cycle

– instead of 4-of-16 alloc (previous slide), each allocator only does 1-of-4

– write logic simplified as well

• Still possible inefficiencies– full segment leads to

allocating less than N instructions per cycleLecture 8: Data-Capture Instruction Schedulers

0010101000010110

AllocAlloc

AllocAlloc

A

B

C

DX

Free RS entries exist, justnot in the correct segment

0