
Transcript
Page 1: Advanced  Microarchitecture

Advanced Microarchitecture
Lecture 10: ALUs and Bypass

Page 2: Advanced  Microarchitecture

This Lecture: Execution Datapath
• ALUs
• Scheduler to Execution Unit interface
• Execution unit organization
• Bypass networks
• Clustering

Lecture 10: ALUs and Bypassing

Page 3: Advanced  Microarchitecture

ALUs
• ALU: Arithmetic Logic Units
• FU: Functional Units
• EU: Execution Units

What’s the difference?

[Figure: an “ALU” block with Opcode, Operand1, and Operand2 inputs and a Result output; labels: Adder, ALU, Shift, Adder, Div, Logic, Mult]

Implementation details, algorithms, etc. of adders, multipliers, dividers not covered in this course

Page 4: Advanced  Microarchitecture

Interfacing ALUs to the Scheduler
• Issue N instructions
• Read N sets of operands, immediates, opcodes, destination tags
• Route to correct functional units

[Figure labels: Fetch & Dispatch; ARF; PRF/ROB; Data-Capture Scheduler; Functional Units; Physical register update; Bypass]

Page 5: Advanced  Microarchitecture

Data-Capture Payload RAM

Effectively one nasty crossbar

[Figure labels: Select Logic; “Select decisions, port bindings, etc.”; Payload RAM entries (opcode, ValL, ValR); Issue Port 0, Issue Port 2, Issue Port 3; Issue Lane 0 to Issue Lane 3]

Page 6: Advanced  Microarchitecture

“Register File” Organization

Each RF read port input has a 1-to-1 correspondence with one and only one RF read port output

No MUXing of outputs is required

[Figure: the Payload RAM drawn as a Register File; select 0 to select 3 present the read addresses “R1”, “R7”, “R3”, “R4”, and the outputs val(R1), val(R7), val(R3), val(R4) feed Issue 0 to Issue 3]

Page 7: Advanced  Microarchitecture

“Register File” Is Overkill

But how do you assign which set of data gets routed to which set of read port outputs?

[Figure labels: Select (×4); SRAM Row Decoders; RS entries; Payload RAM]

Page 8: Advanced  Microarchitecture

Execution Lane ↔ Select Binding

Payload RAM read port outputs are in the same order as the Select Blocks

[Figure labels: Select (×4); RS entries; Payload RAM]

Page 9: Advanced  Microarchitecture

Single Entry Close-Up

One RS entry can only bid on one select port, so its payload is never driven to more than one port

Each select port only gives the grant to a single RS entry, so more than one payload entry can never drive the same payload output port

[Figure: a single RS entry (Opcode, Src L, Src R) with bid 0 to bid 3 going to Select Port 0 to Select Port 3 and grant 0 to grant 3 coming back; each grant enables a Tri-State Driver onto output buses connected to all payload RAM entries]
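The two invariants on this slide can be sketched in a few lines of Python. This is only an illustrative model, not the circuit from the lecture: the names `select` and `drive_payload`, the lowest-index-wins grant policy, and the 4-port width are all assumptions made for the example.

```python
from typing import Dict, List, Optional, Tuple

NUM_PORTS = 4  # assumed issue width for this sketch

def select(bids: List[Optional[int]]) -> Dict[int, int]:
    """bids[i] is the one select port that RS entry i bids on (or None).
    Returns {port: granted_entry}; each port grants to at most one entry."""
    grants: Dict[int, int] = {}
    for entry, port in enumerate(bids):
        if port is not None and port not in grants:
            grants[port] = entry  # lowest-index bidder wins (assumed policy)
    return grants

def drive_payload(payload: List[Tuple], grants: Dict[int, int]) -> List[Optional[Tuple]]:
    """Model the tri-state output buses: a bus is driven only by the entry
    that holds the grant for that port; ungranted buses stay undriven (None)."""
    buses: List[Optional[Tuple]] = [None] * NUM_PORTS
    for port, entry in grants.items():
        buses[port] = payload[entry]
    return buses

# Each payload entry holds (opcode, ValL, ValR).
payload = [("add", 1, 2), ("shift", 3, 1), ("mul", 4, 5), ("add", 6, 7)]
bids = [0, 2, 0, None]          # entries 0 and 2 both bid on port 0
grants = select(bids)
buses = drive_payload(payload, grants)
```

Because an entry bids on exactly one port and a port grants exactly one entry, no bus ever has two drivers and no entry appears on two buses, which is what lets the payload RAM use shared tri-state buses instead of output MUXes.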

Page 10: Advanced  Microarchitecture

Need to “Swizzle” at the End

Nasty tangle of wires (Src’s are 64-128 bits each!)

[Figure labels: Opcode Silo; Src L Silo; Src R Silo]

Page 11: Advanced  Microarchitecture

Non-Data-Capture Scheduler

[Figure labels: Select (×4); RS entries; Payload RAM; Src L tags; Src R tags; Register File Row Decoders; Register File SRAM Array]

Page 12: Advanced  Microarchitecture

Immediate Values
• Data-capture can store immediate values in payload bay
• Non-DC needs separate storage
  – Could add extra field to payload
  – Could allocate a physical register and store the immediate there
  – Could store in a separate “immediate file”

Page 13: Advanced  Microarchitecture

Distributed Scheduler
• Grant/Payload read lines may have to travel further horizontally (multiple RS widths)
• Schedule→Execute latency less critical than Schedule→Schedule (wakeup-select) loop latency

[Figure labels: Select 0 to Select 3; Payload RAM; FUs: FAdd, FM/D, ALU1, ALU2, M/D, Store, Shift, Load, FP-Ld, FP-St]

Page 14: Advanced  Microarchitecture

Naive ALU Organization
• Besides making scheduling hard to scale, arbitrary any-issue-to-any-ALU routing makes operand routing a horrible mess (needs full crossbar)

[Figure: operands arrive From Payload/RF Read Ports; FUs: add, shift, mult, div, load, store, Fadd, FMul, FDiv]

Page 15: Advanced  Microarchitecture

Execution-Port-Based Layout
• Just need to fan-out data to FUs within the same execution lane; no crossbar needed
• Each FU needs a “valid” input to know that the incoming data is meant for it and not another FU in the same lane
  – Or just let them all compute in parallel and use only the output that you want (at the cost of wasted power)

[Figure labels: Lane 0, Lane 1, Lane 2, Lane 3; FUs: add, add, shift, mult, div, store, load, FP ld, FPCvt]

Page 16: Advanced  Microarchitecture

Bypass Network Organization

N = Issue Width, f = Num FUs
O(f²N) area just for the bypass wiring!!!
… which is cubic since f = Ω(N)
Previous slide had f = 9 FUs, and that didn’t even include all of the FP units

[Figure: FUs (add, shift, mult, div) fed From Payload RAM/Register File; labels: f × 64 bits (twice); N × 2 sets of inputs]

Page 17: Advanced  Microarchitecture

ALU Stacks

Bypass MUXes reduced to one pair per ALU stack (as opposed to one per FU)

[Figure labels: From Payload/RF; Integer Bypass; Floating Point Bypass; Bypass FU Fan-Out; FUs: add, add, shift, mult, div, store, load, FP ld, FPCvt, FP st, Fadd, Fmul, Fdiv]

Page 18: Advanced  Microarchitecture

Bypass Sharing

Bypass wiring reduced to one output per execution lane/ALU stack

[Figure labels: From Payload/RF; Integer Bypass; Floating Point Bypass; Bypass FU Fan-Out; Local FU Output; FUs: add, add, shift, mult, div, store, load, FP ld, FPCvt, FP st, Fadd, Fmul, Fdiv]

Page 19: Advanced  Microarchitecture

Bypass Sharing (2)
• If all FU’s in a stack have the same latency, writeback conflicts are impossible
  – because only one instruction can issue to each lane per cycle
• But not all FU’s have the same latency:

2-cycle shift, to Lane 1:  S  X  X  E1 E2
1-cycle add, to Lane 1:       S  X  X  E

Two instructions want to write back using the same bypass path!

[Figure labels: add, shift, load]

Page 20: Advanced  Microarchitecture

Bypass Sharing (3)
• How to resolve this structural hazard?
  – Obvious solution: stall
    • Creates scheduling headaches
  – Treat bypass/WB as another structural resource
    • Separate select logic* for bypass allocation

*Not same as regular select logic, just a table read/write

[Figure: timing over cycles 0 to 5 for the 2-cycle shift and the 1-cycle add, both to Lane 1, next to a Writeback Scoreboard indexed by cycle (0 to 6) with one slot marked X; labels: To Bypass]
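As a rough illustration of "just a table read/write", the writeback scoreboard for one lane can be modeled as a small table of future cycles. The class name `WritebackScoreboard`, the circular-buffer layout, and the `horizon` parameter are assumptions for this sketch; the fixed schedule-to-execute stages are left out because they shift every instruction by the same amount.

```python
class WritebackScoreboard:
    """One lane's bypass/writeback slot reservations, indexed by future cycle."""

    def __init__(self, horizon: int = 8):
        # horizon must exceed the longest FU latency in the lane (assumption)
        self.horizon = horizon
        self.busy = [False] * horizon  # circular buffer: slot = cycle % horizon

    def try_reserve(self, issue_cycle: int, latency: int) -> bool:
        """Table read then write: claim the slot in which the result would
        reach the shared bypass output; refuse if it is already taken."""
        slot = (issue_cycle + latency) % self.horizon
        if self.busy[slot]:
            return False
        self.busy[slot] = True
        return True

    def release(self, cycle: int) -> None:
        """Free the slot once that cycle's writeback has happened."""
        self.busy[cycle % self.horizon] = False

sb = WritebackScoreboard()
shift_ok = sb.try_reserve(issue_cycle=0, latency=2)   # result due in slot 2
add_ok = sb.try_reserve(issue_cycle=1, latency=1)     # also slot 2: conflict
add_retry_ok = sb.try_reserve(issue_cycle=2, latency=1)  # slot 3 is free
```

The add is refused in cycle 1 and succeeds one cycle later, which mirrors the shift/add conflict on the previous slide.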

Page 21: Advanced  Microarchitecture

Bypass Sharing (4)

A: 2-cycle shift, to Lane 1
B: 1-cycle add, to Lane 1
C: 3-cycle load, to Lane 1

Wasted issue opportunity: B picked by select, but cannot issue due to WB conflict
C could have issued, but is stalled by one cycle

[Figure: timing diagram over cycles 0 to 8 for A, B, and C (stages S, X, X, E…) with the Select decision and the writeback scoreboard; exact cycle alignment is not recoverable from the extraction]

Page 22: Advanced  Microarchitecture

Bypass Critical Path

Total wire length is about twice the total width plus twice the total height

[Figure labels: FUs: add, add, shift, mult, div, store, load, FP ld, FPCvt, FP st, Fadd, Fmul, Fdiv]

Page 23: Advanced  Microarchitecture

Bypass Critical Path (2)

Each execution lane/ALU stack is self-contained
Longest path only crosses total width once

[Figure labels: FUs: add, add, shift, mult, div, store, load, FP ld, FPCvt]

Page 24: Advanced  Microarchitecture

Bypass Control Problem
• We now have the datapaths to forward values between ALUs/FUs
• How do we orchestrate what goes where and when?
• In particular, how do we set the controls of each of the bypass MUXes on a cycle-by-cycle basis?

Page 25: Advanced  Microarchitecture

Scoreboarding
• For each value produced, make note (in the scoreboard) of where it will be available
• For each source, consult scoreboard to find out how to rendezvous

Port 1: ADD P21 = …
Port 0: ADD P17 = P21 + P4
Port 2: MUL P30 = P21 * P17

[Figure: timing diagram over cycles 0 to 7 (stages S, X, X, E) for the three instructions, with scoreboard entries feeding the add and mul units; the remaining cell labels are not recoverable from the extraction]

Page 26: Advanced  Microarchitecture

Scoreboarding (2)
• Setting bypass controls is easy
  – Read where the value will come from and feed to bypass MUXes in the operand read stage
• May add schedule→execute stages for data-capture scheduler
  – why not for non-data-capture?

[Figure labels: Payload (src tags): P21, P4; WB Scoreboard; R1; add]
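A minimal sketch of how scoreboard reads could set the bypass MUX controls, assuming a table keyed by physical register tag. The class and method names are hypothetical, and the model deliberately ignores the cycle-level timing of when each update must land, which the following slides treat as the hard part.

```python
RF = "RF"  # marker meaning "read the operand from payload/register file"

class BypassScoreboard:
    """Maps a physical register tag to where its value can be picked up."""

    def __init__(self):
        self.where = {}  # tag -> producing lane; a missing tag means RF

    def on_schedule(self, dest_tag: str, lane: int) -> None:
        # Producer scheduled: its value will appear on this lane's bypass.
        self.where[dest_tag] = lane

    def on_writeback(self, dest_tag: str) -> None:
        # Value has reached the RF: stop steering consumers to the bypass.
        self.where.pop(dest_tag, None)

    def mux_select(self, src_tag: str):
        # Read in the operand-read stage; the result drives the bypass MUX.
        return self.where.get(src_tag, RF)

sb = BypassScoreboard()
sb.on_schedule("P21", lane=1)                 # Port 1: ADD P21 = ...
left, right = sb.mux_select("P21"), sb.mux_select("P4")  # sources of P17
sb.on_writeback("P21")
later = sb.mux_select("P21")
```

Here the consumer `ADD P17 = P21 + P4` would take P21 from lane 1's bypass and P4 from the register file; after P21 writes back, later consumers read it from the RF.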

Page 27: Advanced  Microarchitecture

Scoreboarding (3)
• Updating can be more complicated
• Depends on when SB read occurs w.r.t. operand reading
  – earlier reads cause more disconnect

Assume SB read in 1st cycle after schedule
A needs to update SB this cycle for C to correctly source its operand

[Figure: timing rows for instructions A, B, C (S X X E1 E2 E3; S X X E; S X X E); labels: Value bypassed, WB to RF; Value read from RF]

Page 28: Advanced  Microarchitecture

Scoreboarding (4)
• Scoreboard can become a critical timing bottleneck
  – All sources must read from scoreboard
  – All destinations must update scoreboard
    • Once at schedule to indicate bypass location
    • Once later to indicate value has written back to RF
  – ~ 4×N ports for the scoreboard!
• If scoreboard becomes multi-cycle, things can get really crazy
  – need to bypass scoreboard reads/writes like inter-group rename bypassing

Page 29: Advanced  Microarchitecture

CAM-based Bypass
• Extend data-capture concept to bypass network

[Figure: the operand’s Register Tag is compared (=) against the Result Tag broadcast by each of Lane 0 to Lane 3; a match asserts Use Lane 0 … Use Lane 3 and selects that lane’s Result Value, otherwise Use PL/RF selects the Register Value from Payload/RF]

Page 30: Advanced  Microarchitecture

CAM-based Bypass (2)
• Must carry destination tag to execution and broadcast along with result
  – But you have to do this anyway; need the destination tag for RF writeback
• A lot of CAM logic
  – Costs power and area
  – Control is simple: it’s basically control-less
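The tag-match selection can be sketched as a small function. The name `cam_bypass` and the tuple encoding of a lane broadcast are assumptions for illustration; in hardware the comparisons happen in parallel, whereas this model simply loops over the lanes.

```python
from typing import List, Optional, Tuple

def cam_bypass(src_tag: int, rf_value: int,
               lane_results: List[Optional[Tuple[int, int]]]) -> Tuple[int, str]:
    """lane_results[i] is the (result_tag, result_value) pair broadcast by
    lane i this cycle, or None if the lane produced nothing.
    Returns the operand value and a label for the source that supplied it."""
    for lane, result in enumerate(lane_results):
        if result is not None and result[0] == src_tag:
            return result[1], f"lane{lane}"   # "Use Lane i"
    return rf_value, "PL/RF"                  # no match: "Use PL/RF"

broadcasts = [None, (21, 99), None, (17, 5)]  # lanes 1 and 3 produce results
hit = cam_bypass(src_tag=21, rf_value=0, lane_results=broadcasts)
miss = cam_bypass(src_tag=4, rf_value=7, lane_results=broadcasts)
```

No scoreboard is consulted: the destination tag travelling with each result is the only control, which is the sense in which the slide calls the scheme control-less.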

Page 31: Advanced  Microarchitecture

Writeback to Data-Capture
• Looks very similar to bypass CAM

[Figure: Payload of DC Scheduler; each entry’s SrcL and SrcR tags are compared (=) against the results from Exec Lane 0 to Exec Lane 3 to fill ValL and ValR]

Page 32: Advanced  Microarchitecture

PRF Writeback Latency

A: ADD P21 = …
B: ADD P17 = P21 + …
C: MUL P30 = P21 × P17

Problem: How does C pick up the value of P21?

[Figure: Physical Register File (3-cycle write latency) and Bypass Network, with A, B, and C shown at successive cycles]

Page 33: Advanced  Microarchitecture

Multi-Level Bypass
• Bypass network must cover the latency of the writeback operation
  – If WB requires N cycles, then bypass must be able to source all N cycles worth of results

3-level Bypass
But this is only for one ALU (or ALU stack)

[Figure labels: Physical Register File; From PL/RF; instructions A, B, C at successive bypass levels]
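One way to picture a multi-level bypass for a single ALU stack is as a short shift register of recent results sitting in front of the register file. The sketch below is an assumed behavioral model (class name, `tick`/`read` methods, and a depth of 3 are illustrative choices), not a description of any specific design.

```python
from collections import deque

class MultiLevelBypass:
    """Results spend `depth` cycles in the bypass levels, then land in the RF."""

    def __init__(self, depth: int = 3):
        # index 0 holds the newest result, index depth-1 the oldest
        self.levels = deque([None] * depth, maxlen=depth)
        self.rf = {}

    def tick(self, result=None) -> None:
        """Advance one cycle. `result` is the (tag, value) produced this
        cycle, or None. The oldest level finishes its RF write and drops out."""
        oldest = self.levels[-1]
        if oldest is not None:
            tag, value = oldest
            self.rf[tag] = value
        self.levels.appendleft(result)  # maxlen discards the oldest level

    def read(self, tag):
        """Source an operand: search the bypass levels first, then the RF."""
        for level in self.levels:
            if level is not None and level[0] == tag:
                return level[1]
        return self.rf.get(tag)

byp = MultiLevelBypass(depth=3)
byp.tick(("P21", 5))     # A produces P21
byp.tick(("P17", 9))     # B produces P17
byp.tick(None)
from_bypass = byp.read("P21")      # still in flight to the RF
in_rf_yet = "P21" in byp.rf
byp.tick(None)
from_rf = byp.read("P21")          # write has completed
```

Without the deeper levels, a consumer such as C that reads P21 two cycles after it was produced would find it neither on the first-level bypass nor in the register file.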

Page 34: Advanced  Microarchitecture

Superscalar, Multi-Level Bypass

[Figure labels: ALU Stack 0; ALU Stack 1; ALU Stack 2; AL…; 3-cycle PRF WB latency]

Page 35: Advanced  Microarchitecture

A Bit More Hierarchical

[Figure labels: ALU Stack 0; ALU Stack 1; ALU Stack 2; ALU Stack 3; To Physical Register Writeback]

Page 36: Advanced  Microarchitecture

Bypass Network Complexity
• Parameters
  – N = Issue width
  – f = Number of functional units
  – b = bit width of data* (e.g., 32 bits, 64 bits)
  – D = Network depth (RF write latency)
• Metrics
  – Area
  – Latency
  … Both contribute directly to power

*For CAM-based bypass logic, should include tag width as well

Page 37: Advanced  Microarchitecture

Bypass Network Complexity (Area)
• Width
  – 2×(N+D) + 1 inputs at b bits each
  – Replicated N times
  – Total 2N²b + Nb(D+1)
• Height
  – N values at b bits each, times D levels
  – MUXes: O((D-1)×(lg N) + lg(N+D))
  – Assume FUs per ALU stack is constant: f/N = O(1)
  – Total O(NDb)
• Total Area
  – O(N³b²D + N²b²D²)
  – Cubic in N, Quadratic in D and b

[Figure: N ALU stacks (ALU Stack 0 shown); each output passes through an O(f/N)-to-1 MUX of O(lg(f/N)) height producing 1 value; bypass levels carry N values each through O(lg N) MUXes; the final MUX has N+D inputs and O(lg(N+D)) height]
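The scaling claims can be sanity-checked numerically by multiplying the slide's width and height totals. This is a rough sketch only: the function names are made up, the width uses the slide's stated total as written, and the hidden constant in the O(NDb) height is assumed to be 1.

```python
def bypass_width(N: int, D: int, b: int) -> int:
    # Width total as stated on the slide: 2N^2 b + N b (D+1)
    return 2 * N * N * b + N * b * (D + 1)

def bypass_height(N: int, D: int, b: int) -> int:
    # O(N D b); the constant factor is assumed to be 1 for this sketch
    return N * D * b

def bypass_area(N: int, D: int, b: int) -> int:
    return bypass_width(N, D, b) * bypass_height(N, D, b)

# Doubling the issue width at large N: area grows by close to 2^3 = 8.
ratio_N = bypass_area(2048, 3, 64) / bypass_area(1024, 3, 64)

# Doubling the data width: area grows by exactly 2^2 = 4.
ratio_b = bypass_area(8, 3, 128) / bypass_area(8, 3, 64)
```

Expanding the product gives 2N³b²D + N²b²D(D+1), which matches the slide's O(N³b²D + N²b²D²).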

Page 38: Advanced  Microarchitecture

Bypass Network Complexity (Delay)
• ALU output to 1st latch
  – O(lg(f/N)) gates for the MUX
  – O(N+D) wire delay horizontally
  – O(f/N + lg(N+D)) wire delay vertically
• Last latch to ALU input
  – O(N+D) wire horizontally
  – O(lg N) gate delay for 1st MUX
  – O(N + lg N) wire delay vertically
  – O(lg(N+D)) gate delay
• Gate Delay (worse of the two)
  – O(lg(N+D)) or O(lg(f/N))
• Wire Length (ditto)
  – O(N + D + f/N)
  – Unbuffered wire has quadratic delay

[Figure: N ALU stacks (ALU Stack 0 shown); each output passes through an O(f/N)-to-1 MUX of O(lg(f/N)) height producing 1 value; bypass levels carry N values each through O(lg N) MUXes; the final MUX has N+D inputs and O(lg(N+D)) height]

Page 39: Advanced  Microarchitecture

Bypass Network Complexity*

* Complexity analysis is entirely dependent on the layout assumptions.

For example, hierarchical vs. non-hierarchical bypass organizations lead to different areas, wire lengths and gate delays

When someone says “this circuit’s area scales quadratically with respect to X”, this really means that “this circuit’s area scales quadratically with respect to X assuming a layout style of Z”

Page 40: Advanced  Microarchitecture

ALU Clustering
• The exact distribution of FUs to ALU stacks and/or select binding groups can affect layout
• Already saw how separation of INT and FP stacks reduces unnecessary datapaths
  – Has additional benefits when bits(INT) != bits(FP)
  – Ex. x86 uses 32/64-bit integers, but internally uses 80-bit FP
  – SSE introduced 128-bit packed SIMD values, but normal GPRs are still only 64 bits wide
• Certain instructions do not generate outputs (branches)
• Memory instructions treated differently (outputs go to LSQ), and stores don’t generate a register result

Page 41: Advanced  Microarchitecture

Clustered Microarchitectures
• Bypass network delays scale poorly
• Scheduling delays scale poorly
• RF delays scale poorly
• Partition into smaller control and data domains

Page 42: Advanced  Microarchitecture

Clustered Scheduling

Cross-cluster wakeup may take > 1 cycle

[Figure: four clusters, each with RS Entries (Cluster 0 to Cluster 3), Payload0 to Payload3, and FUs, joined by a Cross-Cluster Wakeup Interconnection Network; one column is outlined as Execution Cluster 0]

Page 43: Advanced  Microarchitecture

Cross-Cluster Wakeup Delay

Normally takes 3 cycles (assume all 1-cycle latencies)
2 cluster, round-robin cluster assignment: now it takes 5 cycles
But a different clustering algorithm only needs 3!

[Figure: the same dependence graph over instructions A, B, C, D, E drawn three times (unclustered, round-robin clusters, alternative clustering), with Cross-Cluster Wakeup edges marked]
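The effect of cluster assignment on a dependence chain can be reproduced with a small model. The dependence graph below is hypothetical (the slide's actual graph is not recoverable from the transcript); it was chosen so that it yields the same 3, 5, and 3 cycle counts. The function name, the 1-cycle cross-cluster penalty, and the absence of any issue-width limit are likewise assumptions.

```python
def completion_cycles(deps, cluster_of, cross_penalty=1):
    """deps maps each instruction to its producers, listed in program order.
    Every instruction has 1-cycle latency; a consumer whose producer sits in
    another cluster sees that result `cross_penalty` cycles later."""
    finish = {}
    for instr, producers in deps.items():
        ready = 0
        for p in producers:
            extra = cross_penalty if cluster_of[p] != cluster_of[instr] else 0
            ready = max(ready, finish[p] + extra)
        finish[instr] = ready + 1
    return max(finish.values())

# Hypothetical graph: chain A -> B -> C plus an independent chain D -> E.
deps = {"A": [], "B": ["A"], "C": ["B"], "D": [], "E": ["D"]}

one_cluster = dict.fromkeys(deps, 0)
round_robin = {"A": 0, "B": 1, "C": 0, "D": 1, "E": 0}
chain_aware = {"A": 0, "B": 0, "C": 0, "D": 1, "E": 1}

baseline = completion_cycles(deps, one_cluster)
rr = completion_cycles(deps, round_robin)
better = completion_cycles(deps, chain_aware)
```

Round-robin assignment puts every producer-consumer pair of the long chain in different clusters, so each hop pays the wakeup penalty; keeping each chain inside one cluster removes the penalty entirely for this graph.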

Page 44: Advanced  Microarchitecture

Cross-Cluster Bypass

Delay issues similar to those for scheduling
Values may take > 1 cycle to get from cluster to cluster

[Figure labels: Payload0 (×4); FUs (×4); Cross-Cluster Bypass Network]

Page 45: Advanced  Microarchitecture

Cross-Cluster Bypass (2)
• So do we have to pay X-cluster penalties once at schedule and again at bypass?

A:  S  X  X  X  X  E
B:        S  X  X  X  X  E

B schedules 2 cycles after A due to extra cycle of wakeup delay

Penalties are not additive!

This assumes that the Wakeup Delay (Ci→Cj) is equal to the Bypass Delay (Ci→Cj)
If true for all i and j, then bypass and wakeup delays are always overlapped
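The overlap argument can be checked with a little cycle arithmetic. The function below is an assumed model: `stages` counts the X stages between schedule and execute (4 in the diagram above), and the consumer executes at the later of the cycle its delayed wakeup allows and the cycle the forwarded value actually arrives.

```python
def consumer_execute_cycle(sched: int, stages: int, latency: int,
                           wakeup_delay: int, bypass_delay: int) -> int:
    """Cycle in which a dependent instruction in another cluster executes,
    given the cycle `sched` in which its producer was scheduled."""
    # Consumer is scheduled latency + wakeup_delay cycles after the producer,
    # then passes through the same pipeline stages before executing.
    exec_by_wakeup = sched + latency + wakeup_delay + stages + 1
    # Producer finishes executing in cycle sched + stages + latency; the value
    # is usable one cycle later, plus any cross-cluster bypass delay.
    value_arrives = sched + stages + latency + 1 + bypass_delay
    return max(exec_by_wakeup, value_arrives)

same_cluster = consumer_execute_cycle(0, stages=4, latency=1,
                                      wakeup_delay=0, bypass_delay=0)
cross_cluster = consumer_execute_cycle(0, stages=4, latency=1,
                                       wakeup_delay=1, bypass_delay=1)
penalty = cross_cluster - same_cluster
```

With equal wakeup and bypass delays the two terms inside `max` coincide, so the consumer loses one cycle rather than two.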

Page 46: Advanced  Microarchitecture

Clustered RFs
• Place 1/nth of the physical registers in each cluster
  – How to partition?
  – ARF/PRF: read at dispatch, extra latency may require more levels of bypassing
  – Unified PRF: latency may make sched→exec delay intolerable (replay penalty too expensive), plus all of the bypassing
• Replicate PRF
  – Keep a full copy of the register file in each cluster
  – Reduces per cluster read port requirements
  – Still need to write to all clusters (each cluster needs full set of write ports)

