Advanced Microarchitecture

Advanced MicroarchitectureLecture 6: Superscalar Decode and Other Pipelining

2

RISC ISA Format• This should be review…

– Fixed-length• MIPS all insts are 32-bits/4 bytes

– Few formats• MIPS has 3: R-, I-, J- formats• Alpha has 5: Operate, Op w/ Imm, Mem, Branch, FP

– Regularity across formats (when possible/practical)• MIPS, Alpha opcode in same bit-position for all formats• MIPS rs & rt fields in same bit-position for I- and J-

formats• Alpha ra/fa field in same bit-position for all 5 formats

Lecture 6: Superscalar Decode and Other Pipelining

3

RISC Decode (MIPS)

000 001 010 011 100 101 110 111000 func rt j jal beq bne blez bgtz001 addi addi

u slti sltiu andi ori xori lui010 rs rs rs rs011100 lb lh lwl lw lbu lhu lwr101 sb sh swl sw swr110 lwc0 lwc1 lwc2 lwc3111 swc0 swc1 swc2 swc3


opcode6

other21

func5

R-format only

opco

de[5

,3]

opcode[2,0]

001xxx = Immediate

1xxxxx = Memory(1x0: LD, 1x1: ST)

000xxx = Br/Jump(except for 000000)

6

Superscalar Decode for RISC ISAs• To sustain X instructions per cycle, must

decode X instructions per cycle– Just duplicate the hardware


32-bit inst

Decoder

decodedinst

scalar

Decoder Decoder Decoder

32-bit inst

Decoder

decodedinst

superscalar

4-wide superscalar fetch

32-bit inst32-bit inst32-bit inst

decodedinst

decodedinst

decodedinst

1-Fetch

7

VLIW/EPIC ISAs• Compiler finds the parallelism, packs

multiple instructions into a “very long” instruction


AddLoad

BranchSub

StoreXorRISC

IMB Add Load Branch

IMI Sub Store Xor

VLIW (EPIC-like)

“template” bits

8

VLIW Decoder• Similar to superscalar RISC decoder


Decoder

inst0

TemplateDecoder

tmplt

Decoder Decoder

inst1 inst2

decodedinst

decodedinst

decodedinst

9

CISC ISA• RISC focus on fast access to information

– easy decode, I$, large RF’s, D$• CISCs are older

– designed in era with fewer transistors, chips– each memory access very expensive

• pack as much work into as few bytes as possible• more “expressive” instructions

– compare to simple RISC insts– better potential code generation– more complex code generation in practice


10

Example: VAX• Superset of ISAs, incl. IBM360, DEC PDP-11• VAX = “Virtual Address Extension”• 16 32-bit registers• 32-bit memory addressing• Encoding:

– 1-2 byte opcode, followed by 0-6 operand specifiers, each of which may be up to 5 bytes

– Opcode implies datatype, size, # operands– Orthogonality: any opcode with any addressing

mode


11

VAX operand addressingMode Example MeaningRegister Add R4, R3 R4 = R4 + R3Immediate Add R4, #3 R4 = R4 + 3Displacement Add R4, 100(R1) R4 = R4 + Mem[100+R1]Register Indirect

Add R4, (R1) R4 = R4 + Mem[R1]

Indexed/Base Add R3, (R1+R2) R3 = R3 + Mem[R1+R2]Direct/Absolute Add R1, (1234) R1 = R1 + Mem[1234]Memory Indirect

Add R1, @(R3) R1 = R1 + Mem[Mem[R3]]

Auto-Increment Add R1,(R2)+ R1 = R1 + Mem[R2]; R2++

Auto-Decrement

Add R1, -(R2) R2--; R1 = R1 + Mem[R2]Lecture 6: Superscalar Decode and Other Pipelining

Any mode could be applied to any instruction!

12

x86• CISC, stemming from the original 4004• Example: “Move” instructions

1. General Purpose data movement– RR, MR, RM, IR, IM

2. Exchanges– EAX ↔ ECX, byte order within a register

3. Stack Manipulation– push pop R ↔ Stack, PUSHA/POPA

4. Type Conversion5. Conditional Moves


Many ways to do the same/similar operation

13

x86 Encoding• Basic x86 Instruction:


Prefixes0-4 bytes

Opcode1-2 bytes

Mod R/M0-1 bytes

SIB0-1 bytes

Displacement0/1/2/4 bytes

Immediate0/1/2/4 bytes

Longest Inst 15 bytesShortest Inst: 1 byte

• Opcode specifies operation, and if the Mod R/M byte is used– Most instructions use the Mod R/M byte– Mod R/M specifies if optional SIB byte is

used– Mod R/M and SIB may specify additional

constants

14

Mod R/M Byte

• Mode = 00: No-displacement, use Mem[ regmmm ]• Mode = 01: 8-bit displacement, Mem[ regmmm +

SExt(disp) ]• Mode = 10: 32-bit displacement (similar to

previous)• Mode = 11: Register-to-Register, use regmmm


M M r r r m m m

Mode Register R/M

1 of 8 registers

Add EBX, ECX 11 011 001Mod R/M

Add EBX, [ECX] 00 011 001Mod R/M

15

Exceptions• Mod=00, R/M = 5 get operand from 32-

bit immediate– Add EDX =

EDX+Mem[0cff1234]• Mod=00, 01 or 10, R/M = 4 use the “SIB”

byte– SIB = Scale/Index/Base


000101010cff1234

00010100Mod R/M

ss iii bbbSIB

bbb 5: use regbbbbbb = 5: use 32-bit imm(Mod = 00 only)

iii 4: use siiii = 4: use 0

si = regiii << ss

16

Opcode Confusion• There are different opcodes for AB and

BA


10001011 11000011 MOV EAX, EBX

10001001 11000011 MOV EBX, EAX

10001001 11011000 MOV EAX, EBX

• If Opcode = 0F, then use next byte as opcode

• If Opcode = D8-DF, then FP instruction

11011000 11 R/MFP opcode

17

x86 Decode Example


11000111

MOV regimm (use Mod R/M, 32-bit Imm to follow)

opcode10000100

Mod=2 (use 32-bit Disp)R/M = 4 (use SIB)

reg ignored

Mod R/M11000011

SIB

ss=3 Scale by 8use EAX, EBX

Disp Imm

*( (EAX<<3) + EBX + Disp ) = ImmTotal: 11 byte instruction

Note: Add 4 prefixes, andyou reach the max size

18

In RISC (MIPS)1. lui R1 = Disp[31:16]2. ori R1 = R1,

Disp[15:0]3. add R1 = R1 + R24. shli R3 = R3 << 35. add R3 = R3 + R16. lui R1 = Imm[31:16]7. ori R1 = R1,

Imm[15:0]8. st [R3] R1


8 instructions, 32 bits each32 bytes total

2.9x Bigger!

19

x86-64 / EM64T• 816 general purpose registers

– only 3-bit register fields?• Registers extended from 3264 bits each• Default: instructions still 32-bit

– New “REX” prefix byte to specify additional information


0100 m R I B opcode

m=0 64-bit modem=1 32-bit mode

md rrr mrm

R rrr

ss iii bbb

iiiI bbbB

REX

Register specifiers are now 4 bitseach: can choose 1 of 16 registers

20

64-bit Extensions to IA32


IA32+64-bit exts IA32

CPU architect

(Taken from Bob Colwell’s Eckert-MauchlyAward Talk, ISCA 2005)

Ugly? Scary?… but it works

21

x86 Decode Hardware


PrefixDecoder Left Shift

Num Prefixes

opcodedecoder

2nd opcodedecoder

Mod R/Mdecoder

SIBdecoder

Left Shift Left Shift

+ +

Instruction bytes

22

Decoded x86 Format• RISC: easy to expand union of needed

info– generalized opcode (not too hard)– reg1, reg2, reg3, immediate (possibly extended)– some fields ignored

• CISC: union of all possible info is huge– generalized opcode (too many options)– up to 3 regs, 2 immediates– segment information– “rep” specifiers

• would lead to 100’s of bits• common case only needs a fraction a lot of waste


23

x86 RISC-like mops• Each x86 instruction decoded into a variable

number of “uops” (micro-ops - Intel) or ROPs (RISC ops - AMD)– Each uop is RISC-like– Uops have limitations to keep union of info practical


ADD EAX, EBX ADD EAX, EBX 1 uop

ADD EAX, [EBX] Load tmp = [EBX]ADD EAX, tmp 2 uops

ADD [EAX], EBX Load tmp = [EAX]ADD tmp, EBX

STA EAXSTD tmp

4 uops

24

uop Limits• How many uops can a decoder generate?

– For complex x86 insts, many are needed (10’s, 100’s?)

– Makes decoder horribly complex– Typically there’s a limit to keep complexity

under control• One x86 instruction 1-4 uops• Most instructions translate to 1.5-2.0 uops

• Ok, what happens if a complex instruction needs more than 4 uops?


25

UROM/MS for Complex x86 Insts• UROM (mcode-ROM) stores the uop

equivalents for nasty x86 instructions– “Nasty” could be large/complex (> 4 uops like

PUSHA or STRREP.MOV) or obsolete instructions (AAA)

• Microsequencer (MS) is the control logic that interfaces between the post-decode pipestages, the UROM, the decoders and the PC-generation


26

UROM/MS Example (3 uop-wide)


ADDSTORE

SUB

Cycle 1

ADDSTASTD

SUB

Cycle 2

REP.MOVADD [ ]

Fetch- x86 instsDecode - uopsUROM - uops

SUBLOADSTORE

REP.MOVADD [ ]

XOR

Cycle 3

INCmJCCLOAD

Cycle 4

STOREINC

mJCC

Cycle 5

…

Cycle …

……

mJCCREP.MOVADD [ ]

XORLOAD

Cycle n Cycle n+1

ADD

Complex instructions, getuops from mcode sequencer

27

Superscalar CISC Decode1. Instruction Length Decode (ILD)

• Where are the instructions?• Limited decode – just enough to parse prefixes,

modes2. Shift/Alignment

• Get the right bytes to the decoders3. Decode

• Crack into uops


And then do this for N instructions per cycle!

28

ILD Recurrence/Loop• PCi = X• PCi+1= PCi + sizeof( Mem[PCi] )• PCi+2= PCi+1 + sizeof( Mem[PCi+1] )

= PCi + sizeof( Mem[PCi] ) + sizeof( Mem[PCi+1] )

• Can’t find start of next instruction without decoding the first

• Critical loop not pipelineable– ILD of 4 instructions per cycle imples that clock cycle

time will be 4 x latency(ILD)


29

Decode Implementation


Left Shifter

Decoder 3

Cycle

3

Decoder 2Decoder 1ILD dominatescycle time; not

scalable

Instruction Bytes (ex. 16 bytes)ILD (limited decode)

Length 1

Length 2

+Cycle

1

Inst 1 Inst 2 Inst 3 Remainder

Cycle

2

+

Length 3

bytesdecoded

ILD (limited decode)ILD (limited decode)

30

Hardware-Intensive Decode


Decode from everypossible instructionstarting point!

Giant MUXes toselect instructionbytes

ILDDecoder

ILDILDILDILDILDILDILDILDILDILDILDILDILDILDILD

DecoderDecoder

31

ILD in Hardware-Intensive Approach


6 bytes

4 bytes

3 bytes+

Total bytes decode = 11Previous: 3 ILD + 2add

Now: 1ILD + 2(mux+add)

Deco

der

Deco

der

Deco

der

Deco

der

Deco

der

Deco

der

Deco

der

Deco

der

Deco

der

Deco

der

Deco

der

Deco

der

Deco

der

Deco

der

Deco

der

Deco

der

32

Predecoding• ILD loop is hardware intensive, impacts

latency, and can consume substantial power

• Observation: when instructions A, B and C are decoded into lengths 3, 5 and 1, the next time we encounter A, B and C, their lengths will still be the same!– cache the ILD work– do once, reuse many times


33

Decoder Example: AMD K5


Predecode Logic

From Memoryb0 b1 b2 … b78 bytes

I$

b0 b1 b2 … b7

8 bits

+5 bits13 bytes

Decode16 (8-bit inst + 5-bit predecode)

8 (8-bit inst + 5-bit predecode)

Up to 4 ROPs

34

Decoder Example: AMD K5• Predecode information makes decode

easier– Instruction start/end location (ILD)– Number of ROPs needed per inst– Opcode and prefix locations

• Power/performance tradeoffs– Larger I$ (increase data by 62.5%)

• Longer I$ latency, More I$ power consumption– Remove logic from decode

• Shorter branch mispred penalty, simpler logic• Cache and reused decode work less decode power

– Longer effective IL1 miss latency


35

Limits on Decode• Max branches (color allocation)• Taken branches• Incomplete instructions

– x86 insts are not aligned, may span two cache lines

– can’t decode until both halves have been fetched

• Instruction complexity– decoding “complex” (2-4 uop) instructions

requires a more complex decoder; expensive to replicate

– compromise: fewer complex decoders plus simpler decoders for instructions with single-uop mappings


36

Decoder Example: Intel P-Pro


16 Raw Instruction Bytes

Decoder0

Decoder1

Decoder2

BranchAddress

Calculator

Fetch resteerif needed

mROM

4 uops 1 uop 1 uop

If instruction in Decoder 1 or 2 requires > 1 uop, do not generateany output, and then shift to Decoder to the left on next cycle

Only Decoder 0 can interface with the uROM and MS

37

Decoder Example: Intel P4


L2 CacheRaw instruction bytes

4-uop DecoderuROM

trace const. buffer

Trace Cache

Decode at mostone inst per cycle

Fetch up to 3 uopsper cycleP4 has a strangled front-end, at best it

can only deliver 3 uops per cycle; contrast to P-Pro that can deliver up to 6 uops per cycle (if they’re 4/1/1)

More on this when westudy the P4 in detail

38

Pipeline Control


Dispatch ROB, LSQ, RSfull, stall front-end

Disp

ROB, LSQ, RSfull, stall front-end

RenDecDecRotILDI$I$BPred

Except not everyone stalls…

This logic starting toget pretty intense

DecodeFetch

39

Just because there’s a stall conditionsomewhere does not imply that

everybody has to stall

Non-Uniform Stall


Disp

nops due to I$ miss

ROBFull!

not fullfull

fullfullfull

RenDecDecRotILDI$I$BP

40

Compressing/Serpentine Pipelines


DispROBFull!

1 entryfree

RenDecDecRotILDI$I$BPBetter “flow”, but much more

complex since need to track howmany insts can advance per stage

41

Lots o’ Stalls• I$ miss, ITLB miss• Decoder limitations

– x86 4-1-1 limit– branch limits (max/cycle, max taken/cycle)

• Renamer – out of physical registers


42

Smaller Control Domains• Separate long pipeline into multiple smaller

pipelines


BPred I$ I$ Dec Dec Dec Ren Alloc Sched

BPred I$ I$ Dec Dec Dec

Ren Alloc Sched

43

Smaller Control Domains (2)

• Non-decoupled pipe needed logic to simultaneously control ~10 stages

• Decoupled pipe needs multiple control logic circuits– each only needs to interact with ~5 stages (~3

real stages, plus the queue ahead and behind)


Pipeline ControlLogic

Non-decoupledI$ Dec Dec Dec Ren

Pipeline ControlLogic

Decoupled

XX

No direct control logicfor stages outside of

local pipeline

44

Smaller Control Domains (3)• Queues can effectively add more pipeline

stages


cycleboundary

previous stage next stage

Inter-pipe queueenqueue

logic latch dequeuelogic

previous stage next stage

• Avoid this by writing and reading in the same cycle (affects timing, complexity)

45

Queues provide Smoothing• Approximation to serpentine pipes

(compress only at certain locations – i.e., the queues)

• Different levels of decoupling possible depending on frequency target, power, complexity tolerance


BPred I$ I$ Dec Ren Sched

The “SimpleScalar” pipeline

Note: RS is effectivelya queue (more later)

46

Different Clocking Domains• Decoupling the pipe allows each segment

to operate independently (local control)• Also means each can run at different

speeds (P4)


TC TC Dec … Alloc

Sched … WB(ROB)

Commit(IAQ)

(uopQ)1x freq

(3 uops/Mclk)½x freq(6 uops/2Mclk’s 3 uops/Mclk)

2x freq(2 uops/Fclk 4 uops/Mclk)

1x freq(3 uops/Mclk)

Documents

Advanced Microarchitecture