44
Advanced Microarchitecture Lecture 6: Superscalar Decode and Other Pipelining

Advanced Microarchitecture

  • Upload
    vince

  • View
    42

  • Download
    0

Embed Size (px)

DESCRIPTION

Advanced Microarchitecture. Lecture 6: Superscalar Decode and Other Pipelining. RISC ISA Format. This should be review… Fixed-length MIPS all insts are 32-bits/4 bytes Few formats MIPS has 3: R-, I-, J- formats Alpha has 5: Operate, Op w/ Imm, Mem, Branch, FP - PowerPoint PPT Presentation

Citation preview

Page 1: Advanced  Microarchitecture

Advanced MicroarchitectureLecture 6: Superscalar Decode and Other Pipelining

Page 2: Advanced  Microarchitecture

2

RISC ISA Format• This should be review…

– Fixed-length• MIPS all insts are 32-bits/4 bytes

– Few formats• MIPS has 3: R-, I-, J- formats• Alpha has 5: Operate, Op w/ Imm, Mem, Branch, FP

– Regularity across formats (when possible/practical)• MIPS, Alpha opcode in same bit-position for all formats• MIPS rs & rt fields in same bit-position for I- and J-

formats• Alpha ra/fa field in same bit-position for all 5 formats

Lecture 6: Superscalar Decode and Other Pipelining

Page 3: Advanced  Microarchitecture

3

RISC Decode (MIPS)

000 001 010 011 100 101 110 111000 func rt j jal beq bne blez bgtz001 addi addi

u slti sltiu andi ori xori lui010 rs rs rs rs011100 lb lh lwl lw lbu lhu lwr101 sb sh swl sw swr110 lwc0 lwc1 lwc2 lwc3111 swc0 swc1 swc2 swc3

Lecture 6: Superscalar Decode and Other Pipelining

opcode6

other21

func5

R-format only

opco

de[5

,3]

opcode[2,0]

001xxx = Immediate

1xxxxx = Memory(1x0: LD, 1x1: ST)

000xxx = Br/Jump(except for 000000)

Page 4: Advanced  Microarchitecture

6

Superscalar Decode for RISC ISAs• To sustain X instructions per cycle, must

decode X instructions per cycle– Just duplicate the hardware

Lecture 6: Superscalar Decode and Other Pipelining

32-bit inst

Decoder

decodedinst

scalar

Decoder Decoder Decoder

32-bit inst

Decoder

decodedinst

superscalar

4-wide superscalar fetch

32-bit inst32-bit inst32-bit inst

decodedinst

decodedinst

decodedinst

1-Fetch

Page 5: Advanced  Microarchitecture

7

VLIW/EPIC ISAs• Compiler finds the parallelism, packs

multiple instructions into a “very long” instruction

Lecture 6: Superscalar Decode and Other Pipelining

AddLoad

BranchSub

StoreXorRISC

IMB Add Load Branch

IMI Sub Store Xor

VLIW (EPIC-like)

“template” bits

Page 6: Advanced  Microarchitecture

8

VLIW Decoder• Similar to superscalar RISC decoder

Lecture 6: Superscalar Decode and Other Pipelining

Decoder

inst0

TemplateDecoder

tmplt

Decoder Decoder

inst1 inst2

decodedinst

decodedinst

decodedinst

Page 7: Advanced  Microarchitecture

9

CISC ISA• RISC focus on fast access to information

– easy decode, I$, large RF’s, D$• CISCs are older

– designed in era with fewer transistors, chips– each memory access very expensive

• pack as much work into as few bytes as possible• more “expressive” instructions

– compare to simple RISC insts– better potential code generation– more complex code generation in practice

Lecture 6: Superscalar Decode and Other Pipelining

Page 8: Advanced  Microarchitecture

10

Example: VAX• Superset of ISAs, incl. IBM360, DEC PDP-11• VAX = “Virtual Address Extension”• 16 32-bit registers• 32-bit memory addressing• Encoding:

– 1-2 byte opcode, followed by 0-6 operand specifiers, each of which may be up to 5 bytes

– Opcode implies datatype, size, # operands– Orthogonality: any opcode with any addressing

mode

Lecture 6: Superscalar Decode and Other Pipelining

Page 9: Advanced  Microarchitecture

11

VAX operand addressingMode Example MeaningRegister Add R4, R3 R4 = R4 + R3Immediate Add R4, #3 R4 = R4 + 3Displacement Add R4, 100(R1) R4 = R4 + Mem[100+R1]Register Indirect

Add R4, (R1) R4 = R4 + Mem[R1]

Indexed/Base Add R3, (R1+R2) R3 = R3 + Mem[R1+R2]Direct/Absolute Add R1, (1234) R1 = R1 + Mem[1234]Memory Indirect

Add R1, @(R3) R1 = R1 + Mem[Mem[R3]]

Auto-Increment Add R1,(R2)+ R1 = R1 + Mem[R2]; R2++

Auto-Decrement

Add R1, -(R2) R2--; R1 = R1 + Mem[R2]Lecture 6: Superscalar Decode and Other Pipelining

Any mode could be applied to any instruction!

Page 10: Advanced  Microarchitecture

12

x86• CISC, stemming from the original 4004• Example: “Move” instructions

1. General Purpose data movement– RR, MR, RM, IR, IM

2. Exchanges– EAX ↔ ECX, byte order within a register

3. Stack Manipulation– push pop R ↔ Stack, PUSHA/POPA

4. Type Conversion5. Conditional Moves

Lecture 6: Superscalar Decode and Other Pipelining

Many ways to do the same/similar operation

Page 11: Advanced  Microarchitecture

13

x86 Encoding• Basic x86 Instruction:

Lecture 6: Superscalar Decode and Other Pipelining

Prefixes0-4 bytes

Opcode1-2 bytes

Mod R/M0-1 bytes

SIB0-1 bytes

Displacement0/1/2/4 bytes

Immediate0/1/2/4 bytes

Longest Inst 15 bytesShortest Inst: 1 byte

• Opcode specifies operation, and if the Mod R/M byte is used– Most instructions use the Mod R/M byte– Mod R/M specifies if optional SIB byte is

used– Mod R/M and SIB may specify additional

constants

Page 12: Advanced  Microarchitecture

14

Mod R/M Byte

• Mode = 00: No-displacement, use Mem[ regmmm ]• Mode = 01: 8-bit displacement, Mem[ regmmm +

SExt(disp) ]• Mode = 10: 32-bit displacement (similar to

previous)• Mode = 11: Register-to-Register, use regmmm

Lecture 6: Superscalar Decode and Other Pipelining

M M r r r m m m

Mode Register R/M

1 of 8 registers

Add EBX, ECX 11 011 001Mod R/M

Add EBX, [ECX] 00 011 001Mod R/M

Page 13: Advanced  Microarchitecture

15

Exceptions• Mod=00, R/M = 5 get operand from 32-

bit immediate– Add EDX =

EDX+Mem[0cff1234]• Mod=00, 01 or 10, R/M = 4 use the “SIB”

byte– SIB = Scale/Index/Base

Lecture 6: Superscalar Decode and Other Pipelining

000101010cff1234

00010100Mod R/M

ss iii bbbSIB

bbb 5: use regbbbbbb = 5: use 32-bit imm(Mod = 00 only)

iii 4: use siiii = 4: use 0

si = regiii << ss

Page 14: Advanced  Microarchitecture

16

Opcode Confusion• There are different opcodes for AB and

BA

Lecture 6: Superscalar Decode and Other Pipelining

10001011 11000011 MOV EAX, EBX

10001001 11000011 MOV EBX, EAX

10001001 11011000 MOV EAX, EBX

• If Opcode = 0F, then use next byte as opcode

• If Opcode = D8-DF, then FP instruction

11011000 11 R/MFP opcode

Page 15: Advanced  Microarchitecture

17

x86 Decode Example

Lecture 6: Superscalar Decode and Other Pipelining

11000111

MOV regimm (use Mod R/M, 32-bit Imm to follow)

opcode10000100

Mod=2 (use 32-bit Disp)R/M = 4 (use SIB)

reg ignored

Mod R/M11000011

SIB

ss=3 Scale by 8use EAX, EBX

Disp Imm

*( (EAX<<3) + EBX + Disp ) = ImmTotal: 11 byte instruction

Note: Add 4 prefixes, andyou reach the max size

Page 16: Advanced  Microarchitecture

18

In RISC (MIPS)1. lui R1 = Disp[31:16]2. ori R1 = R1,

Disp[15:0]3. add R1 = R1 + R24. shli R3 = R3 << 35. add R3 = R3 + R16. lui R1 = Imm[31:16]7. ori R1 = R1,

Imm[15:0]8. st [R3] R1

Lecture 6: Superscalar Decode and Other Pipelining

8 instructions, 32 bits each32 bytes total

2.9x Bigger!

Page 17: Advanced  Microarchitecture

19

x86-64 / EM64T• 816 general purpose registers

– only 3-bit register fields?• Registers extended from 3264 bits each• Default: instructions still 32-bit

– New “REX” prefix byte to specify additional information

Lecture 6: Superscalar Decode and Other Pipelining

0100 m R I B opcode

m=0 64-bit modem=1 32-bit mode

md rrr mrm

R rrr

ss iii bbb

iiiI bbbB

REX

Register specifiers are now 4 bitseach: can choose 1 of 16 registers

Page 18: Advanced  Microarchitecture

20

64-bit Extensions to IA32

Lecture 6: Superscalar Decode and Other Pipelining

IA32+64-bit exts IA32

CPU architect

(Taken from Bob Colwell’s Eckert-MauchlyAward Talk, ISCA 2005)

Ugly? Scary?… but it works

Page 19: Advanced  Microarchitecture

21

x86 Decode Hardware

Lecture 6: Superscalar Decode and Other Pipelining

PrefixDecoder Left Shift

Num Prefixes

opcodedecoder

2nd opcodedecoder

Mod R/Mdecoder

SIBdecoder

Left Shift Left Shift

+ +

Instruction bytes

Page 20: Advanced  Microarchitecture

22

Decoded x86 Format• RISC: easy to expand union of needed

info– generalized opcode (not too hard)– reg1, reg2, reg3, immediate (possibly extended)– some fields ignored

• CISC: union of all possible info is huge– generalized opcode (too many options)– up to 3 regs, 2 immediates– segment information– “rep” specifiers

• would lead to 100’s of bits• common case only needs a fraction a lot of waste

Lecture 6: Superscalar Decode and Other Pipelining

Page 21: Advanced  Microarchitecture

23

x86 RISC-like mops• Each x86 instruction decoded into a variable

number of “uops” (micro-ops - Intel) or ROPs (RISC ops - AMD)– Each uop is RISC-like– Uops have limitations to keep union of info practical

Lecture 6: Superscalar Decode and Other Pipelining

ADD EAX, EBX ADD EAX, EBX 1 uop

ADD EAX, [EBX] Load tmp = [EBX]ADD EAX, tmp 2 uops

ADD [EAX], EBX Load tmp = [EAX]ADD tmp, EBX

STA EAXSTD tmp

4 uops

Page 22: Advanced  Microarchitecture

24

uop Limits• How many uops can a decoder generate?

– For complex x86 insts, many are needed (10’s, 100’s?)

– Makes decoder horribly complex– Typically there’s a limit to keep complexity

under control• One x86 instruction 1-4 uops• Most instructions translate to 1.5-2.0 uops

• Ok, what happens if a complex instruction needs more than 4 uops?

Lecture 6: Superscalar Decode and Other Pipelining

Page 23: Advanced  Microarchitecture

25

UROM/MS for Complex x86 Insts• UROM (mcode-ROM) stores the uop

equivalents for nasty x86 instructions– “Nasty” could be large/complex (> 4 uops like

PUSHA or STRREP.MOV) or obsolete instructions (AAA)

• Microsequencer (MS) is the control logic that interfaces between the post-decode pipestages, the UROM, the decoders and the PC-generation

Lecture 6: Superscalar Decode and Other Pipelining

Page 24: Advanced  Microarchitecture

26

UROM/MS Example (3 uop-wide)

Lecture 6: Superscalar Decode and Other Pipelining

ADDSTORE

SUB

Cycle 1

ADDSTASTD

SUB

Cycle 2

REP.MOVADD [ ]

Fetch- x86 instsDecode - uopsUROM - uops

SUBLOADSTORE

REP.MOVADD [ ]

XOR

Cycle 3

INCmJCCLOAD

Cycle 4

STOREINC

mJCC

Cycle 5

Cycle …

……

mJCCREP.MOVADD [ ]

XORLOAD

Cycle n Cycle n+1

ADD

Complex instructions, getuops from mcode sequencer

Page 25: Advanced  Microarchitecture

27

Superscalar CISC Decode1. Instruction Length Decode (ILD)

• Where are the instructions?• Limited decode – just enough to parse prefixes,

modes2. Shift/Alignment

• Get the right bytes to the decoders3. Decode

• Crack into uops

Lecture 6: Superscalar Decode and Other Pipelining

And then do this for N instructions per cycle!

Page 26: Advanced  Microarchitecture

28

ILD Recurrence/Loop• PCi = X• PCi+1= PCi + sizeof( Mem[PCi] )• PCi+2= PCi+1 + sizeof( Mem[PCi+1] )

= PCi + sizeof( Mem[PCi] ) + sizeof( Mem[PCi+1] )

• Can’t find start of next instruction without decoding the first

• Critical loop not pipelineable– ILD of 4 instructions per cycle imples that clock cycle

time will be 4 x latency(ILD)

Lecture 6: Superscalar Decode and Other Pipelining

Page 27: Advanced  Microarchitecture

29

Decode Implementation

Lecture 6: Superscalar Decode and Other Pipelining

Left Shifter

Decoder 3

Cycle

3

Decoder 2Decoder 1ILD dominatescycle time; not

scalable

Instruction Bytes (ex. 16 bytes)ILD (limited decode)

Length 1

Length 2

+Cycle

1

Inst 1 Inst 2 Inst 3 Remainder

Cycle

2

+

Length 3

bytesdecoded

ILD (limited decode)ILD (limited decode)

Page 28: Advanced  Microarchitecture

30

Hardware-Intensive Decode

Lecture 6: Superscalar Decode and Other Pipelining

Decode from everypossible instructionstarting point!

Giant MUXes toselect instructionbytes

ILDDecoder

ILDILDILDILDILDILDILDILDILDILDILDILDILDILDILD

DecoderDecoder

Page 29: Advanced  Microarchitecture

31

ILD in Hardware-Intensive Approach

Lecture 6: Superscalar Decode and Other Pipelining

6 bytes

4 bytes

3 bytes+

Total bytes decode = 11Previous: 3 ILD + 2add

Now: 1ILD + 2(mux+add)

Deco

der

Deco

der

Deco

der

Deco

der

Deco

der

Deco

der

Deco

der

Deco

der

Deco

der

Deco

der

Deco

der

Deco

der

Deco

der

Deco

der

Deco

der

Deco

der

Page 30: Advanced  Microarchitecture

32

Predecoding• ILD loop is hardware intensive, impacts

latency, and can consume substantial power

• Observation: when instructions A, B and C are decoded into lengths 3, 5 and 1, the next time we encounter A, B and C, their lengths will still be the same!– cache the ILD work– do once, reuse many times

Lecture 6: Superscalar Decode and Other Pipelining

Page 31: Advanced  Microarchitecture

33

Decoder Example: AMD K5

Lecture 6: Superscalar Decode and Other Pipelining

Predecode Logic

From Memoryb0 b1 b2 … b78 bytes

I$

b0 b1 b2 … b7

8 bits

+5 bits13 bytes

Decode16 (8-bit inst + 5-bit predecode)

8 (8-bit inst + 5-bit predecode)

Up to 4 ROPs

Page 32: Advanced  Microarchitecture

34

Decoder Example: AMD K5• Predecode information makes decode

easier– Instruction start/end location (ILD)– Number of ROPs needed per inst– Opcode and prefix locations

• Power/performance tradeoffs– Larger I$ (increase data by 62.5%)

• Longer I$ latency, More I$ power consumption– Remove logic from decode

• Shorter branch mispred penalty, simpler logic• Cache and reused decode work less decode power

– Longer effective IL1 miss latency

Lecture 6: Superscalar Decode and Other Pipelining

Page 33: Advanced  Microarchitecture

35

Limits on Decode• Max branches (color allocation)• Taken branches• Incomplete instructions

– x86 insts are not aligned, may span two cache lines

– can’t decode until both halves have been fetched

• Instruction complexity– decoding “complex” (2-4 uop) instructions

requires a more complex decoder; expensive to replicate

– compromise: fewer complex decoders plus simpler decoders for instructions with single-uop mappings

Lecture 6: Superscalar Decode and Other Pipelining

Page 34: Advanced  Microarchitecture

36

Decoder Example: Intel P-Pro

Lecture 6: Superscalar Decode and Other Pipelining

16 Raw Instruction Bytes

Decoder0

Decoder1

Decoder2

BranchAddress

Calculator

Fetch resteerif needed

mROM

4 uops 1 uop 1 uop

If instruction in Decoder 1 or 2 requires > 1 uop, do not generateany output, and then shift to Decoder to the left on next cycle

Only Decoder 0 can interface with the uROM and MS

Page 35: Advanced  Microarchitecture

37

Decoder Example: Intel P4

Lecture 6: Superscalar Decode and Other Pipelining

L2 CacheRaw instruction bytes

4-uop DecoderuROM

trace const. buffer

Trace Cache

Decode at mostone inst per cycle

Fetch up to 3 uopsper cycleP4 has a strangled front-end, at best it

can only deliver 3 uops per cycle; contrast to P-Pro that can deliver up to 6 uops per cycle (if they’re 4/1/1)

More on this when westudy the P4 in detail

Page 36: Advanced  Microarchitecture

38

Pipeline Control

Lecture 6: Superscalar Decode and Other Pipelining

Dispatch ROB, LSQ, RSfull, stall front-end

Disp

ROB, LSQ, RSfull, stall front-end

RenDecDecRotILDI$I$BPred

Except not everyone stalls…

This logic starting toget pretty intense

DecodeFetch

Page 37: Advanced  Microarchitecture

39

Just because there’s a stall conditionsomewhere does not imply that

everybody has to stall

Non-Uniform Stall

Lecture 6: Superscalar Decode and Other Pipelining

Disp

nops due to I$ miss

ROBFull!

not fullfull

fullfullfull

RenDecDecRotILDI$I$BP

Page 38: Advanced  Microarchitecture

40

Compressing/Serpentine Pipelines

Lecture 6: Superscalar Decode and Other Pipelining

DispROBFull!

1 entryfree

RenDecDecRotILDI$I$BPBetter “flow”, but much more

complex since need to track howmany insts can advance per stage

Page 39: Advanced  Microarchitecture

41

Lots o’ Stalls• I$ miss, ITLB miss• Decoder limitations

– x86 4-1-1 limit– branch limits (max/cycle, max taken/cycle)

• Renamer – out of physical registers

Lecture 6: Superscalar Decode and Other Pipelining

Page 40: Advanced  Microarchitecture

42

Smaller Control Domains• Separate long pipeline into multiple smaller

pipelines

Lecture 6: Superscalar Decode and Other Pipelining

BPred I$ I$ Dec Dec Dec Ren Alloc Sched

BPred I$ I$ Dec Dec Dec

Ren Alloc Sched

Page 41: Advanced  Microarchitecture

43

Smaller Control Domains (2)

• Non-decoupled pipe needed logic to simultaneously control ~10 stages

• Decoupled pipe needs multiple control logic circuits– each only needs to interact with ~5 stages (~3

real stages, plus the queue ahead and behind)

Lecture 6: Superscalar Decode and Other Pipelining

Pipeline ControlLogic

Non-decoupledI$ Dec Dec Dec Ren

Pipeline ControlLogic

Decoupled

XX

No direct control logicfor stages outside of

local pipeline

Page 42: Advanced  Microarchitecture

44

Smaller Control Domains (3)• Queues can effectively add more pipeline

stages

Lecture 6: Superscalar Decode and Other Pipelining

cycleboundary

previous stage next stage

Inter-pipe queueenqueue

logic latch dequeuelogic

previous stage next stage

• Avoid this by writing and reading in the same cycle (affects timing, complexity)

Page 43: Advanced  Microarchitecture

45

Queues provide Smoothing• Approximation to serpentine pipes

(compress only at certain locations – i.e., the queues)

• Different levels of decoupling possible depending on frequency target, power, complexity tolerance

Lecture 6: Superscalar Decode and Other Pipelining

BPred I$ I$ Dec Ren Sched

The “SimpleScalar” pipeline

Note: RS is effectivelya queue (more later)

Page 44: Advanced  Microarchitecture

46

Different Clocking Domains• Decoupling the pipe allows each segment

to operate independently (local control)• Also means each can run at different

speeds (P4)

Lecture 6: Superscalar Decode and Other Pipelining

TC TC Dec … Alloc

Sched … WB(ROB)

Commit(IAQ)

(uopQ)1x freq

(3 uops/Mclk)½x freq(6 uops/2Mclk’s 3 uops/Mclk)

2x freq(2 uops/Fclk 4 uops/Mclk)

1x freq(3 uops/Mclk)