Lecture 6: Superscalar Decode and Other Pipelining

Advanced MicroarchitectureLecture 6: Superscalar Decode and Other Pipelining

RISC ISA Format• This should be review…

– Fixed-length• MIPS all insts are 32-bits/4 bytes

– Few formats• MIPS has 3: R-, I-, J- formats• Alpha has 5: Operate, Op w/ Imm, Mem, Branch, FP

– Regularity across formats (when possible/practical)• MIPS, Alpha opcode in same bit-position for all formats• MIPS rs & rt fields in same bit-position for I- and J-

formats• Alpha ra/fa field in same bit-position for all 5 formats

Lecture 6: Superscalar Decode and Other Pipelining

RISC Decode (MIPS)

000 001 010 011 100 101 110 111000 func rt j jal beq bne blez bgtz

001 addiaddi

uslti sltiu andi ori xori lui

010 rs rs rs rs011100 lb lh lwl lw lbu lhu lwr101 sb sh swl sw swr110 lwc0 lwc1 lwc2 lwc3111 swc0 swc1 swc2 swc3

opcode6

other21

R-format only

code[5

opcode[2,0]

001xxx = Immediate

1xxxxx = Memory(1x0: LD, 1x1: ST)

000xxx = Br/Jump(except for 000000)

Superscalar Decode for RISC ISAs• To sustain X instructions per cycle, must

decode X instructions per cycle– Just duplicate the hardware

32-bit inst

Decoder

decodedinst

scalar

Decoder Decoder Decoder

32-bit inst

Decoder

decodedinst

superscalar

4-wide superscalar fetch

32-bit inst32-bit inst32-bit inst

decodedinst

1-Fetch

VLIW/EPIC ISAs• Compiler finds the parallelism, packs

multiple instructions into a “very long” instruction

Branch

IMB Add Load Branch

IMI Sub Store Xor

VLIW (EPIC-like)

“template” bits

VLIW Decoder• Similar to superscalar RISC decoder

Decoder

TemplateDecoder

Decoder Decoder

inst1 inst2

decodedinst

CISC ISA• RISC focus on fast access to information

– easy decode, I$, large RF’s, D$• CISCs are older

– designed in era with fewer transistors, chips– each memory access very expensive

• pack as much work into as few bytes as possible• more “expressive” instructions

– compare to simple RISC insts– better potential code generation– more complex code generation in practice

Example: VAX• Superset of ISAs, incl. IBM360, DEC PDP-11• VAX = “Virtual Address Extension”• 16 32-bit registers• 32-bit memory addressing• Encoding:

– 1-2 byte opcode, followed by 0-6 operand specifiers, each of which may be up to 5 bytes

– Opcode implies datatype, size, # operands– Orthogonality: any opcode with any addressing

VAX operand addressingMode Example Meaning

Register Add R4, R3 R4 = R4 + R3

Immediate Add R4, #3 R4 = R4 + 3

Displacement Add R4, 100(R1) R4 = R4 + Mem[100+R1]

Register Indirect

Add R4, (R1) R4 = R4 + Mem[R1]

Indexed/Base Add R3, (R1+R2) R3 = R3 + Mem[R1+R2]

Direct/Absolute Add R1, (1234) R1 = R1 + Mem[1234]

Memory Indirect

Add R1, @(R3) R1 = R1 + Mem[Mem[R3]]

Auto-Increment Add R1,(R2)+ R1 = R1 + Mem[R2]; R2++

Auto-Decrement

Add R1, -(R2) R2--; R1 = R1 + Mem[R2]Lecture 6: Superscalar Decode and Other Pipelining

Any mode could be applied to any instruction!

x86• CISC, stemming from the original 4004• Example: “Move” instructions

1. General Purpose data movement– RR, MR, RM, IR, IM

2. Exchanges– EAX ↔ ECX, byte order within a register

3. Stack Manipulation– push pop R ↔ Stack, PUSHA/POPA

4. Type Conversion5. Conditional Moves

Many ways to do the same/similar operation

x86 Encoding• Basic x86 Instruction:

Prefixes0-4 bytes

Opcode1-2 bytes

Mod R/M0-1 bytes

SIB0-1 bytes

Displacement0/1/2/4 bytes

Immediate0/1/2/4 bytes

Longest Inst 15 bytesShortest Inst: 1 byte

• Opcode specifies operation, and if the Mod R/M byte is used– Most instructions use the Mod R/M byte– Mod R/M specifies if optional SIB byte is

used– Mod R/M and SIB may specify additional

constants

Mod R/M Byte

• Mode = 00: No-displacement, use Mem[ regmmm ]

• Mode = 01: 8-bit displacement, Mem[ regmmm + SExt(disp) ]

• Mode = 10: 32-bit displacement (similar to previous)

• Mode = 11: Register-to-Register, use regmmm

M M r r r m m m

Mode Register R/M

1 of 8 registers

Add EBX, ECX 11 011 001Mod R/M

Add EBX, [ECX] 00 011 001Mod R/M

Exceptions• Mod=00, R/M = 5 get operand from 32-

bit immediate– Add EDX =

EDX+Mem[0cff1234]• Mod=00, 01 or 10, R/M = 4 use the “SIB”

byte– SIB = Scale/Index/Base

000101010cff1234

00010100

Mod R/M

ss iii bbb

SIBbbb 5: use regbbb

bbb = 5: use 32-bit imm(Mod = 00 only)

iii 4: use siiii = 4: use 0

si = regiii << ss

Opcode Confusion• There are different opcodes for AB and

10001011 11000011 MOV EAX, EBX

10001001 11000011 MOV EBX, EAX

10001001 11011000 MOV EAX, EBX

• If Opcode = 0F, then use next byte as opcode

• If Opcode = D8-DF, then FP instruction

11011000 11 R/M

FP opcode

x86 Decode Example

11000111

MOV regimm (use Mod R/M, 32-bit Imm to follow)

opcode10000100

Mod=2 (use 32-bit Disp)R/M = 4 (use SIB)

reg ignored

Mod R/M

11000011SIB

ss=3 Scale by 8use EAX, EBX

Disp Imm

*( (EAX<<3) + EBX + Disp ) = ImmTotal: 11 byte instruction

Note: Add 4 prefixes, andyou reach the max size

In RISC (MIPS)

1. lui R1 = Disp[31:16]2. ori R1 = R1,

Disp[15:0]3. add R1 = R1 + R24. shli R3 = R3 << 35. add R3 = R3 + R16. lui R1 = Imm[31:16]7. ori R1 = R1,

Imm[15:0]8. st [R3] R1

8 instructions, 32 bits each32 bytes total

2.9x Bigger!

x86-64 / EM64T• 816 general purpose registers

– only 3-bit register fields?• Registers extended from 3264 bits each• Default: instructions still 32-bit

– New “REX” prefix byte to specify additional information

0100 m R I B opcode

m=0 64-bit modem=1 32-bit mode

md rrr mrm

ss iii bbb

iiiIbbbB

Register specifiers are now 4 bitseach: can choose 1 of 16 registers

64-bit Extensions to IA32

IA32+64-bit exts IA32

CPU architect

(Taken from Bob Colwell’s Eckert-MauchlyAward Talk, ISCA 2005)

Ugly? Scary?… but it works

x86 Decode Hardware

PrefixDecoder

Left Shift

Num Prefixes

opcodedecoder

2nd opcodedecoder

Mod R/Mdecoder

SIBdecoder

Left Shift Left Shift

Instruction bytes

Decoded x86 Format• RISC: easy to expand union of needed

info– generalized opcode (not too hard)– reg1, reg2, reg3, immediate (possibly extended)– some fields ignored

• CISC: union of all possible info is huge– generalized opcode (too many options)– up to 3 regs, 2 immediates– segment information– “rep” specifiers

• would lead to 100’s of bits• common case only needs a fraction a lot of waste

x86 RISC-like mops• Each x86 instruction decoded into a variable

number of “uops” (micro-ops - Intel) or ROPs (RISC ops - AMD)– Each uop is RISC-like– Uops have limitations to keep union of info practical

ADD EAX, EBX ADD EAX, EBX 1 uop

ADD EAX, [EBX] Load tmp = [EBX]ADD EAX, tmp

2 uops

ADD [EAX], EBX Load tmp = [EAX]ADD tmp, EBX

STA EAXSTD tmp

4 uops

uop Limits• How many uops can a decoder generate?

– For complex x86 insts, many are needed (10’s, 100’s?)

– Makes decoder horribly complex– Typically there’s a limit to keep complexity

under control• One x86 instruction 1-4 uops• Most instructions translate to 1.5-2.0 uops

• Ok, what happens if a complex instruction needs more than 4 uops?

UROM/MS for Complex x86 Insts• UROM (mcode-ROM) stores the uop

equivalents for nasty x86 instructions– “Nasty” could be large/complex (> 4 uops like

PUSHA or STRREP.MOV) or obsolete instructions (AAA)

• Microsequencer (MS) is the control logic that interfaces between the post-decode pipestages, the UROM, the decoders and the PC-generation

UROM/MS Example (3 uop-wide)

Cycle 1

Cycle 2

REP.MOV

ADD [ ]

Fetch- x86 insts

Decode - uops

UROM - uops

REP.MOV

ADD [ ]

Cycle 3

Cycle 4

Cycle 5

Cycle …

mJCCREP.MOV

ADD [ ]

Cycle n Cycle n+1

Complex instructions, getuops from mcode sequencer

Superscalar CISC Decode

1. Instruction Length Decode (ILD)• Where are the instructions?

• Limited decode – just enough to parse prefixes, modes

2. Shift/Alignment• Get the right bytes to the decoders

3. Decode• Crack into uops

And then do this for N instructions per cycle!

ILD Recurrence/Loop• PCi = X

• PCi+1= PCi + sizeof( Mem[PCi] )

• PCi+2= PCi+1 + sizeof( Mem[PCi+1] )

= PCi + sizeof( Mem[PCi] ) + sizeof( Mem[PCi+1] )

• Can’t find start of next instruction without decoding the first

• Critical loop not pipelineable– ILD of 4 instructions per cycle imples that clock cycle

time will be 4 x latency(ILD)

Decode Implementation

Left Shifter

Decoder 3

Decoder 2Decoder 1

ILD dominatescycle time; not

scalable

Instruction Bytes (ex. 16 bytes)

ILD (limited decode)Length 1

Length 2

Inst 1 Inst 2 Inst 3 Remainder

Length 3

bytesdecoded

ILD (limited decode)

Hardware-Intensive Decode

Decode from everypossible instructionstarting point!

Giant MUXes toselect instructionbytes

DecoderIL

DecoderDecoder

ILD in Hardware-Intensive Approach

6 bytes

4 bytes

3 bytes+

Total bytes decode = 11Previous: 3 ILD + 2add

Now: 1ILD + 2(mux+add)

Predecoding• ILD loop is hardware intensive, impacts

latency, and can consume substantial power

• Observation: when instructions A, B and C are decoded into lengths 3, 5 and 1, the next time we encounter A, B and C, their lengths will still be the same!– cache the ILD work– do once, reuse many times

Decoder Example: AMD K5

Predecode Logic

From Memory

b0 b1 b2 … b78 bytes

b0 b1 b2 … b7

8 bits

+5 bits

13 bytes

Decode

16 (8-bit inst + 5-bit predecode)

8 (8-bit inst + 5-bit predecode)

Up to 4 ROPs

Decoder Example: AMD K5• Predecode information makes decode

easier– Instruction start/end location (ILD)– Number of ROPs needed per inst– Opcode and prefix locations

• Power/performance tradeoffs– Larger I$ (increase data by 62.5%)

• Longer I$ latency, More I$ power consumption– Remove logic from decode

• Shorter branch mispred penalty, simpler logic• Cache and reused decode work less decode power

– Longer effective IL1 miss latency

Limits on Decode• Max branches (color allocation)• Taken branches• Incomplete instructions

– x86 insts are not aligned, may span two cache lines

– can’t decode until both halves have been fetched

• Instruction complexity– decoding “complex” (2-4 uop) instructions

requires a more complex decoder; expensive to replicate

– compromise: fewer complex decoders plus simpler decoders for instructions with single-uop mappings

Decoder Example: Intel P-Pro

16 Raw Instruction Bytes

Decoder0

Decoder1

Decoder2

BranchAddress

Calculator

Fetch resteerif needed

4 uops 1 uop 1 uop

If instruction in Decoder 1 or 2 requires > 1 uop, do not generateany output, and then shift to Decoder to the left on next cycle

Only Decoder 0 can interface with the uROM and MS

Decoder Example: Intel P4

L2 Cache

Raw instruction bytes

4-uop DecoderuROM

trace const. buffer

Trace Cache

Decode at mostone inst per cycle

Fetch up to 3 uopsper cycle

P4 has a strangled front-end, at best it can only deliver 3 uops per cycle; contrast to P-Pro that can deliver up to 6 uops per cycle (if they’re 4/1/1)

More on this when westudy the P4 in detail

Pipeline Control

Dispatch ROB, LSQ, RSfull, stall front-end

ROB, LSQ, RSfull, stall front-end

RenDecDecRotILDI$I$BPred

Except not everyone stalls…

This logic starting toget pretty intense

DecodeFetch

Just because there’s a stall conditionsomewhere does not imply that

everybody has to stall

Non-Uniform Stall

nops due to I$ miss

ROBFull!

not full

RenDecDecRotILDI$I$BP

Compressing/Serpentine Pipelines

ROBFull!

1 entryfree

RenDecDecRotILDI$I$BP

Better “flow”, but much morecomplex since need to track howmany insts can advance per stage

Lots o’ Stalls• I$ miss, ITLB miss• Decoder limitations

– x86 4-1-1 limit– branch limits (max/cycle, max taken/cycle)

• Renamer – out of physical registers

Smaller Control Domains• Separate long pipeline into multiple smaller

pipelines

BPred I$ I$ Dec Dec Dec Ren Alloc Sched

BPred I$ I$ Dec Dec Dec

Ren Alloc Sched

Smaller Control Domains (2)

• Non-decoupled pipe needed logic to simultaneously control ~10 stages

• Decoupled pipe needs multiple control logic circuits– each only needs to interact with ~5 stages (~3

real stages, plus the queue ahead and behind)

Pipeline ControlLogic

Non-decoupled

I$ Dec Dec Dec Ren

Pipeline ControlLogic

Decoupled

No direct control logicfor stages outside of

local pipeline

Smaller Control Domains (3)

• Queues can effectively add more pipeline stages

cycleboundary

previous stage next stage

Inter-pipe queueenqueue

logic latchdequeue

previous stage next stage

• Avoid this by writing and reading in the same cycle (affects timing, complexity)

Queues provide Smoothing• Approximation to serpentine pipes

(compress only at certain locations – i.e., the queues)

• Different levels of decoupling possible depending on frequency target, power, complexity tolerance

BPred I$ I$ Dec Ren Sched

The “SimpleScalar” pipeline

Note: RS is effectivelya queue (more later)

Different Clocking Domains• Decoupling the pipe allows each segment

to operate independently (local control)• Also means each can run at different

speeds (P4)

TC TC Dec … Alloc

Sched … WB

Commit

(uopQ)

1x freq(3 uops/Mclk)½x freq

(6 uops/2Mclk’s 3 uops/Mclk)

2x freq(2 uops/Fclk 4 uops/Mclk)

1x freq(3 uops/Mclk)

Lecture 6: Superscalar Decode and Other Pipelining

Documents

7 a3 Instruction-Level Parallel Processing: 2 History, …ece740/f11/lib/exe/fetch.php?...Instmaion-level parallelism, VLIW processors, superscalar processors, pipelining, multiple

CS311 Lecture: Pipelining and Superscalar Architectures › local › courses › cps311 › lectures... · CS311 Lecture: Pipelining and Superscalar Architectures Last revised June

Lecture 2: Pipelining and Superscalar Review. Motivation: Increase throughput with little increase in cost (hardware, power, complexity, etc.) Bandwidth

Superscalar Processors

BM-311 Bilggyisayar Mimarisi · 2018-09-26 · RISC ve superscalar işlemciler fixed-length komutları kullanır. Komut formatları Komut uzunluklar ının aynı olması decode işlemini

Superscalar Processing

CSE502: Computer Architecture Superscalar Decode

Architecture Basics ECE 454 Computer Systems Programming Topics: Basics of Computer Architecture Pipelining, Branches, Superscalar, Out of order Execution

Dynamic Scheduling Techniques for VLIW Processors · 2018-09-13 · instruction-level parallelism, VLIW processors, superscalar processors, pipelining, multiple operation issue, scoreboarding,

Υπρβαθμω ή (superscalar) Οράνωση Υπολοισών€¦ · Ερονής σωλήνωση ο Power4 (2001) I-Cache PC BR Scan BR Predict Fetch Q Decode Reorder Buffer

Superscalar - summary

CH14 Instruction Level Parallelism and Superscalar Processors CH01 TECH Computer Science Decode and issue more and one instruction at a time Executing

Jin, Hai - Huazhong University of Science and …grid.hust.edu.cn/uploads/files/lectures/lect2.pdf · Jin, Hai School of Computer ... Uniprocessor Parallelism Pipelining, Superscalar,

Superscalar Architectures

Hardware-Software Approaches to In-Circuit Emulation for ......such as pipelining and superscalar execution so as to retain the logical sequence and eliminate false conditions caused

Chapter 4 Superscalar Organization. Superscalar Organization Superscalar machines go beyond just a single-instruction pipeline Able to simultaneously

CSE 502: Computer Architecture - Stony Brook …...CSE502: Computer Architecture Superscalar Decode for RISC ISAs •Decode X insns. per cycle (e.g., 4-wide) –Just duplicate the

Performance and Optimisation · • superscalar, SIMD(vector), pipelining • Locality is at least as important as computation • Temporal: re-use of recently used data • Spatial:

PIPELINING basics - · PIPELINING basics • A pipelined architecture for MIPS • Hurdles in pipelining • Simple solutions to pipelining hurdles • Advanced pipelining

COMP Superscalar