Upload
vince
View
42
Download
0
Embed Size (px)
DESCRIPTION
Advanced Microarchitecture. Lecture 6: Superscalar Decode and Other Pipelining. RISC ISA Format. This should be review… Fixed-length MIPS all insts are 32-bits/4 bytes Few formats MIPS has 3: R-, I-, J- formats Alpha has 5: Operate, Op w/ Imm, Mem, Branch, FP - PowerPoint PPT Presentation
Citation preview
Advanced MicroarchitectureLecture 6: Superscalar Decode and Other Pipelining
2
RISC ISA Format• This should be review…
– Fixed-length• MIPS all insts are 32-bits/4 bytes
– Few formats• MIPS has 3: R-, I-, J- formats• Alpha has 5: Operate, Op w/ Imm, Mem, Branch, FP
– Regularity across formats (when possible/practical)• MIPS, Alpha opcode in same bit-position for all formats• MIPS rs & rt fields in same bit-position for I- and J-
formats• Alpha ra/fa field in same bit-position for all 5 formats
Lecture 6: Superscalar Decode and Other Pipelining
3
RISC Decode (MIPS)
000 001 010 011 100 101 110 111000 func rt j jal beq bne blez bgtz001 addi addi
u slti sltiu andi ori xori lui010 rs rs rs rs011100 lb lh lwl lw lbu lhu lwr101 sb sh swl sw swr110 lwc0 lwc1 lwc2 lwc3111 swc0 swc1 swc2 swc3
Lecture 6: Superscalar Decode and Other Pipelining
opcode6
other21
func5
R-format only
opco
de[5
,3]
opcode[2,0]
001xxx = Immediate
1xxxxx = Memory(1x0: LD, 1x1: ST)
000xxx = Br/Jump(except for 000000)
6
Superscalar Decode for RISC ISAs• To sustain X instructions per cycle, must
decode X instructions per cycle– Just duplicate the hardware
Lecture 6: Superscalar Decode and Other Pipelining
32-bit inst
Decoder
decodedinst
scalar
Decoder Decoder Decoder
32-bit inst
Decoder
decodedinst
superscalar
4-wide superscalar fetch
32-bit inst32-bit inst32-bit inst
decodedinst
decodedinst
decodedinst
1-Fetch
7
VLIW/EPIC ISAs• Compiler finds the parallelism, packs
multiple instructions into a “very long” instruction
Lecture 6: Superscalar Decode and Other Pipelining
AddLoad
BranchSub
StoreXorRISC
IMB Add Load Branch
IMI Sub Store Xor
VLIW (EPIC-like)
“template” bits
8
VLIW Decoder• Similar to superscalar RISC decoder
Lecture 6: Superscalar Decode and Other Pipelining
Decoder
inst0
TemplateDecoder
tmplt
Decoder Decoder
inst1 inst2
decodedinst
decodedinst
decodedinst
9
CISC ISA• RISC focus on fast access to information
– easy decode, I$, large RF’s, D$• CISCs are older
– designed in era with fewer transistors, chips– each memory access very expensive
• pack as much work into as few bytes as possible• more “expressive” instructions
– compare to simple RISC insts– better potential code generation– more complex code generation in practice
Lecture 6: Superscalar Decode and Other Pipelining
10
Example: VAX• Superset of ISAs, incl. IBM360, DEC PDP-11• VAX = “Virtual Address Extension”• 16 32-bit registers• 32-bit memory addressing• Encoding:
– 1-2 byte opcode, followed by 0-6 operand specifiers, each of which may be up to 5 bytes
– Opcode implies datatype, size, # operands– Orthogonality: any opcode with any addressing
mode
Lecture 6: Superscalar Decode and Other Pipelining
11
VAX operand addressingMode Example MeaningRegister Add R4, R3 R4 = R4 + R3Immediate Add R4, #3 R4 = R4 + 3Displacement Add R4, 100(R1) R4 = R4 + Mem[100+R1]Register Indirect
Add R4, (R1) R4 = R4 + Mem[R1]
Indexed/Base Add R3, (R1+R2) R3 = R3 + Mem[R1+R2]Direct/Absolute Add R1, (1234) R1 = R1 + Mem[1234]Memory Indirect
Add R1, @(R3) R1 = R1 + Mem[Mem[R3]]
Auto-Increment Add R1,(R2)+ R1 = R1 + Mem[R2]; R2++
Auto-Decrement
Add R1, -(R2) R2--; R1 = R1 + Mem[R2]Lecture 6: Superscalar Decode and Other Pipelining
Any mode could be applied to any instruction!
12
x86• CISC, stemming from the original 4004• Example: “Move” instructions
1. General Purpose data movement– RR, MR, RM, IR, IM
2. Exchanges– EAX ↔ ECX, byte order within a register
3. Stack Manipulation– push pop R ↔ Stack, PUSHA/POPA
4. Type Conversion5. Conditional Moves
Lecture 6: Superscalar Decode and Other Pipelining
Many ways to do the same/similar operation
13
x86 Encoding• Basic x86 Instruction:
Lecture 6: Superscalar Decode and Other Pipelining
Prefixes0-4 bytes
Opcode1-2 bytes
Mod R/M0-1 bytes
SIB0-1 bytes
Displacement0/1/2/4 bytes
Immediate0/1/2/4 bytes
Longest Inst 15 bytesShortest Inst: 1 byte
• Opcode specifies operation, and if the Mod R/M byte is used– Most instructions use the Mod R/M byte– Mod R/M specifies if optional SIB byte is
used– Mod R/M and SIB may specify additional
constants
14
Mod R/M Byte
• Mode = 00: No-displacement, use Mem[ regmmm ]• Mode = 01: 8-bit displacement, Mem[ regmmm +
SExt(disp) ]• Mode = 10: 32-bit displacement (similar to
previous)• Mode = 11: Register-to-Register, use regmmm
Lecture 6: Superscalar Decode and Other Pipelining
M M r r r m m m
Mode Register R/M
1 of 8 registers
Add EBX, ECX 11 011 001Mod R/M
Add EBX, [ECX] 00 011 001Mod R/M
15
Exceptions• Mod=00, R/M = 5 get operand from 32-
bit immediate– Add EDX =
EDX+Mem[0cff1234]• Mod=00, 01 or 10, R/M = 4 use the “SIB”
byte– SIB = Scale/Index/Base
Lecture 6: Superscalar Decode and Other Pipelining
000101010cff1234
00010100Mod R/M
ss iii bbbSIB
bbb 5: use regbbbbbb = 5: use 32-bit imm(Mod = 00 only)
iii 4: use siiii = 4: use 0
si = regiii << ss
16
Opcode Confusion• There are different opcodes for AB and
BA
Lecture 6: Superscalar Decode and Other Pipelining
10001011 11000011 MOV EAX, EBX
10001001 11000011 MOV EBX, EAX
10001001 11011000 MOV EAX, EBX
• If Opcode = 0F, then use next byte as opcode
• If Opcode = D8-DF, then FP instruction
11011000 11 R/MFP opcode
17
x86 Decode Example
Lecture 6: Superscalar Decode and Other Pipelining
11000111
MOV regimm (use Mod R/M, 32-bit Imm to follow)
opcode10000100
Mod=2 (use 32-bit Disp)R/M = 4 (use SIB)
reg ignored
Mod R/M11000011
SIB
ss=3 Scale by 8use EAX, EBX
Disp Imm
*( (EAX<<3) + EBX + Disp ) = ImmTotal: 11 byte instruction
Note: Add 4 prefixes, andyou reach the max size
18
In RISC (MIPS)1. lui R1 = Disp[31:16]2. ori R1 = R1,
Disp[15:0]3. add R1 = R1 + R24. shli R3 = R3 << 35. add R3 = R3 + R16. lui R1 = Imm[31:16]7. ori R1 = R1,
Imm[15:0]8. st [R3] R1
Lecture 6: Superscalar Decode and Other Pipelining
8 instructions, 32 bits each32 bytes total
2.9x Bigger!
19
x86-64 / EM64T• 816 general purpose registers
– only 3-bit register fields?• Registers extended from 3264 bits each• Default: instructions still 32-bit
– New “REX” prefix byte to specify additional information
Lecture 6: Superscalar Decode and Other Pipelining
0100 m R I B opcode
m=0 64-bit modem=1 32-bit mode
md rrr mrm
R rrr
ss iii bbb
iiiI bbbB
REX
Register specifiers are now 4 bitseach: can choose 1 of 16 registers
20
64-bit Extensions to IA32
Lecture 6: Superscalar Decode and Other Pipelining
IA32+64-bit exts IA32
CPU architect
(Taken from Bob Colwell’s Eckert-MauchlyAward Talk, ISCA 2005)
Ugly? Scary?… but it works
21
x86 Decode Hardware
Lecture 6: Superscalar Decode and Other Pipelining
PrefixDecoder Left Shift
Num Prefixes
opcodedecoder
2nd opcodedecoder
Mod R/Mdecoder
SIBdecoder
Left Shift Left Shift
+ +
Instruction bytes
22
Decoded x86 Format• RISC: easy to expand union of needed
info– generalized opcode (not too hard)– reg1, reg2, reg3, immediate (possibly extended)– some fields ignored
• CISC: union of all possible info is huge– generalized opcode (too many options)– up to 3 regs, 2 immediates– segment information– “rep” specifiers
• would lead to 100’s of bits• common case only needs a fraction a lot of waste
Lecture 6: Superscalar Decode and Other Pipelining
23
x86 RISC-like mops• Each x86 instruction decoded into a variable
number of “uops” (micro-ops - Intel) or ROPs (RISC ops - AMD)– Each uop is RISC-like– Uops have limitations to keep union of info practical
Lecture 6: Superscalar Decode and Other Pipelining
ADD EAX, EBX ADD EAX, EBX 1 uop
ADD EAX, [EBX] Load tmp = [EBX]ADD EAX, tmp 2 uops
ADD [EAX], EBX Load tmp = [EAX]ADD tmp, EBX
STA EAXSTD tmp
4 uops
24
uop Limits• How many uops can a decoder generate?
– For complex x86 insts, many are needed (10’s, 100’s?)
– Makes decoder horribly complex– Typically there’s a limit to keep complexity
under control• One x86 instruction 1-4 uops• Most instructions translate to 1.5-2.0 uops
• Ok, what happens if a complex instruction needs more than 4 uops?
Lecture 6: Superscalar Decode and Other Pipelining
25
UROM/MS for Complex x86 Insts• UROM (mcode-ROM) stores the uop
equivalents for nasty x86 instructions– “Nasty” could be large/complex (> 4 uops like
PUSHA or STRREP.MOV) or obsolete instructions (AAA)
• Microsequencer (MS) is the control logic that interfaces between the post-decode pipestages, the UROM, the decoders and the PC-generation
Lecture 6: Superscalar Decode and Other Pipelining
26
UROM/MS Example (3 uop-wide)
Lecture 6: Superscalar Decode and Other Pipelining
ADDSTORE
SUB
Cycle 1
ADDSTASTD
SUB
Cycle 2
REP.MOVADD [ ]
Fetch- x86 instsDecode - uopsUROM - uops
SUBLOADSTORE
REP.MOVADD [ ]
XOR
Cycle 3
INCmJCCLOAD
Cycle 4
STOREINC
mJCC
Cycle 5
…
Cycle …
……
mJCCREP.MOVADD [ ]
XORLOAD
Cycle n Cycle n+1
ADD
Complex instructions, getuops from mcode sequencer
27
Superscalar CISC Decode1. Instruction Length Decode (ILD)
• Where are the instructions?• Limited decode – just enough to parse prefixes,
modes2. Shift/Alignment
• Get the right bytes to the decoders3. Decode
• Crack into uops
Lecture 6: Superscalar Decode and Other Pipelining
And then do this for N instructions per cycle!
28
ILD Recurrence/Loop• PCi = X• PCi+1= PCi + sizeof( Mem[PCi] )• PCi+2= PCi+1 + sizeof( Mem[PCi+1] )
= PCi + sizeof( Mem[PCi] ) + sizeof( Mem[PCi+1] )
• Can’t find start of next instruction without decoding the first
• Critical loop not pipelineable– ILD of 4 instructions per cycle imples that clock cycle
time will be 4 x latency(ILD)
Lecture 6: Superscalar Decode and Other Pipelining
29
Decode Implementation
Lecture 6: Superscalar Decode and Other Pipelining
Left Shifter
Decoder 3
Cycle
3
Decoder 2Decoder 1ILD dominatescycle time; not
scalable
Instruction Bytes (ex. 16 bytes)ILD (limited decode)
Length 1
Length 2
+Cycle
1
Inst 1 Inst 2 Inst 3 Remainder
Cycle
2
+
Length 3
bytesdecoded
ILD (limited decode)ILD (limited decode)
30
Hardware-Intensive Decode
Lecture 6: Superscalar Decode and Other Pipelining
Decode from everypossible instructionstarting point!
Giant MUXes toselect instructionbytes
ILDDecoder
ILDILDILDILDILDILDILDILDILDILDILDILDILDILDILD
DecoderDecoder
31
ILD in Hardware-Intensive Approach
Lecture 6: Superscalar Decode and Other Pipelining
6 bytes
4 bytes
3 bytes+
Total bytes decode = 11Previous: 3 ILD + 2add
Now: 1ILD + 2(mux+add)
Deco
der
Deco
der
Deco
der
Deco
der
Deco
der
Deco
der
Deco
der
Deco
der
Deco
der
Deco
der
Deco
der
Deco
der
Deco
der
Deco
der
Deco
der
Deco
der
32
Predecoding• ILD loop is hardware intensive, impacts
latency, and can consume substantial power
• Observation: when instructions A, B and C are decoded into lengths 3, 5 and 1, the next time we encounter A, B and C, their lengths will still be the same!– cache the ILD work– do once, reuse many times
Lecture 6: Superscalar Decode and Other Pipelining
33
Decoder Example: AMD K5
Lecture 6: Superscalar Decode and Other Pipelining
Predecode Logic
From Memoryb0 b1 b2 … b78 bytes
I$
b0 b1 b2 … b7
8 bits
+5 bits13 bytes
Decode16 (8-bit inst + 5-bit predecode)
8 (8-bit inst + 5-bit predecode)
Up to 4 ROPs
34
Decoder Example: AMD K5• Predecode information makes decode
easier– Instruction start/end location (ILD)– Number of ROPs needed per inst– Opcode and prefix locations
• Power/performance tradeoffs– Larger I$ (increase data by 62.5%)
• Longer I$ latency, More I$ power consumption– Remove logic from decode
• Shorter branch mispred penalty, simpler logic• Cache and reused decode work less decode power
– Longer effective IL1 miss latency
Lecture 6: Superscalar Decode and Other Pipelining
35
Limits on Decode• Max branches (color allocation)• Taken branches• Incomplete instructions
– x86 insts are not aligned, may span two cache lines
– can’t decode until both halves have been fetched
• Instruction complexity– decoding “complex” (2-4 uop) instructions
requires a more complex decoder; expensive to replicate
– compromise: fewer complex decoders plus simpler decoders for instructions with single-uop mappings
Lecture 6: Superscalar Decode and Other Pipelining
36
Decoder Example: Intel P-Pro
Lecture 6: Superscalar Decode and Other Pipelining
16 Raw Instruction Bytes
Decoder0
Decoder1
Decoder2
BranchAddress
Calculator
Fetch resteerif needed
mROM
4 uops 1 uop 1 uop
If instruction in Decoder 1 or 2 requires > 1 uop, do not generateany output, and then shift to Decoder to the left on next cycle
Only Decoder 0 can interface with the uROM and MS
37
Decoder Example: Intel P4
Lecture 6: Superscalar Decode and Other Pipelining
L2 CacheRaw instruction bytes
4-uop DecoderuROM
trace const. buffer
Trace Cache
Decode at mostone inst per cycle
Fetch up to 3 uopsper cycleP4 has a strangled front-end, at best it
can only deliver 3 uops per cycle; contrast to P-Pro that can deliver up to 6 uops per cycle (if they’re 4/1/1)
More on this when westudy the P4 in detail
38
Pipeline Control
Lecture 6: Superscalar Decode and Other Pipelining
Dispatch ROB, LSQ, RSfull, stall front-end
Disp
ROB, LSQ, RSfull, stall front-end
RenDecDecRotILDI$I$BPred
Except not everyone stalls…
This logic starting toget pretty intense
DecodeFetch
39
Just because there’s a stall conditionsomewhere does not imply that
everybody has to stall
Non-Uniform Stall
Lecture 6: Superscalar Decode and Other Pipelining
Disp
nops due to I$ miss
ROBFull!
not fullfull
fullfullfull
RenDecDecRotILDI$I$BP
40
Compressing/Serpentine Pipelines
Lecture 6: Superscalar Decode and Other Pipelining
DispROBFull!
1 entryfree
RenDecDecRotILDI$I$BPBetter “flow”, but much more
complex since need to track howmany insts can advance per stage
41
Lots o’ Stalls• I$ miss, ITLB miss• Decoder limitations
– x86 4-1-1 limit– branch limits (max/cycle, max taken/cycle)
• Renamer – out of physical registers
Lecture 6: Superscalar Decode and Other Pipelining
42
Smaller Control Domains• Separate long pipeline into multiple smaller
pipelines
Lecture 6: Superscalar Decode and Other Pipelining
BPred I$ I$ Dec Dec Dec Ren Alloc Sched
BPred I$ I$ Dec Dec Dec
Ren Alloc Sched
43
Smaller Control Domains (2)
• Non-decoupled pipe needed logic to simultaneously control ~10 stages
• Decoupled pipe needs multiple control logic circuits– each only needs to interact with ~5 stages (~3
real stages, plus the queue ahead and behind)
Lecture 6: Superscalar Decode and Other Pipelining
Pipeline ControlLogic
Non-decoupledI$ Dec Dec Dec Ren
Pipeline ControlLogic
Decoupled
XX
No direct control logicfor stages outside of
local pipeline
44
Smaller Control Domains (3)• Queues can effectively add more pipeline
stages
Lecture 6: Superscalar Decode and Other Pipelining
cycleboundary
previous stage next stage
Inter-pipe queueenqueue
logic latch dequeuelogic
previous stage next stage
• Avoid this by writing and reading in the same cycle (affects timing, complexity)
45
Queues provide Smoothing• Approximation to serpentine pipes
(compress only at certain locations – i.e., the queues)
• Different levels of decoupling possible depending on frequency target, power, complexity tolerance
Lecture 6: Superscalar Decode and Other Pipelining
BPred I$ I$ Dec Ren Sched
The “SimpleScalar” pipeline
Note: RS is effectivelya queue (more later)
46
Different Clocking Domains• Decoupling the pipe allows each segment
to operate independently (local control)• Also means each can run at different
speeds (P4)
Lecture 6: Superscalar Decode and Other Pipelining
TC TC Dec … Alloc
Sched … WB(ROB)
Commit(IAQ)
(uopQ)1x freq
(3 uops/Mclk)½x freq(6 uops/2Mclk’s 3 uops/Mclk)
2x freq(2 uops/Fclk 4 uops/Mclk)
1x freq(3 uops/Mclk)