ECE 4100/6100 Guest Lecture: P6 & NetBurst Microarchitecture

ECE4100/6100, H-H. S. Lee

Prof. Hsien-Hsin Sean Lee
School of ECE
Georgia Institute of Technology
February 11, 2003

Why study P6 from last millennium?

• A paradigm shift from Pentium
• A RISC core disguised as a CISC
• Huge market success:
  - Microarchitecture
  - And stock price
• Architected by former VLIW and RISC folks
  - Multiflow (pioneer in VLIW architecture for super-minicomputers)
  - Intel i960 (Intel's RISC for graphics and embedded controllers)
• NetBurst (P4's microarchitecture) is based on P6

P6 Basics

• One implementation of the IA32 architecture
• Super-pipelined processor
• 3-way superscalar
• In-order front-end and back-end
• Dynamic execution engine (restricted dataflow)
• Speculative execution
• P6 microarchitecture family processors include
  - Pentium Pro
  - Pentium II (PPro + MMX + 2x caches: 16KB I / 16KB D)
  - Pentium III (P-II + SSE + enhanced MMX, e.g. PSADBW)
  - Celeron (without MP support)
  - Later P-II/P-III/Celeron all have on-die L2 cache

x86 Platform Architecture

[Figure: platform block diagram. Blocks shown: host processor (P6 core with L1 cache, SRAM); L2 cache (SRAM, on-die or on-package) on the back-side bus; front-side bus to the chipset (MCH and ICH); system memory (DRAM); graphics processor (GPU) with local frame buffer, attached via AGP; PCI, USB, and I/O.]

Pentium III Die Map

EBL/BBL - External/Backside Bus Logic
MOB - Memory Order Buffer
Packed FPU - Floating Point Unit for SSE
IEU - Integer Execution Unit
FAU - Floating Point Arithmetic Unit
MIU - Memory Interface Unit
DCU - Data Cache Unit (L1)
PMH - Page Miss Handler
DTLB - Data TLB
BAC - Branch Address Calculator
RAT - Register Alias Table
SIMD - Packed Floating Point Unit
RS - Reservation Station
BTB - Branch Target Buffer
TAP - Test Access Port
IFU - Instruction Fetch Unit and L1 I-Cache
ID - Instruction Decode
ROB - Reorder Buffer
MS - Micro-instruction Sequencer

ISA Enhancement (on top of Pentium)

• CMOVcc / FCMOVcc r, r/m
  - Conditional move (predicated move) instructions
  - Based on condition code (cc)
• FCOMI/P: compare FP stack and set integer flags
• RDPMC/RDTSC instructions
• Uncacheable Speculative Write-Combining (USWC): weakly ordered memory type for graphics memory
• MMX in Pentium II
  - SIMD integer operations
• SSE in Pentium III
  - Prefetches (non-temporal nta + temporal t0, t1, t2), sfence
  - SIMD single-precision FP operations

P6 Pipelining

[Figure: P6 pipeline stage diagram. Recoverable labels:
 In-order front end (stages 11-17 and 20-22): Next IP, I-Cache, ILD, Rotate, Dec1, Dec2, Br Dec, RS Write, RAT, IDQ.
 Single-cycle instruction pipeline (stages 31-33, 82-83): RS schd, RS Disp, Exec / WB.
 Multi-cycle instruction pipeline (stages 31-33, 81-83): Exec2 ... Exec n.
 Non-blocking memory pipeline (stages 31-33, 42-43, 81-83): AGU, DCache1, DCache2.
 Blocking memory pipeline (stages 31-33, 40-43, 81-83): AGU, MOB blk, MOB wr, MOB disp, DCache1, DCache2, MOB wakeup.
 Writeback stages: 81: Mem/FP WB, 82: Int WB, 83: Data WB.
 Retirement (stages 91-93): Ret ptr wr, Ret ROB rd, RRF wr.
 Marked boundaries and delays: FE in-order boundary, retirement in-order boundary, RS scheduling delay, ROB scheduling delay, MOB scheduling delay.
 Simplified pipeline: IFU1, IFU2, IFU3, DEC1, DEC2, RAT, ROB, DIS, EX, RET1, RET2.]

P6 Microarchitecture

[Figure: P6 block diagram. Blocks shown: bus interface unit (to the external bus, across the chip boundary), instruction fetch unit, BTB/BAC, instruction decoders, microcode sequencer, register alias table, allocator, reservation station, ROB & retire RF, execution units (IEU/JEU, FEU, MMX, AGU, MIU), memory order buffer, data cache unit (L1). Cluster labels: instruction fetch cluster, issue cluster, out-of-order cluster, memory cluster, bus cluster. Arrows distinguish control flow from (restricted) data flow.]

Instruction Fetching Unit

• IFU1: Initiate fetch, requesting 16 bytes at a time
• IFU2: Instruction length decoder, mark instruction boundaries, BTB makes prediction
• IFU3: Align instructions to 3 decoders in 4-1-1 format

[Figure: IFU datapath. Blocks shown: next-PC mux (inputs include the branch target buffer and other fetch requests), linear address, instruction TLB (physical address), instruction cache, victim cache, streaming buffer, select mux, ILD (length marks), instruction rotator, instruction buffer. Also labeled: prediction marks and #bytes consumed by ID.]

Dynamic Branch Prediction

• Similar to a 2-level PAs design
• Associated with each BTB entry
• With a 16-entry Return Stack Buffer
• 4 branch predictions per cycle (due to 16-byte fetch per cycle)
• Static prediction provided by the Branch Address Calculator when the BTB misses (see next slide)

[Figure: 512-entry BTB with ways W0-W3. Each entry's Branch History Register (BHR) indexes Pattern History Tables (PHT, patterns 0000 to 1111) of 2-bit saturating counters to produce the prediction. The branch result (Rc) updates the counter; the new (speculative) history is written back by a speculative update.]

Static Branch Prediction

[Figure: static prediction flowchart. Decision nodes: BTB miss?, PC-relative?, Conditional?, Backwards?, Return?. Outcomes: BTB's decision (when the BTB hits), Taken (return, indirect jump, unconditional PC-relative, backward conditional), Not Taken.]
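One plausible reading of the flowchart, written as a function. This is a hedged reconstruction from the node and outcome labels only (the arrows did not survive extraction); the function name, argument names, and the returned reason strings are illustrative assumptions.

```python
def static_predict(pc_relative, conditional=False, backward=False, is_return=False):
    """Predict a branch that missed the BTB; returns (prediction, reason).

    Assumed reading of the flowchart: the only not-taken outcome is a
    forward conditional PC-relative branch.
    """
    if not pc_relative:
        # Not PC-relative: either a return or some other indirect jump.
        return ("taken", "return" if is_return else "indirect jump")
    if not conditional:
        return ("taken", "unconditional PC-relative")
    if backward:
        # Backward conditional branches are typically loop back-edges.
        return ("taken", "backward conditional")
    return ("not taken", "forward conditional")
```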

X86 Instruction Decode

• 4-1-1 decoder
• Decode rate depends on instruction alignment
• DEC1: translate x86 into micro-operations (μops)
• DEC2: move decoded μops to ID queue
• MS performs translations either
  - Generate entire μop sequence from microcode ROM
  - Receive 4 μops from complex decoder, and the rest from microcode ROM

[Figure: IFU3 feeds one complex decoder (1-4 μops) and two simple decoders (1 μop each), together with the micro-instruction sequencer (MS); results go to the instruction decoder queue (6 μops).]

Next 3 inst    #Inst to dec
S,S,S          3
S,S,C          First 2
S,C,S          First 1
S,C,C          First 1
C,S,S          3
C,S,C          First 2
C,C,S          First 1
C,C,C          First 1

S: Simple, C: Complex
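The table above follows from one rule: the first slot can take any instruction, while the second and third slots only take simple ones, so decoding stops at the first complex instruction that is not in slot 0. A minimal sketch of that rule (the function name and the `'S'`/`'C'` encoding are illustrative assumptions):

```python
def instructions_decoded(kinds):
    """Count how many of the next three instructions decode this cycle.

    kinds: sequence such as ['S', 'C', 'S'] (S = simple, C = complex).
    """
    count = 0
    for slot, kind in enumerate(kinds[:3]):
        # Slot 0 is the complex decoder and accepts anything;
        # slots 1 and 2 are simple decoders.
        if slot > 0 and kind == "C":
            break
        count += 1
    return count
```

For example, `instructions_decoded("SSC")` stops before the complex instruction in the third slot, matching the "First 2" row.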

Allocator

• The interface between in-order and out-of-order pipelines
• Allocates
  - "3-or-none" μops per cycle into RS, ROB
  - "all-or-none" in MOB (LB and SB)
• Generates the physical destination (Pdst) from the ROB and passes it to the Register Alias Table (RAT)
• Stalls upon shortage of resources

Register Alias Table (RAT)

• Register renaming for 8 integer registers, 8 floating point (stack) registers and flags: 3 μops per cycle
• 40 80-bit physical registers embedded in the ROB (thereby, 6 bits to specify PSrc)
• RAT looks up physical ROB locations for renamed sources based on the RRF bit

[Figure: RAT block diagram. Logical sources from the in-order queue index the integer RAT array and the FP RAT array (with FP TOS adjust), giving the array physical source (Psrc). Integer and FP overrides, using the physical ROB pointers from the allocator, produce the final RAT PSrc's.]

[Figure: renaming example. RAT entries for EAX, EBX, ECX, EDX each hold an RRF bit and a PSrc pointing into either the ROB or the RRF. Values shown include RRF bits 0, 0, 0, 1 and ROB entries 25, 2, 15, with ECX highlighted.]
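The lookup described on this slide can be sketched as a small table that records, for each logical register, whether its current value lives in the retirement register file (RRF bit set) or in a ROB entry. This is a simplified illustration, not the P6 design: the class and method names, the four-register subset, and the sequential ROB pointer are assumptions, and partial registers, flags, and the FP stack are omitted.

```python
class RAT:
    """Maps logical registers to either the RRF or a ROB entry."""

    def __init__(self, regs=("EAX", "EBX", "ECX", "EDX"), rob_size=40):
        self.rob_size = rob_size
        self.next_rob = 0  # hypothetical allocator pointer into the ROB
        # Each entry is (rrf_bit, psrc); rrf_bit == 1 means "read the RRF".
        self.table = {r: (1, None) for r in regs}

    def rename(self, dst, srcs):
        # Look up sources before updating dst, so that an op reading its own
        # destination sees the previous mapping.
        psrcs = []
        for s in srcs:
            rrf_bit, psrc = self.table[s]
            psrcs.append(("RRF", s) if rrf_bit else ("ROB", psrc))
        pdst = self.next_rob
        self.next_rob = (self.next_rob + 1) % self.rob_size
        self.table[dst] = (0, pdst)
        return pdst, psrcs

    def retire(self, dst, pdst):
        # Only flip back to the RRF if no younger op has renamed dst since.
        if self.table[dst] == (0, pdst):
            self.table[dst] = (1, None)
```

A ROB index below 40 fits in 6 bits, which is where the 6-bit PSrc field on the slide comes from.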

Partial Register Width Renaming

• 32/16-bit accesses:
  - Read from low bank
  - Write to both banks
• 8-bit RAT accesses: depending on which bank is being written

μop0: MOV AL = (a)
μop1: MOV AH = (b)
μop2: ADD AL = (c)
μop3: ADD AH = (d)

[Figure: RAT block diagram as on the previous slide, with the integer RAT array split into INT Low Bank (32b/16b/L): 8 entries, and INT High Bank (H): 4 entries. Each entry holds Size(2), RRF(1), PSrc(6). Physical ROB pointers come from the allocator.]

Partial Stalls due to RAT

• Partial register stalls: occur when a write to a smaller (e.g. 8/16-bit) register is followed by a read of the larger (e.g. 32-bit) register
• Partial flags stalls: occur when a subsequent instruction reads more flags than a prior unretired instruction touches

[Figure: a write to AX followed by a read of EAX.]

Partial register stalls
    MOVB AL, m8   ;
    ADD EAX, m32  ; stall

Idiom Fix (1)
    XOR EAX, EAX
    MOVB AL, m8   ;
    ADD EAX, m32  ; no stall

Idiom Fix (2)
    SUB EAX, EAX
    MOVB AL, m8   ;
    ADD EAX, m32  ; no stall

Partial flag stalls (1)
    CMP EAX, EBX
    INC ECX
    JBE XX        ; stall
  JBE reads both ZF and CF while INC affects (ZF, OF, SF, AF, PF)

Partial flag stalls (2)
    TEST EBX, EBX
    LAHF          ; stall
  LAHF loads low byte of EFLAGS
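The partial register rule and the zeroing idiom can be captured in a toy checker. This is a sketch under stated assumptions: the op encoding `(kind, reg, width)`, the function name, and the bookkeeping are invented for illustration, nothing ever retires during the sequence, and partial flag stalls are not modeled.

```python
def has_partial_register_stall(ops):
    """Return True if a wide read follows a narrower in-flight write.

    ops: list of (kind, reg, width) with kind in {'zero', 'write', 'read'};
    'zero' stands for the XOR r,r / SUB r,r idiom on the full register.
    """
    pending = {}   # reg -> width of the latest narrow write still in flight
    clean = set()  # registers zeroed by the idiom and not fully rewritten
    for kind, reg, width in ops:
        if kind == "zero":
            clean.add(reg)
            pending.pop(reg, None)
        elif kind == "write":
            if width == 32:
                # A full-width write supersedes any narrow write.
                pending.pop(reg, None)
                clean.discard(reg)
            elif reg not in clean:
                pending[reg] = width
        elif kind == "read":
            if reg in pending and width > pending[reg]:
                return True
    return False
```

The first slide example maps to `[("write", "EAX", 8), ("read", "EAX", 32)]`, and Idiom Fix (1) prepends `("zero", "EAX", 32)`.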

Reservation Stations

• Gateway to execution: binds at most 5 μops per cycle, one to each port
• 20-entry μop buffer bridging the in-order and out-of-order engines
• RS fields include μop opcode, data valid bits, Pdst, Psrc, source data, BrPred, etc.
• Oldest-first FIFO scheduling when multiple μops are ready in the same cycle

[Figure: RS dispatch ports. Port 0: IEU0, Fadd, Fmul, Imul, Div. Port 1: IEU1, JEU. Port 2: AGU0, load address (LDA). Port 3: AGU1, store address (STA). Port 4: store data (STD). Packed FP units Pfadd, Pfmul, Pfshuf are also shown. Results return on WB bus 0 and WB bus 1; loaded data comes from the MOB/DCU; retired data goes from the ROB to the RRF.]

ReOrder Buffer

• A 40-entry circular buffer
  - Similar to that described in [SmithPleszkun85]
  - 157 bits wide
  - Provides 40 alias physical registers
• Out-of-order completion
• Deposits exception in each entry
• Retirement (or de-allocation)
  - After resolving prior speculation
  - Handle exceptions thru MS
  - Clear OOO state when a mis-predicted branch or exception is detected
  - 3 μops per cycle in program order
  - For multi-μop x86 instructions: none or all (atomic)

[Figure: ALLOC, RAT, and RS around the ROB; the ROB retires into the RRF and signals the MS for exceptions (code assist).]
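The combination of out-of-order completion and in-order retirement can be sketched with a circular buffer. This is a minimal model under stated assumptions: the class and method names are illustrative, entries carry only a done bit, and exceptions, mis-prediction flushes, and the atomic retirement of multi-μop instructions are left out.

```python
class ReorderBuffer:
    """Circular buffer: allocate in order, complete in any order, retire in order."""

    def __init__(self, size=40, retire_width=3):
        self.size = size
        self.retire_width = retire_width
        self.done = [False] * size
        self.head = 0    # oldest allocated entry
        self.count = 0   # number of allocated entries

    def allocate(self):
        if self.count == self.size:
            return None  # ROB full: the allocator would stall
        idx = (self.head + self.count) % self.size
        self.done[idx] = False
        self.count += 1
        return idx

    def complete(self, idx):
        self.done[idx] = True

    def retire(self):
        """Retire up to retire_width completed entries from the head."""
        retired = []
        while (self.count and len(retired) < self.retire_width
               and self.done[self.head]):
            retired.append(self.head)
            self.head = (self.head + 1) % self.size
            self.count -= 1
        return retired
```

An entry that finished early still waits at retirement until every older entry has completed, which is what keeps architectural state in program order.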

Memory Execution Cluster

• Manages data memory accesses
  - Address translation
  - Detect violation of access ordering
• Fill buffers (FB) in DCU (similar to MSHR [Kroft'81]) for handling cache misses (non-blocking)

[Figure: memory cluster blocks. The RS / ROB issue LD, STA, and STD; LD and STA go through the DTLB to the DCU; also shown are the load buffer, store buffer, fill buffers (FB), and EBL.]

Memory Order Buffer (MOB)

• Allocated by ALLOC
• A second-order RS for memory operations
• 1 μop for load; 2 μops for store: Store Address (STA) and Store Data (STD)
• MOB
  - 16-entry load buffer (LB)
  - 12-entry store address buffer (SAB)
  - SAB works in unison with
      - Store data buffer (SDB) in MIU
      - Physical Address Buffer (PAB) in DCU
  - Store Buffer (SB): SAB + SDB + PAB
• Senior stores
  - Upon STD/STA retired from ROB
  - SB marks the store "senior"
  - Senior stores are committed back to memory in program order when the bus is idle or the SB is full
• Prefetch instructions in P-III
  - Senior load behavior
  - Due to no explicit architectural destination

Store Coloring

• ALLOC assigns Store Buffer ID (SBID) in program order
• ALLOC tags loads with the most recent SBID
• Check loads against stores with equal or older SBIDs (i.e. stores earlier in program order) for potential address conflicts
• SDB forwards data if a conflict is detected

x86 instructions       μops          store color
mov (0x1220), ebx      std (ebx)     2
                       sta 0x1220    2
mov (0x1110), eax      std (eax)     3
                       sta 0x1100    3
mov ecx, (0x1220)      ld            3
mov edx, (0x1280)      ld            3
mov (0x1400), edx      std (edx)     4
                       sta 0x1400    4
mov edx, (0x1380)      ld            4
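The coloring in the table can be reproduced with a few lines. This is an illustrative sketch, not the MOB logic: the function names, the `('st'|'ld', address)` encoding, the starting SBID of 2 (chosen to match the table), and the exact-address comparison are assumptions. A load is compared only against stores that precede it in program order, that is, stores whose SBID does not exceed the load's color.

```python
def color_ops(ops, first_sbid=2):
    """Tag each op with a store color.

    ops: list of ('st' | 'ld', address) in program order.
    Stores get a fresh SBID; loads inherit the most recent one.
    """
    colored = []
    last = first_sbid - 1
    for kind, addr in ops:
        if kind == "st":
            last += 1
        colored.append((kind, addr, last))
    return colored


def forwarding_store(colored, load_index):
    """Return the youngest earlier store to the load's address, or None."""
    _, addr, color = colored[load_index]
    match = None
    for kind, a, c in colored:
        # Only stores at or before the load's color precede it.
        if kind == "st" and c <= color and a == addr:
            match = (kind, a, c)
    return match


program = [("st", 0x1220), ("st", 0x1110), ("ld", 0x1220),
           ("ld", 0x1280), ("st", 0x1400), ("ld", 0x1380)]
colored = color_ops(program)
```

Here the load from 0x1220 carries color 3 and finds the store with color 2 to the same address, so the store data could be forwarded; the load from 0x1280 finds no conflict.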

Memory Type Range Registers (MTRR)

• Control registers written by the system (OS)
• Supporting memory types
  - UnCacheable (UC)
  - Uncacheable Speculative Write-Combining (USWC or WC)
      - Uses a fill buffer entry as WC buffer
  - WriteBack (WB)
  - Write-Through (WT)
  - Write-Protected (WP)
      - E.g. supports copy-on-write in UNIX: saves memory space by allowing child processes to share pages with their parents, and only creates new memory pages when child processes attempt to write.
• Page Miss Handler (PMH)
  - Looks up MTRR while supplying physical addresses
  - Returns memory types and physical address to DTLB

Intel NetBurst Microarchitecture

• Pentium 4's microarchitecture, a post-P6 new generation
• Original target market: graphics workstations, but … the major competitor screwed up themselves…
• Design goals:
  - Performance, performance, performance, …
  - Unprecedented multimedia/floating-point performance
      - Streaming SIMD Extensions 2 (SSE2)
  - Reduced CPI
      - Low latency instructions
      - High bandwidth instruction fetching
      - Rapid execution of arithmetic & logic operations
  - Reduced clock period
      - New pipeline designed for scalability

Innovations Beyond P6

• Hyperpipelined technology
• Streaming SIMD Extension 2
• Enhanced branch predictor
• Execution trace cache
• Rapid execution engine
• Advanced Transfer Cache
• Hyper-threading Technology (in Xeon and Xeon MP)

Pentium 4 Fact Sheet

• IA-32 fully backward compatible
• Available at speeds ranging from 1.3 to ~3 GHz
• Hyperpipelined (20+ stages)
• 42+ million transistors
• 0.18 μm for 1.7 to 1.9 GHz; 0.13 μm for 1.8 to 2.8 GHz
• Die size of 217 mm²
• Consumes 55 watts of power at 1.5 GHz
• 400 MHz (850) and 533 MHz (850E) system bus
• 512KB or 256KB 8-way full-speed on-die L2 Advanced Transfer Cache (up to 89.6 GB/s @ 2.8 GHz to L1)
• 1MB or 512KB L3 cache (in Xeon MP)
• 144 new 128-bit SIMD instructions (SSE2)
• HyperThreading Technology (only enabled in Xeon and Xeon MP)
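The bandwidth figures follow from width times transfer rate. The sketch below checks the arithmetic; the 256-bit L2 path and the 64-bit system bus width are taken from the Pentium 4 block-diagram slide later in the deck, and the helper name is an illustrative assumption.

```python
def bandwidth_gb_per_s(bytes_per_transfer, transfers_per_second):
    """Peak bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return bytes_per_transfer * transfers_per_second / 1e9


# 256-bit (32-byte) L2-to-L1 path, one transfer per core clock.
l2_at_2_8ghz = bandwidth_gb_per_s(32, 2.8e9)
l2_at_1_5ghz = bandwidth_gb_per_s(32, 1.5e9)

# 64-bit (8-byte) system bus at the effective quad-pumped rates.
fsb_400 = bandwidth_gb_per_s(8, 400e6)
fsb_533 = bandwidth_gb_per_s(8, 533e6)
```

These come out at roughly 89.6 and 48 GB/s for the L2 path, and roughly 3.2 and 4.3 GB/s for the two bus speeds, which agrees with the numbers quoted on the slides.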

Recent Intel IA-32 Processors

Building Blocks of NetBurst

[Figure: NetBurst building blocks. Blocks shown: system bus, bus unit, level 2 cache, L1 data cache, fetch/decode, execution trace cache (ETC), μROM, BTB / branch prediction, OOO logic, INT and FP execution units, retire, and a branch history update path. Group labels: memory subsystem, front-end, out-of-order engine, execution units.]

Pentium 4 Microarchitecture

[Figure: Pentium 4 block diagram. Recoverable labels:
 Front end: BTB (4k entries), I-TLB/prefetcher, IA32 decoder, execution trace cache, trace cache BTB (512 entries), μcode ROM, μop queue.
 Out-of-order engine: allocator / register renamer; memory μop queue and INT / FP μop queue; memory scheduler, fast scheduler, slow/general FP scheduler, simple FP scheduler.
 Execution: INT register file / bypass network with two AGUs (load address, store address), two 2x ALUs (simple instructions), and a slow ALU (complex instructions); FP register file / bypass network with FP/MMX/SSE/SSE2 and FP move units.
 Memory: L1 data cache (8KB 4-way, 64-byte line, WT, 1 rd + 1 wr port); unified L2 cache (256KB 8-way, 128B line, WB) over a 256-bit path, 48 GB/s @ 1.5 GHz; BIU on a 64-bit path to the 64-bit quad-pumped system bus (400/533 MHz, 3.2/4.3 GB/sec).]

Pipeline Depth Evolution

P5 microarchitecture:
  PREF, DEC, DEC, EXEC, WB

P6 microarchitecture:
  IFU1, IFU2, IFU3, DEC1, DEC2, RAT, ROB, DIS, EX, RET1, RET2

NetBurst microarchitecture:
  TC NextIP, TC Fetch, Drive, Alloc, Queue, Rename, Schedule, Dispatch, Reg File, Exec, Flags, Br Ck, Drive

Execution Trace Cache

• Primary first-level I-cache to replace conventional L1
  - Decoding several x86 instructions at high frequency is difficult and takes several pipeline stages
  - Branch misprediction penalty is horrible
      - Lost 20 pipeline stages vs. 10 stages in P6
• Advantages
  - Cache post-decode μops
  - High bandwidth instruction fetching
  - Eliminate x86 decoding overheads
  - Reduce branch recovery time if TC hits
• Holds up to 12,000 μops
  - 6 μops per trace line
  - Many (?) trace lines in a single trace

Execution Trace Cache

• Delivers 3 μops per cycle to the OOO engine
• x86 instructions are read from L2 when the TC misses (7+ cycle latency)
• TC hit rate ~ that of an 8KB to 16KB conventional I-cache
• Simplified x86 decoder
  - Only one complex instruction per cycle
  - Instructions of more than 4 μops are executed by the microcode ROM (P6's MS)
• Performs branch prediction in TC
  - 512-entry BTB + 16-entry RAS
  - With BP in the x86 IFU, reduces mispredictions by 1/3 compared to P6
  - Intel did not disclose the details of the BP algorithms used in the TC and x86 IFU (dynamic + static)

Out-Of-Order Engine

• Similar design philosophy to P6; uses
  - Allocator
  - Register Alias Table
  - 128 physical registers
  - 126-entry ReOrder Buffer
  - 48-entry load buffer
  - 24-entry store buffer

Register Renaming Schemes

[Figure: P6 register renaming. The RAT (EAX, EBX, ECX, EDX, ESI, EDI, ESP, EBP) points into the 40-entry ROB, whose sequentially allocated entries hold both data and status; retired values live in the RRF.]

[Figure: NetBurst register renaming. A front-end RAT and a retirement RAT (EAX ... ESP, EBP) point into a 128-entry register file (RF) that holds the data; the 126-entry ROB, allocated sequentially, holds only status.]

Micro-op Scheduling

• μop FIFO queues
  - Memory queue for loads and stores
  - Non-memory queue
• μop schedulers
  - Several schedulers fire instructions to execution (P6's RS)
  - 4 distinct dispatch ports
  - Maximum dispatch: 6 μops per cycle (2 fast ALU μops each from Port 0 and Port 1 per cycle; 1 each from the load and store ports)

Exec Port 0
  Fast ALU (2x pumped): Add/sub, Logic, Store Data, Branches
  FP Move: FP/SSE Move, FP/SSE Store, FXCH
Exec Port 1
  Fast ALU (2x pumped): Add/sub
  INT Exec: Shift, Rotate
  FP Exec: FP/SSE Add, FP/SSE Mul, FP/SSE Div, MMX
Load Port
  Memory Load: Loads, LEA, Prefetch
Store Port
  Memory Store: Stores

Data Memory Accesses

• 8KB 4-way L1 + 256KB 8-way L2 (with a HW prefetcher)
• Load-to-use speculation
  - Dependent instructions are dispatched before the load finishes
      - Due to the high frequency and deep pipeline depth
  - Scheduler assumes loads always hit L1
  - On an L1 miss, dependent instructions that have already left the scheduler temporarily receive incorrect data: mis-speculation
  - Replay logic: re-execute the load when mis-speculated
  - Independent instructions are allowed to proceed
• Up to 4 outstanding load misses (= 4 fill buffers in original P6)
• Store-to-load forwarding buffer
  - 24 entries
  - Store and load must have the same starting physical address
  - Load data size <= store data size
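The two forwarding conditions listed above (same starting address, load no wider than the store) can be sketched as a lookup over recent stores. This is a simplified illustration: the class and method names are assumptions, a full buffer silently drops its oldest entry instead of stalling, and a load that only partially overlaps a store at a different starting address is treated as a miss.

```python
from collections import deque


class ForwardingBuffer:
    """Toy store-to-load forwarding buffer."""

    def __init__(self, capacity=24):
        # Assumption: the oldest store falls off when the buffer is full.
        self.stores = deque(maxlen=capacity)

    def add_store(self, addr, size, data):
        """Record a store of `size` bytes holding integer `data`."""
        self.stores.append((addr, size, data))

    def lookup(self, addr, size):
        """Return forwarded data for a load, or None if forwarding fails."""
        # Search youngest first; only the youngest store to this address counts.
        for s_addr, s_size, data in reversed(self.stores):
            if s_addr == addr:
                if size <= s_size:
                    # Little-endian: the load takes the low-order bytes.
                    return data & ((1 << (8 * size)) - 1)
                return None  # load is wider than the store: no forwarding
        return None
```

For instance, after a 4-byte store of `0xAABBCCDD` to address `0x100`, a 2-byte load from `0x100` is forwarded the low half, while an 8-byte load from the same address is not forwarded.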

Streaming SIMD Extension 2

• P-III SSE (Katmai New Instructions: KNI)
  - Eight 128-bit wide xmm registers (new architecture state)
  - Single-precision 128-bit SIMD FP
      - Four 32-bit FP operations in one instruction
      - Broken down into 2 μops for execution (only 80-bit data in ROB)
  - 64-bit SIMD MMX (uses 8 mm registers, mapped to the FP stack)
  - Prefetch (nta, t0, t1, t2) and sfence
• P4 SSE2 (Willamette New Instructions: WNI)
  - Supports double-precision 128-bit SIMD FP
      - Two 64-bit FP operations in one instruction
      - Throughput: 2 cycles for most SSE2 operations (exceptional examples: DIVPD and SQRTPD: 69 cycles, non-pipelined)
  - Enhanced 128-bit SIMD MMX using xmm registers

Examples of Using SSE

Packed SP FP operation (e.g. ADDPS xmm1, xmm2)
    xmm1:           X3        X2        X1        X0
    xmm2:           Y3        Y2        Y1        Y0
    xmm1 (result):  X3 op Y3  X2 op Y2  X1 op Y1  X0 op Y0

Scalar SP FP operation (e.g. ADDSS xmm1, xmm2)
    xmm1:           X3        X2        X1        X0
    xmm2:           Y3        Y2        Y1        Y0
    xmm1 (result):  X3        X2        X1        X0 op Y0

Shuffle FP operation (8-bit imm) (e.g. SHUFPS xmm1, xmm2, imm8)
    xmm1:           X3        X2        X1        X0
    xmm2:           Y3        Y2        Y1        Y0
    xmm1 (result):  Y3 .. Y0  Y3 .. Y0  X3 .. X0  X3 .. X0

Shuffle FP operation (8-bit imm) (e.g. SHUFPS xmm1, xmm2, 0xf1)
    xmm1:           X3        X2        X1        X0
    xmm2:           Y3        Y2        Y1        Y0
    xmm1 (result):  Y3        Y3        X0        X1
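The lane-wise behavior of these three instructions can be modeled on plain Python lists. This is a sketch of the data movement only: registers are lists of four elements with index 0 as the lowest lane (X0), the function names simply mirror the mnemonics, and floating-point rounding and exceptions are ignored.

```python
def addps(x, y):
    """Packed add: every lane of x is combined with the same lane of y."""
    return [a + b for a, b in zip(x, y)]


def addss(x, y):
    """Scalar add: only lane 0 is computed; lanes 1-3 of x pass through."""
    return [x[0] + y[0]] + list(x[1:])


def shufps(x, y, imm8):
    """Shuffle: the two low result lanes come from x, the two high from y.

    Each 2-bit field of imm8 selects a source lane.
    """
    sel = [(imm8 >> (2 * i)) & 3 for i in range(4)]
    return [x[sel[0]], x[sel[1]], y[sel[2]], y[sel[3]]]


x = ["X0", "X1", "X2", "X3"]
y = ["Y0", "Y1", "Y2", "Y3"]
shuffled = shufps(x, y, 0xF1)
```

For `0xf1` the selector fields are 1, 0, 3, 3 from low to high, so `shuffled` read from high lane to low lane is Y3, Y3, X0, X1.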

Examples of Using SSE and SSE2

SSE (four single-precision elements per xmm register)

  Packed SP FP operation (e.g. ADDPS xmm1, xmm2)
      xmm1 (result):  X3 op Y3  X2 op Y2  X1 op Y1  X0 op Y0

  Scalar SP FP operation (e.g. ADDSS xmm1, xmm2)
      xmm1 (result):  X3        X2        X1        X0 op Y0

  Shuffle FP operation (8-bit imm) (e.g. SHUFPS xmm1, xmm2, imm8)
      xmm1 (result):  Y3 .. Y0  Y3 .. Y0  X3 .. X0  X3 .. X0

SSE2 (two double-precision elements per xmm register)

  Packed DP FP operation (e.g. ADDPD xmm1, xmm2)
      xmm1:           X1        X0
      xmm2:           Y1        Y0
      xmm1 (result):  X1 op Y1  X0 op Y0

  Scalar DP FP operation (e.g. ADDSD xmm1, xmm2)
      xmm1:           X1        X0
      xmm2:           Y1        Y0
      xmm1 (result):  X1        X0 op Y0

  Shuffle DP operation (2-bit imm) (e.g. SHUFPD xmm1, xmm2, imm2)
      xmm1:           X1        X0
      xmm2:           Y1        Y0
      xmm1 (result):  Y1 or Y0  X1 or X0

HyperThreading

• In Intel Xeon Processor and Intel Xeon MP Processor
• Enables Simultaneous Multi-Threading (SMT)
  - Exploit ILP through TLP (Thread-Level Parallelism)
  - Issuing and executing multiple threads at the same snapshot
• Single P4 Xeon appears to be 2 logical processors
• Share the same execution resources
• Architectural states are duplicated in hardware

Multithreading (MT) Paradigms

[Figure: issue-slot diagrams (functional units FU1-FU4 across, execution time down) comparing five paradigms: conventional single-threaded superscalar, fine-grained multithreading (cycle-by-cycle interleaving), coarse-grained multithreading (block interleaving), chip multiprocessor (CMP), and simultaneous multithreading. Legend: Thread 1 to Thread 5, and unused slots.]

More SMT commercial processors

• Intel Xeon Hyperthreading
  - Supports 2 replicated hardware contexts: PC (or IP) and architecture registers
  - New directions of usage
      - Helper (or assisted) threads (e.g. speculative precomputation)
      - Speculative multithreading
• Clearwater (once called Xtream Logic): 8-context SMT "network processor" designed by DISC architect (company no longer exists)
• SUN 4-SMT-processor CMP?

Speculative Multithreading

• SMT can justify a wider-than-ILP datapath
• But the datapath is only fully utilized by multiple threads
• How to speed up a single-thread program by utilizing multiple threads?
• What to do with spare resources?
  - Execute both sides of hard-to-predict branches
      - Eager execution or polypath execution
      - Dynamic predication
  - Send another thread to scout ahead to warm up caches & BTB
      - Speculative precomputation
      - Early branch resolution
  - Speculatively execute future work
      - Multiscalar or dynamic multithreading
      - E.g. start several loop iterations concurrently as different threads; if a data dependence is detected, redo the work
  - Run a dynamic compiler/optimizer on the side
  - Dynamic verification
      - DIVA or Slipstream Processor
