1
Computer architectures M
Pentium IV-XEON
2
Pentium IV block scheme
32 bytes parallel
Four access ports to the EU
3
Fetch/Decode
I-TLB
Trace Cache
Rename/Allocation
µ-op queues
µ-code ROM
Trace cache BTB
Schedulers
FP Register File
FMul FAdd MMX
FP move FP store
Integer Register File
ALU ALU ALU ALU Load AGU Store AGU
L2 Cache and Control
L1 D-Cache and D-TLB
AGU: Address Generation Unit
BTB: Branch Target Buffer
I-TLB: Instruction TLB
D-TLB: Data TLB
3.2 GB/s interface
400 MHz FSB – 64-bit parallel bus – 8 bytes/transaction (up to 1600 MHz too)
Pentium IV block scheme (Netburst Architecture)
µ-ops are sent directly to the FU reservation stations and at the same time inserted into the ROB
BTB (CISC)
No L1 instruction cache!!!
4
General features
• Very high clock frequency (up to 4 GHz)
• Increased number of pipeline stages
• For the data, still an L1 cache (set associative – 8 KB, 4-way – 64-byte lines, half the line width of the L2)
• L2 cache: unified – 256 KB, 8-way associative, 256-bit parallel (32 bytes in parallel), 128-byte lines. One transfer on each clock edge (32 bytes – ¼ of the line). Bandwidth = 256 bit × 2 × fclock
• ALU clock frequency double that of the rest of the CPU
• FSB 1.6 GHz
• ROB: 126 µ-ops
• BTB: 4K entries
• No L1 instruction cache

But:
• the increased stage number increases the branch penalty
• the increased clock frequency sometimes requires multiple cycles for the ALU operations, and therefore delays for the waiting instructions
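The L2 bandwidth formula above (256 bit × 2 × fclock) can be sanity-checked with a few lines of Python (the 1.4 GHz value is just one of the shipping frequencies):

```python
# L2 cache bandwidth: 256-bit bus, one 32-byte transfer on each clock edge.
bus_width_bytes = 256 // 8          # 32 bytes per transfer
transfers_per_clock = 2             # positive and negative edges
f_clock_hz = 1.4e9                  # example core frequency (1.4 GHz part)

bandwidth = bus_width_bytes * transfers_per_clock * f_clock_hz
print(f"{bandwidth / 1e9:.1f} GB/s")  # 89.6 GB/s at 1.4 GHz
```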
5
Trace Cache
• The trace cache has two modes of operation. In «execute mode» it provides the pipeline stages with µ-ops. In case of a miss, it retrieves the needed instructions from L2, interprets the branches (through the BTB), retrieves the instructions speculatively, decodes them and builds a «segment» of µ-ops internally. Two advantages: no bubble for branches already predicted, and no waste of cache lines (in normal caches the bytes following a branch are not used, because the fetch stops when a branch is encountered, while here a line can contain both the branch and the speculated code). Obviously there is always the possibility of mispredictions.
• Since in the majority of cases the instructions are already decoded, in the Pentium IV there are not three decoders but only one, because theoretically the need for decoding is reduced.
• In the trace cache there are in general 6 µ-ops per line.
• This architecture is a member of the architectures with «predecoding», where the I-cache holds decoded instructions (e.g. AMD Kx, Nehalem etc.).
• For complex instructions (>4 µ-ops), instead of filling the trace cache, a tag is inserted which triggers the decoding ROM; the ROM provides the µ-ops whenever such instructions are encountered. Not much change for the efficiency (1 clock delay): the pipeline is kept active in any case.
• Trace cache size: about 12K µ-ops (equivalent – for Intel – to a 16-18K instruction cache)

BUT…
• The trace cache is «fragile». In case of a miss the decoding is made one instruction at a time (a single decoder!); it was verified that the hit rate was less than 60%, and in 40% of the cases the system behaves as if only one instruction at a time were processed. The trace cache was therefore later discarded.

NB: there is obviously a second BTB for the trace cache.
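The two modes can be sketched with a toy model (hypothetical code, for illustration only — the class and function names are invented): a lookup that hits streams the stored µ-op segment, while a miss falls back to building a segment through the single decoder.

```python
# Toy model of the two trace-cache modes: "execute mode" streams cached
# µ-op segments; on a miss the front end builds a new segment by decoding
# (branches followed speculatively, so a segment can span a taken branch).

class TraceCache:
    def __init__(self):
        self.segments = {}          # start address -> list of µ-ops

    def lookup(self, addr):
        # Execute mode: return the stored segment, or None on a miss.
        return self.segments.get(addr)

    def build_segment(self, addr, decode):
        # Build mode: decode instructions and store the resulting segment.
        segment = decode(addr)
        self.segments[addr] = segment
        return segment

def decode_stub(addr):
    # Stand-in for the single x86 decoder: ~6 µ-ops per trace line.
    return [f"uop@{addr:#x}+{i}" for i in range(6)]

tc = TraceCache()
assert tc.lookup(0x1000) is None            # miss: enter build mode
seg = tc.build_segment(0x1000, decode_stub)
assert tc.lookup(0x1000) == seg             # hit: execute mode
```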
6
It was said
• Intel increased the clock frequency (and therefore the number of pipeline stages) only for marketing reasons…
• By reducing the transistor size, the clock speed can be increased, but the functionality could be improved too. This was not the case with the PIV.
7
Technological characteristics
Technology: 0.13 micron (130 nm) – 217 mm² – 42 million transistors – socket 423 – power supply 1.7 V @ 32 A – 52 W at 1.4 GHz (much greater at 4 GHz! – a great cooling problem)
8
Technological characteristics
9
Socket 423
Technological characteristics
10
Execution core
Arithmetic section
Memory section (load and store)
The dispatch ports (electrical paths to the FUs) are only 4 (ports 0-3)
Simple execution unit
Complex execution unit
RAT
(it is always possible to imagine that the instructions are sent directly to the ports with a reference number, and after execution to the ROB)
Ports 2 and 3
11
The PIV has 2 (fast) × 2 (µ-ops/clock) + 1 (slow) ALU = 5 virtual ALUs!
Execution core
• Because of the increased clock frequency the RS should have been enlarged to avoid underrun: Intel preferred to add the trace cache.
• Here too, 3 µ-ops per clock are extracted. The ROB is much greater (126 slots).
• Only 4 dispatch ports (ports 0-3 against the 5 ports of the P6).
• Up to 6 µ-ops per clock are sent to the 4 ports (possible because the fast ALUs run at twice the clock speed – positive and negative edges: 2 µ-ops × 2 fast ALUs + 1 slow ALU + 1 LD/ST).
• Execution port 0: integer additions and subtractions, logical operations, branch evaluation and store of data into the ROB (positive and negative edges); floating-point and SSE moves.
• Execution port 1: fast integer ALU (only additions and subtractions); slow ALU (complex integer operations, e.g. shifts and rotations, which cannot be executed in a single clock); FP and SSE.
• Execution ports 2 and 3: load and store.
12
Branch Prediction
• As in the P6 there are two branch predictors: one dynamic and one static.
• The static predictor is very simple:
– conditional forward branch: predicted not taken
– conditional backward branch: predicted taken
• The BTB consists of 4096 entries (8-way set associative) plus an algorithm kept secret by Intel. In any case it is not enough if there are many nested loops.
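The static rule is trivial to express: a conditional branch whose target lies at a lower address (backward, typically a loop-closing branch) is predicted taken, otherwise not taken. A minimal sketch:

```python
def static_predict(branch_addr, target_addr):
    """Static rule: backward conditional branches (loops) are predicted
    taken, forward conditional branches are predicted not taken."""
    return target_addr < branch_addr   # True = predicted taken

# A loop's closing branch jumps backward -> predicted taken.
assert static_predict(0x2010, 0x2000) is True
# A forward skip (e.g. over an error path) -> predicted not taken.
assert static_predict(0x2010, 0x2040) is False
```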
13
Pentium IV Pipeline – Stages 1-5
TC Nxt IP – Trace cache next instruction pointer (and branch verification)
TC Fetch – Trace cache fetch
Drive – driving stage (present only for electrical reasons, to allow signal propagation at fclock). Many stages require two clocks because of the high clock frequency (for instance the transfer from the EU to the ROB).
At the end of the first 5 stages the PIV inserts three µ-ops into a µ-op queue (in order), for speed compensation.
14
Stages 6-12
Scheduler – µ-op scheduling (µ-ops are executed only when their operands are ready). Each scheduler (10-12 µ-ops) selects the µ-ops to send to its ports (in order):
• Memory Scheduler – Load/Store Unit (LSU)
• Fast ALU Scheduler – Arithmetic-Logic Unit (simple integer and logical operations)
• Slow ALU/General FPU Scheduler – other ALU functions and the majority of the floating point
• Simple FP Scheduler – simple FP operations and FP with memory
128 integer registers, 128 floating-point registers
Port: bus toward the functional units
Pentium IV Pipeline
Allocator/Register renaming – internal register allocation (128 × 2: 128 integer and 128 FP, dynamically allocated) and ROB insertion. If LOAD or STORE => buffer register allocation and reservation for the FU. RAT: 128 integer registers, 128 FP, and 128 VPR (vector proc./FP).
ROB queue – µ-ops (126 slots). The slots are in two sets: one for LOAD/STORE and the other for all the other µ-ops.
15
Pentium IV RAT
[RAT diagram: architectural registers EAX, EBX, ECX, EDX, ESI, EDI, ESP, EBP mapped onto physical registers 0…127]
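The RAT's job — mapping each architectural register to a freshly allocated physical register on every write, so that WAW/WAR dependences disappear — can be sketched as follows (toy model, not Intel's implementation):

```python
# Sketch of register renaming with a RAT: each write to an architectural
# register gets a fresh physical register; readers see the latest mapping.
from itertools import count

class RAT:
    def __init__(self):
        self.free = count()               # simplistic free list: 0, 1, 2, ...
        self.map = {}                     # architectural reg -> physical reg

    def rename_dest(self, arch_reg):
        # A new destination always gets a new physical register,
        # which removes write-after-write and write-after-read hazards.
        phys = next(self.free)
        self.map[arch_reg] = phys
        return phys

    def rename_src(self, arch_reg):
        return self.map.get(arch_reg)

rat = RAT()
p0 = rat.rename_dest("EAX")       # mov eax, 1
p1 = rat.rename_dest("EAX")       # mov eax, 2 -> a different physical reg
assert p0 != p1
assert rat.rename_src("EAX") == p1  # later readers see the newest value
```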
16
Pentium IV Pipeline – Stages 13-16
Dispatcher (13-14): two stages; the (ready) µ-ops are sent to one of the 4 ports (2×2 + 1 + 1 = 6 – the execution units use positive and negative clock edges).
RF (15-16): register file. The required data are retrieved from the sources and loaded into the EU registers.
(SSE = Streaming SIMD Extension)
17
Pentium IV Pipeline – Stages 17-20
Ex – Execute (stage 17)
Flags – execution result flags
BrChk – Branch Check: when mispredicted, 19 stages back!
Drive – rewrite and drive
The 3.5-4 GHz Pentium IV has a 31-stage pipeline!!
18
Instruction sequence – BTB and I-TLB
[Pentium IV block diagram repeated: Fetch/Decode (ROM), Trace Cache, Rename/Allocation, µ-op queues, Schedulers, register files, ALUs/AGUs, L2 Cache and Control, L1 D-Cache and D-TLB]
The instructions are retrieved from the L2 (a 128-byte line = 2 × 64 bytes; 2 × 256-bit transfers = 64 bytes per clock period, because the bus uses both clock edges). The memory address comes either from the BTB (for branches) or from the I-TLB (4096 entries, 4-way associative).
In the trace cache, µ-ops are 118 bits long.
19
Instruction sequence – BTB and I-TLB
[Pentium IV block diagram repeated]
BTB: 512 lines, 8-way set associative – 4-bit Yeh predictor.
Static prediction: backward taken, forward untaken.
20
Instruction sequence – Trace cache
[Pentium IV block diagram repeated]
21
Instruction sequence – Rename/Allocation (Drive)
[Pentium IV block diagram repeated]
Allocation: 3 µ-ops per clock edge. Register allocation, integer or FP (2 × 128).
22
Instruction sequence – Schedulers
[Pentium IV block diagram repeated]
Two schedulers, double FIFO queue: in order within a queue, OOO between the two queues.
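The "in order within a queue, OOO between the queues" policy can be sketched with a toy model (hypothetical code; the queue contents are invented): only the head of each queue is a candidate, so each queue stays in order, but a ready head in one queue may overtake a stalled head in the other.

```python
from collections import deque

# Toy model: two µ-op queues feeding the schedulers. Within one queue
# µ-ops issue strictly in order; across the two queues they may issue
# out of order (whichever queue's head is ready goes first).

def issue(q_mem, q_alu, ready):
    """Issue the head of each queue whose head µ-op is ready."""
    issued = []
    for q in (q_mem, q_alu):
        if q and ready(q[0]):
            issued.append(q.popleft())
    return issued

q_mem = deque(["load A", "store B"])
q_alu = deque(["add", "sub"])
# "load A" is waiting on the cache; "add" is ready -> the ALU queue's
# head overtakes the memory queue (OOO between the queues).
first = issue(q_mem, q_alu, ready=lambda u: u != "load A")
assert first == ["add"]
assert q_mem[0] == "load A"   # still in order within its own queue
```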
23
Instruction sequence – Schedulers and dispatch
[Pentium IV block diagram repeated]
µ-ops are sent to the FUs as these become available; up to 6 µ-ops per clock. µ-ops addressed to the same FU issue in FIFO order.
The schedulers handle integer and FP µ-ops, mispredicted branches and MMX.
24
Instruction sequence – Register files
[Pentium IV block diagram repeated]
Access to the integer and FP register files and insertion into the ROB; access to the data cache.
25
Instruction sequence – L1 D-Cache and D-TLB
[Pentium IV block diagram repeated]
D-TLB: 64 entries, fully associative. L1: 8 KB, 4-way associative, 64-byte lines, write-through, non-blocking (out-of-order loads!).
26
Instruction sequence – Branch check
[Pentium IV block diagram repeated]
Branch check: a stage for branch verification and match against the prediction; it updates the trace cache BTB and the CISC BTB predictions.
27
Complexity vs efficiency
Cost and power consumption increase more than proportionally with the efficiency gains (superscalar execution, branch prediction, OOO execution, caches, clock frequency increases).
A 5-fold increase of the efficiency required a 15-fold increase of the complexity and an 18-fold increase of the power consumption (normalized against 486 performance = 1).
28
Multiprocessor
Modern processors execute threads (which belong either to the same program, or to different programs, or to the OS)…
…and the transistor count keeps increasing. But at pipeline and FU level there can be inefficiencies (for instance a program or a thread does not use a FU or a pipeline stage in a given clock period).
29
Pentium IV
[Diagram: two logical processors (LP) sharing one bus]
Hyperthreading – LP: Logical Processor
Threading is the ability to execute multiple instruction sequences at the same time (see Java), belonging either to the same program or to different ones (similar to a multiprocessor). In previous processors this required multiple physical processors.
Starting from the Pentium IV, the architecture allows the definition of two «logical» processors, each with its own register set and able to interface with the bus.
Each logical processor can execute its own instruction stream, be interrupted, stop (HALT) etc. Each thread is executed OOO.
30
Hyperthreading
Obviously multiple threads can be executed on conventional processors too, with context switches and time-slice policies.
Hyperthreading, on the contrary, allows the execution of two threads without context switches.
31
Hyperthreading: implications
The executed processes behave as if they were executed on different processors.
The processor must maintain a copy of the architectural state of each logical processor, while both use the same hardware resources (in practice instructions and operations must be «tagged», that is, they carry a tag indicating the logical processor they belong to).
The architectural state consists of all the general registers and of the machine registers.
The two logical processors share the caches (physically addressed), the BTB (like the caches), the FUs and the control circuits (but NOT the TLB).
Loosely speaking, hyperthreading can be thought of as pipeline-level multiprogramming.
The die complexity increase is about 5%.
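The «tagging» idea can be sketched in a few lines (toy model; the structures and names are invented for illustration): shared entries carry the logical-processor id, so one physical pipeline keeps the two threads' state apart.

```python
# Sketch of thread tagging in a shared structure (e.g. the ROB): every
# entry carries the logical-processor (LP) tag so the two threads can
# share one physical pipeline without mixing their state.
from dataclasses import dataclass

@dataclass
class Uop:
    op: str
    lp: int          # logical processor (thread) tag: 0 or 1

rob = [Uop("add", 0), Uop("load", 1), Uop("mul", 0)]

def retire_thread(rob, lp):
    """Collect (in order) only the µ-ops belonging to one thread."""
    return [u.op for u in rob if u.lp == lp]

assert retire_thread(rob, 0) == ["add", "mul"]
assert retire_thread(rob, 1) == ["load"]
```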
32
Performance
A 20% efficiency increase against a 5% complexity increase. In general, the efficiency increase is between 16% and 28%.
34
Xeon
Design criteria
• The efficiency when only one thread is executed must be the same as that of the processor without hyperthreading. This implies that the resources must be recombined.
• Strict limit on the die size increase (less than 5%).
• The stall of a «virtual» processor (cache misses, branch prediction errors, hazards etc.) must not block the other processor: there are queues between the stages, and no virtual processor can use 100% of the resources.
XEON pipeline
[Diagram: Thread 1 and Thread 2 sharing the trace cache, the Pentium IV physical registers and the classical pipeline]
35
Xeon
Two distinct sets of registers: GPRs, segment registers, flags, EIP, x87, MMX, SSE, TR, machine registers etc. The RAT registers are obviously replicated.
Distinct instruction pointers; distinct I-TLBs (two different paging systems for different processes).
Trace cache hit: the trace cache, common to the two threads (physical addresses), is equally subdivided per thread (the µ-ops are «tagged», to avoid replacing too many µ-ops of one thread). In the figure, the yellow and red colours refer to the two threads. Common stages, partitioned queues; then on to the allocator.
Trace cache miss: µ-code ROM. The decoding does not alternate per clock but per instruction group.
36
Xeon – Execution Trace Cache
In case of overlapping access requests: arbitration and alternate assignment per clock. If a thread is stalled, the other executes regularly.
The TC contains 12K µ-ops and is 8-way set associative. All entries are «tagged» by thread (replacement per thread).
Precise LRU replacement (implemented as a shift register).
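Exact (precise) LRU for a set can be sketched as an ordered list that behaves like the shift register mentioned above (toy model): a hit shifts the touched way to the most-recent end, and the victim is always the way at the least-recent end.

```python
# Sketch of exact LRU for one cache set, kept as an ordered list
# ("shift register"): most recently used at the tail, victim at the head.

def touch(order, way):
    """On a hit, move the way to the most-recently-used position."""
    order.remove(way)
    order.append(way)

def victim(order):
    """The least recently used way is the replacement victim."""
    return order[0]

order = [0, 1, 2, 3]         # 4 ways, initially in fill order
touch(order, 0)              # hit on way 0
assert victim(order) == 1    # way 1 is now the LRU victim
```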
The L1 (data), L2 and L3 (if any) caches are unified (physical addresses).
L1 (data cache) is 4-way associative, very fast, with 64-byte lines, and is write-through to L2.
There is a common D-TLB for L1, 4-way, with «tagged» entries. Pages are 4 KB or 4 MB.
L2 and L3 are 8-way associative with 128-byte lines.
Arbitration for L2 access is FCFS, with a request queue in which at least one slot per thread is reserved.
The L2 output queues are distinct per thread, each with 64-byte slots.
The BTB is partially shared (physical addresses). The branch history buffers are distinct; the PHT is unique, with tagged entries. The RSBs are obviously distinct (12 slots per thread).
Unified caches can trigger conflicts but also benefits (e.g. instructions prefetched by one thread could be used by the other, or a thread can use data produced by the other).
37
If µ-ops are available from both threads, the service alternates at each clock. If a thread has reached its limit level, it is stalled: the 50% limit is never violated. By so doing «fairness» is granted and deadlocks are avoided.
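The 50% rule can be sketched as a partitioned allocator (toy model; the 48-entry figure is the load-buffer count from these slides, and the function name is invented): a thread that hits its half simply stalls, and the other thread is never starved.

```python
# Sketch of partitioned allocation: each thread may occupy at most half
# of a shared buffer pool, so one stalled or greedy thread can never
# starve the other (fairness, no deadlock).

def try_alloc(used, lp, total=48):
    """Allocate one slot for logical processor lp if its half is free."""
    if used[lp] < total // 2:
        used[lp] += 1
        return True
    return False    # this thread is at its 50% limit: it stalls, not the other

used = {0: 0, 1: 0}
for _ in range(30):
    try_alloc(used, 0)      # thread 0 keeps requesting slots
assert used[0] == 24        # capped at half of the 48 slots
assert try_alloc(used, 1)   # thread 1 still allocates freely
```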
[Diagram: from the trace cache – split-clock access – out-of-order pipeline]
As in the Pentium IV, there is a microcode ROM for complex instructions.
Xeon – Decode
The decode logic alternates between the queues of the two threads and maintains two copies of the necessary information.
Allocator
It assigns the u-ops to the 128 slots of the reorder buffer (tagged – for the retirement in order), the integer and 128 floating registers, the load (48) and store (24) buffers. Buffer are partitioned so that each thread can use only 50%