36
1 Computer architectures M Pentium IV-XEON

1 Computer architectures M Pentium IV-XEON. 2 Pentium IV block scheme 32 bytes parallel Four access ports to the EU

Embed Size (px)

Citation preview

Page 1: 1 Computer architectures M Pentium IV-XEON. 2 Pentium IV block scheme 32 bytes parallel Four access ports to the EU

1

Computer architectures M

Pentium IV-XEON

Page 2: 1 Computer architectures M Pentium IV-XEON. 2 Pentium IV block scheme 32 bytes parallel Four access ports to the EU

2

Pentium IV block scheme

32 bytes parallel

Four access ports to the EU

Page 3: 1 Computer architectures M Pentium IV-XEON. 2 Pentium IV block scheme 32 bytes parallel Four access ports to the EU

3

Fetch/Decode

I-TLB

Trace Cache

Rename Allocation

m-ops Queues

m-codeROM

Trace cache BTB

Schedulers

FP Register File

FMulFaddMMX

FP moveFP store

Integer Register File

ALU ALU ALU ALULoadAGU

StoreAGU

L2 C

ache an

d C

ontrol

L1 D-Cache and D-TLB

AGU Address Generation UnitBTB Branch Target BufferI-TLB Instruction TLBD-TLB Data TLB

3.2 GB/s Interfaccia

400> MHz FSBParall. Bus 64 bit

8 Bytes/transaction(up to 1600 MHz too)

AGUAddress

GenerationUnit

Pentium IV block schemeNetburst Architecture

u-ops are directly sent to the FU RS and at the same time inserted into the ROB

BTBCISC

No instruction cache I !!!

Page 4: 1 Computer architectures M Pentium IV-XEON. 2 Pentium IV block scheme 32 bytes parallel Four access ports to the EU

4

General features

But• Increased stage number increases the branch penalty • Increased clock frequency requires sometimes multiple cycles for the

ALU operations and therefore delay for the waiting instructions

• Very high clock frequency (up to 4GHz)

• Increased pipeline stages number

• For the data still I level cache (set associative – 8 Kbyte four ways – 64 bytes line, haks the line width of the II level)

• II level cache: unified - 256KB 8 ways associative 256 bit parallel (32 bytes in parallel) 128 Bytes lline. One transfer for each clock edge (32 bytes - ¼ of the line). Bandwith 256 bit x 2 x fclock

• CPU clock frequency double of the rest of the CPU

• FSB 1,6 GHz

• ROB: 126 -m ops

• BTB: 4K entries.

• No L1 instruction cache

Page 5: 1 Computer architectures M Pentium IV-XEON. 2 Pentium IV block scheme 32 bytes parallel Four access ports to the EU

5

Trace Cache

BUT….

• The trace cache is “fragile”. In case of di miss the decoding is made one instruction at a time (a single decoder !) and it was verified that the hit rate was lee than 60% and in the 40%of the cases the system behaves as if only instruction at a time were processed. The trace cache was therefore later discarded.

NB: There is obviously a second BTB per the trace cache

•The trace cache has a twofold ways of operation. In «execute mode» provides the pipeline stages with the -m ops. In case of miss, it retrieves the needed instructions from L2, interpretes the branches (through the BTB), retrieves the instructions speculatively, decodes the instructions and generates a «segment» of -m ops inside. Two advantages: no bubble for the branches already predicted and no waste of cache lines (in the normal caches the bytes following a branch are not used because the fetch stops when a branch is encountered while here a line can contain both the branch and the speculated code). Obviously there is always the possibility of mispredictions.

•Since in the majority of cases the instructions are already decoded in the Pentium IV there are are no three decoders but only one because theoretically the need for decoding is reduced.

• In the trace cache there are in general 6 u-ops per line

•This architecture is member of the architectures with «predecoding» where in the I-cache decoded instructions (i.e. AMD Kx , Nehalem etc….)

• For complex instructions (>4 mops) instead of filling the trace cache, a tag is inserted which triggers the decoding ROM which provides the mops whenever such instructions are encountered. No much change for the efficiency: (1 clock delay): the pipeline is in any case activated..

•Trace cache size: about 12K u-ops (equivalente – for Intel – to a 16-18K instruction cache)

Page 6: 1 Computer architectures M Pentium IV-XEON. 2 Pentium IV block scheme 32 bytes parallel Four access ports to the EU

6

It was said

• Intel ha increased the clock frequency (and therefore the number of pipelines stages) only for marketing reasons…

• By reducing the transistors size the clock speed can be increased but the functionality too could be improved. This was not the case with the PIV

Page 7: 1 Computer architectures M Pentium IV-XEON. 2 Pentium IV block scheme 32 bytes parallel Four access ports to the EU

7

Technological characteristics

Tecnology 0.13 micron (130 nm) 217 mm2 – 42 millions of di transistors 423 pin socket 423 Power supply 1.7V@32A 52 Watts a 1.4 GHz –

much greater at 4 GHz ! – great cooling problem

Page 8: 1 Computer architectures M Pentium IV-XEON. 2 Pentium IV block scheme 32 bytes parallel Four access ports to the EU

8

Technological characteristics

Page 9: 1 Computer architectures M Pentium IV-XEON. 2 Pentium IV block scheme 32 bytes parallel Four access ports to the EU

9

Socket 423

Technological characteristics

Page 10: 1 Computer architectures M Pentium IV-XEON. 2 Pentium IV block scheme 32 bytes parallel Four access ports to the EU

10

Execution core

Arithmeticsection

Memory section(load e store)

The dispatch ports (electrical paths to the FUs) are only 4 (0-3)

Simple executionUnit

Complex executionUnit

RAT

(it is always possible to imagine that the instructions are sent directly to the ports with a reference number and after execution to the ROB)

Port 2 e 3

Page 11: 1 Computer architectures M Pentium IV-XEON. 2 Pentium IV block scheme 32 bytes parallel Four access ports to the EU

11

PlV has 2 (fast) x 2 (u-ops/clock) +1 (slow) ALU = 5 virtual ALUs!

Execution core

• Because of the increased clock frequency the RS should have been increased to avoid the underrun : Intel preferred to add the trace cache.

• Here too3 -m ops per clock are extracted. ROB size much greater (126 slots)

• Only 4 dispatch ports (port 0-3 against the 5 ports of P6)

• Up to 6 -m ops per clock are sent to the 4 ports (possible because the speed is twice that of the clock– positive and negative edges - 2x2clock to high speed ALU + 1 slow ALU + 1LD/ST)

• Execution port 0: Integer additions and subtractions, logical operations. Branch evaluation

and store data into the ROB (positive and negative edges) Floating point and SSE set move

• Execution port 1 Fast integer ALU (only additions and subtractions) Slow ALU ALU (complex integer operations. i.e. shift and rotations which

cannot be executed in a single clock) FP e SSE

• Execution port 2 e 3 Load e Store

Page 12: 1 Computer architectures M Pentium IV-XEON. 2 Pentium IV block scheme 32 bytes parallel Four access ports to the EU

12

Branch Prediction

• The static predictor is very simple

• branch conditional forward not taken• branch conditional backward taken

• As for P6 there are two branch predictors: one dynamic ad one static

• The BTB consists of 4096 entries (8 ways set associative) and a secret released by Intel. In any case it is not enough if there are many nested loops

Page 13: 1 Computer architectures M Pentium IV-XEON. 2 Pentium IV block scheme 32 bytes parallel Four access ports to the EU

13

TC Nxt IP Trace cache next instruction pointer (and branch verification)TC Fetch Trace cache fetchDrive Driving stage (only for electronic reasons to allow the signals

propagaition solo elettronica because of fclock ). Many stages requires two clocks because of the high clock frequency (for instance the transfer from the EU to the ROB)

Pentium IV Pipeline

At the end of the first 5 stages PIV inserts three u-ops in a u-ops queue (in order) for speed compensation

Stages 1-5

Page 14: 1 Computer architectures M Pentium IV-XEON. 2 Pentium IV block scheme 32 bytes parallel Four access ports to the EU

14

Scheduler -m ops scheduling (m-ops executed only when operands ready Each scheduler (10-12 u-ops) selects the u-ops to sent to the relative ports (in order)

Memory Scheduler - Load/Store Unit (LSU). Fast ALU Scheduler - Arithmetic-Logic Unit (simple integer and

logical operations) Slow ALU/General FPU Scheduler – other ALU functions and the

majority of the floating-point.  Simple FP Scheduler – simple FP operations and FP with memory

Stages 6-12

128Integer registers

128Floating

point registers

Port: bus verso le Un. Fun.

Pentium IV Pipeline

Allocator/Register renaming Internal registers allocation (128 x 2 - 128 integer and 128 FP – dynamically allocated ) and ROB insertion. If LOAD or STORE => buffer register allocation and reservation for the FU. RAT: 128 integer registers, 128 FP, and 128 VPR (vector proc./FP)

Queue ROB – mops (126 slots). The slots number is double: one set LOAD/STORE and the other for the others

Page 15: 1 Computer architectures M Pentium IV-XEON. 2 Pentium IV block scheme 32 bytes parallel Four access ports to the EU

15

Pentium IV RAT

EAXEBXECXEDXESIEDIESPEBP

0 012..............................127

RAT

Page 16: 1 Computer architectures M Pentium IV-XEON. 2 Pentium IV block scheme 32 bytes parallel Four access ports to the EU

16

Dispatcher (13-14) two stages: the -m ops (ready) are sent to one of the 4 ports (2x2) +1 + 1 = 6 – the execution units use positive and negative clock edges)

RF (15-16) Register file. The required data are retrieved from the sources and loaded into the EU registers

Pipeline Pentium IV

(SSE = Streaming SIMD Extension)

Page 17: 1 Computer architectures M Pentium IV-XEON. 2 Pentium IV block scheme 32 bytes parallel Four access ports to the EU

17

Flags Execution resultBrChk Branch Check: when mispredicted 19 stadi back !Drive Rewrite and drive!

Ex Execute (17)

Stages18-20

Pipeline Pentium IV

3,5-4 GHZ Pentium IV has a 31 stages pipeline !!

Page 18: 1 Computer architectures M Pentium IV-XEON. 2 Pentium IV block scheme 32 bytes parallel Four access ports to the EU

18

3.2 GB/s Interfaccia

Fetch/Decode (ROM)

Trace Cache

Rename Allocation

m-ops Queues

m-codeROM

BTB512

Schedulers

FP Register File

FMulFaddMMX

FP moveFP store

Integer Register File

ALU ALU ALU ALULoadAGU

StoreAGU

L2 C

ache an

d C

ontrol

L1 D-Cache and D-TLB

The instructions are retrieved from the L2(128=2 x 64(=256 bit) Bytes – a single clock period because the bus uses both clock edges ) – Memory address either from BTB (branch) or I-TLB (4096 entries- 4 ways associative)

Instruction sequence

2x 256 bit64 bytes

Trace cache BTB

BTBCISC

BTB and I-TLB

m-ops 118 bit long in the trace cache.

Page 19: 1 Computer architectures M Pentium IV-XEON. 2 Pentium IV block scheme 32 bytes parallel Four access ports to the EU

19

3.2 GB/s Interfaccia

BTB and I-TLB

Fetch/Decode

Trace Cache

Rename Allocation

m-ops Queues

m-codeROM

BTB

Schedulers

FP Register File

FMulFaddMMX

FP moveFP store

Integer Register File

ALU ALU ALU ALULoadAGU

StoreAGU

L2 C

ache an

d C

ontrol

L1 D-Cache and D-TLB

Instruction sequence

BTB512

BTBCISC

BTB: 8 vie 512 lines 8 ways associative – 4 bit Yeh predictionStatic prediction:

Taken backwardUntaken forward

BTB512

Trace cache BTB

Page 20: 1 Computer architectures M Pentium IV-XEON. 2 Pentium IV block scheme 32 bytes parallel Four access ports to the EU

20

3.2 GB/s Interfaccia

BTB & I-TLB

Fetch/Decode

Trace Cache

Rename Allocation

m-ops Queues

m-codeROM

BTB

Schedulers

FP Register File

FMulFaddMMX

FP moveFP store

Integer Register File

ALU ALU ALU ALULoadAGU

StoreAGU

L2 C

ache an

d C

ontrol

L1 D-Cache and D-TLB

Trace cache BTB

BTBCISC

Instruction sequence

Page 21: 1 Computer architectures M Pentium IV-XEON. 2 Pentium IV block scheme 32 bytes parallel Four access ports to the EU

21

Allocation; 3 m-ops per clock edgeRegister allocation integer or FP (2X128)

3.2 GB/s Interfaccia

BTB & I-TLB

Fetch/Decode

Trace Cache

Rename Allocation

m-ops Queues

m-codeROM

Schedulers

FP Register File

FMulFaddMMX

FP moveFP store

Integer Register File

ALU ALU ALU ALULoadAGU

StoreAGU

L2 C

ache an

d C

ontrol

L1 D-Cache and D-TLB

(Drive)

BTBCISC

BTB512

Trace cache BTB

Instruction sequence

Page 22: 1 Computer architectures M Pentium IV-XEON. 2 Pentium IV block scheme 32 bytes parallel Four access ports to the EU

22

3.2 GB/s Interfaccia

BTB & I-TLB

Fetch/Decode

Trace Cache

Rename Allocation

m-ops Queues

m-codeROM

Schedulers

FP Register File

FMulFaddMMX

FP moveFP store

Integer Register File

ALU ALU ALU ALULoadAGU

StoreAGU

L2 C

ache an

d C

ontrol

L1 D-Cache and D-TLB

Two schedulers double FIFO queueIn order within a queue OOO between the two queues.

BTBCISC

BTB512

Trace cache BTB

Instruction sequence

Page 23: 1 Computer architectures M Pentium IV-XEON. 2 Pentium IV block scheme 32 bytes parallel Four access ports to the EU

23

3.2 GB/s Interfaccia

BTB & I-TLB

Fetch/Decode

Trace Cache

Rename Allocation

m-ops Queues

m-codeROM

Schedulers

FP Register File

FMulFaddMMX

FP moveFP store

Integer Register File

ALU ALU ALU ALULoadAGU

StoreAGU

L2 C

ache an

d C

ontrol

L1 D-Cache and D-TLB

m-ops to the FU when available. Up to 6 m-ops per clock. If addressed to the same FU FIFO order

The scheduler handles integer and FP u- ops, mispredicted branches and MMX

BTBCISC

BTB512

Trace cache BTB

Instruction sequence

Page 24: 1 Computer architectures M Pentium IV-XEON. 2 Pentium IV block scheme 32 bytes parallel Four access ports to the EU

24

3.2 GB/s Interfaccia

BTB & I-TLB

Fetch/Decode

Trace Cache

Rename Allocation

m-ops Queues

m-codeROM

Schedulers

FP Register File

FMulFaddMMX

FP moveFP store

Integer Register File

ALU ALU ALU ALULoadAGU

StoreAGU

L2 C

ache an

d C

ontrol

L1 D-Cache and D-TLB

Access to integer RF and FP and insertion into the ROBAccess to the data cache

BTBCISC

BTB512

Trace cache BTB

Instruction sequence

Page 25: 1 Computer architectures M Pentium IV-XEON. 2 Pentium IV block scheme 32 bytes parallel Four access ports to the EU

25

3.2 GB/s Interfaccia

BTB & I-TLB

Fetch/Decode

Trace Cache

Rename Allocation

m-ops Queues

m-codeROM

Schedulers

FP Register File

FMulFaddMMX

FP moveFP store

Integer Register File

ALU ALU ALU ALULoadAGU

StoreAGU

L2 C

ache an

d C

ontrol

L1 D-Cache and D-TLB

DTLB 64 entries full associative - L1 8KB, 4 way associative, 64 bytes lines, write-through, non blocking (out of order load!)

BTBCISC

BTB512

Trace cache BTB

Instruction sequence

Page 26: 1 Computer architectures M Pentium IV-XEON. 2 Pentium IV block scheme 32 bytes parallel Four access ports to the EU

26

3.2 GB/s Interfaccia

BTB & I-TLB

Fetch/Decode

Trace Cache

Rename Allocation

m-ops Queues

m-codeROM

Schedulers

FP Register File

FMulFaddMMX

FP moveFP store

Integer Register File

ALU ALU ALU ALULoadAGU

StoreAGU

L2 C

ache an

d C

ontrol

L1 D-Cache and D-TLB

Branch check: a stage for branch verification and match with the prediction. Cache and CISC BTB prediction update

BTBCISC

BTB512

Trace cache BTB

Instruction sequence

Page 27: 1 Computer architectures M Pentium IV-XEON. 2 Pentium IV block scheme 32 bytes parallel Four access ports to the EU

27

Complexity vs efficiency

Cost and power consumption increase

An increase 5-fold of the efficiency required an increase of the complexity 15-fold and and an increase 18-fold of the power consumption

(Normalized against 486 performance =1)

5

15

18

Increase more than proportional of the complexity for the efficiency increase (superscalar, branches prediction, OO execution , caches, clock frequency increase)

Page 28: 1 Computer architectures M Pentium IV-XEON. 2 Pentium IV block scheme 32 bytes parallel Four access ports to the EU

28

Multiprocessor

At pipelines and FU levels there can be inefficiencies (for instance a program or a thread doen’t use a FU or a pipeline stage in a clock period)

…but transistors number increase…But …

Modern processors execute threads (which belong either to the same program or to different programs or to the OS)

Page 29: 1 Computer architectures M Pentium IV-XEON. 2 Pentium IV block scheme 32 bytes parallel Four access ports to the EU

29

Pentium IV

LP LP

BUS

Each processor can execute its own set of instructions, be interrupted, stop (HALT) etc. Each thread is executed OOO

Hyperthreading

LP Logic Processor

Threading is the ability of executing at the same time multiple instruction sequences (see Java) which belong either to the same program or to different (similar to multiprocessor). In the previous processors this required multiple concurrent processors.

Starting from the PENTIUM IV the architecture allows the definition of two «logical» processors each one having its own register sets and able to interface with the bus.

Page 30: 1 Computer architectures M Pentium IV-XEON. 2 Pentium IV block scheme 32 bytes parallel Four access ports to the EU

30

Hyperthreading

Obviously multithreads can be executed on conventional processors too with context switches and through time slice policies.

The Hyperthreading, on the contrary, allows the the execution of two threads without context switch

Page 31: 1 Computer architectures M Pentium IV-XEON. 2 Pentium IV block scheme 32 bytes parallel Four access ports to the EU

31

HyperthreadingImplications

The executed processes behave as if they were executed on different processors.

The processor must maintain a copy of the architectural state od each logial processor which use the same hardware resources (in practice instructiond and operations must be «tagged» that is they have a tag indicating the processor the belong to)

The architectural state consists of all the general registers and of the machine registers

The two logical processors share the caches (physically addressed) the BTB (as the caches), the FU and the control circuits (NOT the TLB)

Improperly stated the hyperthreading can be though as a pipeline level multiprogramming

The die complexity increase is about 5%

Page 32: 1 Computer architectures M Pentium IV-XEON. 2 Pentium IV block scheme 32 bytes parallel Four access ports to the EU

32

Performance

20% efficiency increase against 5% complexity increase. In general the efficiency increase is between 16% and 28%

Page 33: 1 Computer architectures M Pentium IV-XEON. 2 Pentium IV block scheme 32 bytes parallel Four access ports to the EU

34

Xeon

The efficiency of the processor when only one thread is executed must be the same of the processor without hyprethreading. This implies that the resources must be recombined

Design criteria

High reduction of the die size increase (less than 5%)

The stall of a «virtual» processor (caches misses, branch predisction error, hazards etc.) can’t block the oher processor. There are queues between the stages and no virtual processor can use the resources 100%

Pipeline XEON

Thread 1

Thread 2

Trace cache

P IV physical registers

Classical pipeline

Page 34: 1 Computer architectures M Pentium IV-XEON. 2 Pentium IV block scheme 32 bytes parallel Four access ports to the EU

35

Xeon

Two distinct sets of registers: GPRs, Segment, Flag, EIP, x87, MMX, SSE, TR, Machine registers etc.

Trace cache hit

The yellow and red colours refer to the two threads

Common stage Partitioned queues

distinct ITLBs (two different paging systems for different processes))

Instruction pointersdistinti

All’allocatore

Trace cache miss

ROM

The decoding is not alternate per clock but for instructions groups

The trace cache is equally subdivided per thread (the m-ops are «tagged» to avoid to replace too many m-ops of one thread)

Common to the two threads (physical address)

Page 35: 1 Computer architectures M Pentium IV-XEON. 2 Pentium IV block scheme 32 bytes parallel Four access ports to the EU

36 RAT registers are obviously replicated

XeonExecution Trace In case of acces request overlap arbitration and alternative assignment

per clock. If a thread is stalles the other executes regularly

The TC contains 12K u- and is 8 ways set associative. All entries are “tagged” by thread (replacement per thread)

Precise replacement LRU (shift register)

Caches L1 (data), L2 e L3 (if any) are unified (physical addresses)

L1 (data cache) is 4 ways associative, very fast, 64 bytes lines and is “write through” to L2.

There is a common DTLB for L1 4 ways with the entries “tagged”. Pages are 4 KB or 4 MB

L2and L3 are 8 ways associative with 128 bytes lines

Arbitration for L2 access FCFS with requests queue where at least one slot per thread is reserved

Output L2 queues different per thread each with 64 bytes slots

BTB partially shared (physical addresses). The branch hystory buffers are different. The PHT is unique with tagged entries. RSB are obviously different (12 slots per thread)

Unified caches can triggers conflicts but also benefits (i.e. the prefetched instructions of one thread could be used by the other or a thread can use data produced by the others)

Page 36: 1 Computer architectures M Pentium IV-XEON. 2 Pentium IV block scheme 32 bytes parallel Four access ports to the EU

37

If there are m-ops available for each clock the service is alternated. If a thread ha reached the limit level is stalled: the 50% is never violated. By so doing the “fairness” is granted and the deadlocks are avoided

From theTraceCache

Pipeline Out Of Order Split clockaccess

As in the Pentium IV there is a microcode ROM per complex instructions

XeonDecode

The decode logic alternates the queues of the threads and maintain two copies of the necessary information

Allocator

It assigns the u-ops to the 128 slots of the reorder buffer (tagged – for the retirement in order), the integer and 128 floating registers, the load (48) and store (24) buffers. Buffer are partitioned so that each thread can use only 50%