1
Computer architectures M
Pentium IV-XEON
2
Pentium IV block scheme
32 bytes parallel
Four access ports to the EU
3
Fetch/Decode
I-TLB
Trace Cache
Rename/Allocation
µ-op queues
µ-code ROM
Trace cache BTB
Schedulers
FP Register File
FMul FAdd MMX
FP move FP store
Integer Register File
ALU ALU ALU ALU Load AGU Store AGU
L2 Cache and Control
L1 D-Cache and D-TLB
AGU: Address Generation Unit
BTB: Branch Target Buffer
I-TLB: Instruction TLB
D-TLB: Data TLB
3.2 GB/s interface
400 MHz FSB – 64-bit parallel bus – 8 bytes/transaction (up to 1600 MHz too)
Pentium IV block scheme (Netburst Architecture)
µ-ops are sent directly to the FU reservation stations and at the same time inserted into the ROB
BTB (CISC)
No L1 instruction cache!!!
4
General features
• Very high clock frequency (up to 4 GHz)
• Increased number of pipeline stages
• For the data, still an L1 cache (set associative – 8 KB, 4-way – 64-byte lines, half the line width of the L2)
• L2 cache: unified – 256 KB, 8-way associative, 256-bit parallel (32 bytes in parallel), 128-byte lines. One transfer on each clock edge (32 bytes – ¼ of the line). Bandwidth = 256 bit × 2 × fclock
• ALU clock frequency double that of the rest of the CPU
• FSB 1.6 GHz
• ROB: 126 µ-ops
• BTB: 4K entries
• No L1 instruction cache

But:
• the increased stage number increases the branch penalty
• the increased clock frequency sometimes requires multiple cycles for the ALU operations, and therefore delays for the waiting instructions
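The L2 bandwidth formula above (256 bit × 2 × fclock) can be sanity-checked with a few lines of Python (the 1.4 GHz value is just one of the shipping frequencies):

```python
# L2 cache bandwidth: 256-bit bus, one 32-byte transfer on each clock edge.
bus_width_bytes = 256 // 8          # 32 bytes per transfer
transfers_per_clock = 2             # positive and negative edges
f_clock_hz = 1.4e9                  # example core frequency (1.4 GHz part)

bandwidth = bus_width_bytes * transfers_per_clock * f_clock_hz
print(f"{bandwidth / 1e9:.1f} GB/s")  # 89.6 GB/s at 1.4 GHz
```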
5
Trace Cache
• The trace cache has two modes of operation. In «execute mode» it provides the pipeline stages with µ-ops. In case of a miss, it retrieves the needed instructions from L2, interprets the branches (through the BTB), retrieves the instructions speculatively, decodes them and builds a «segment» of µ-ops internally. Two advantages: no bubble for branches already predicted, and no waste of cache lines (in normal caches the bytes following a branch are not used, because the fetch stops when a branch is encountered, while here a line can contain both the branch and the speculated code). Obviously there is always the possibility of mispredictions.
• Since in the majority of cases the instructions are already decoded, in the Pentium IV there are not three decoders but only one, because theoretically the need for decoding is reduced.
• In the trace cache there are in general 6 µ-ops per line.
• This architecture is a member of the architectures with «predecoding», where the I-cache holds decoded instructions (e.g. AMD Kx, Nehalem etc.).
• For complex instructions (>4 µ-ops), instead of filling the trace cache, a tag is inserted which triggers the decoding ROM; the ROM provides the µ-ops whenever such instructions are encountered. Not much change for the efficiency (1 clock delay): the pipeline is kept active in any case.
• Trace cache size: about 12K µ-ops (equivalent – for Intel – to a 16-18K instruction cache)

BUT…
• The trace cache is «fragile». In case of a miss the decoding is made one instruction at a time (a single decoder!); it was verified that the hit rate was less than 60%, and in 40% of the cases the system behaves as if only one instruction at a time were processed. The trace cache was therefore later discarded.

NB: there is obviously a second BTB for the trace cache.
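The two modes can be sketched with a toy model (hypothetical code, for illustration only — the class and function names are invented): a lookup that hits streams the stored µ-op segment, while a miss falls back to building a segment through the single decoder.

```python
# Toy model of the two trace-cache modes: "execute mode" streams cached
# µ-op segments; on a miss the front end builds a new segment by decoding
# (branches followed speculatively, so a segment can span a taken branch).

class TraceCache:
    def __init__(self):
        self.segments = {}          # start address -> list of µ-ops

    def lookup(self, addr):
        # Execute mode: return the stored segment, or None on a miss.
        return self.segments.get(addr)

    def build_segment(self, addr, decode):
        # Build mode: decode instructions and store the resulting segment.
        segment = decode(addr)
        self.segments[addr] = segment
        return segment

def decode_stub(addr):
    # Stand-in for the single x86 decoder: ~6 µ-ops per trace line.
    return [f"uop@{addr:#x}+{i}" for i in range(6)]

tc = TraceCache()
assert tc.lookup(0x1000) is None            # miss: enter build mode
seg = tc.build_segment(0x1000, decode_stub)
assert tc.lookup(0x1000) == seg             # hit: execute mode
```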
6
It was said
• Intel increased the clock frequency (and therefore the number of pipeline stages) only for marketing reasons…
• By reducing the transistor size, the clock speed can be increased, but the functionality could be improved too. This was not the case with the PIV.
7
Technological characteristics
Technology: 0.13 micron (130 nm) – 217 mm² – 42 million transistors – socket 423 – power supply 1.7 V @ 32 A – 52 W at 1.4 GHz (much greater at 4 GHz! – a great cooling problem)
8
Technological characteristics
9
Socket 423
Technological characteristics
10
Execution core
Arithmetic section
Memory section (load and store)
The dispatch ports (electrical paths to the FUs) are only 4 (ports 0-3)
Simple execution unit
Complex execution unit
RAT
(it is always possible to imagine that the instructions are sent directly to the ports with a reference number, and after execution to the ROB)
Ports 2 and 3
11
The PIV has 2 (fast) × 2 (µ-ops/clock) + 1 (slow) ALU = 5 virtual ALUs!
Execution core
• Because of the increased clock frequency the RS should have been enlarged to avoid underrun: Intel preferred to add the trace cache.
• Here too, 3 µ-ops per clock are extracted. The ROB is much greater (126 slots).
• Only 4 dispatch ports (ports 0-3 against the 5 ports of the P6).
• Up to 6 µ-ops per clock are sent to the 4 ports (possible because the fast ALUs run at twice the clock speed – positive and negative edges: 2 µ-ops × 2 fast ALUs + 1 slow ALU + 1 LD/ST).
• Execution port 0: integer additions and subtractions, logical operations, branch evaluation and store of data into the ROB (positive and negative edges); floating-point and SSE moves.
• Execution port 1: fast integer ALU (only additions and subtractions); slow ALU (complex integer operations, e.g. shifts and rotations, which cannot be executed in a single clock); FP and SSE.
• Execution ports 2 and 3: load and store.
12
Branch Prediction
• As in the P6 there are two branch predictors: one dynamic and one static.
• The static predictor is very simple:
– conditional forward branch: predicted not taken
– conditional backward branch: predicted taken
• The BTB consists of 4096 entries (8-way set associative) plus an algorithm kept secret by Intel. In any case it is not enough if there are many nested loops.
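The static rule is trivial to express: a conditional branch whose target lies at a lower address (backward, typically a loop-closing branch) is predicted taken, otherwise not taken. A minimal sketch:

```python
def static_predict(branch_addr, target_addr):
    """Static rule: backward conditional branches (loops) are predicted
    taken, forward conditional branches are predicted not taken."""
    return target_addr < branch_addr   # True = predicted taken

# A loop's closing branch jumps backward -> predicted taken.
assert static_predict(0x2010, 0x2000) is True
# A forward skip (e.g. over an error path) -> predicted not taken.
assert static_predict(0x2010, 0x2040) is False
```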
13
Pentium IV Pipeline – Stages 1-5
TC Nxt IP – Trace cache next instruction pointer (and branch verification)
TC Fetch – Trace cache fetch
Drive – driving stage (present only for electrical reasons, to allow signal propagation at fclock). Many stages require two clocks because of the high clock frequency (for instance the transfer from the EU to the ROB).
At the end of the first 5 stages the PIV inserts three µ-ops into a µ-op queue (in order), for speed compensation.
14
Stages 6-12
Scheduler – µ-op scheduling (µ-ops are executed only when their operands are ready). Each scheduler (10-12 µ-ops) selects the µ-ops to send to its ports (in order):
• Memory Scheduler – Load/Store Unit (LSU)
• Fast ALU Scheduler – Arithmetic-Logic Unit (simple integer and logical operations)
• Slow ALU/General FPU Scheduler – other ALU functions and the majority of the floating point
• Simple FP Scheduler – simple FP operations and FP with memory
128 integer registers, 128 floating-point registers
Port: bus toward the functional units
Pentium IV Pipeline
Allocator/Register renaming – internal register allocation (128 × 2: 128 integer and 128 FP, dynamically allocated) and ROB insertion. If LOAD or STORE => buffer register allocation and reservation for the FU. RAT: 128 integer registers, 128 FP, and 128 VPR (vector proc./FP).
ROB queue – µ-ops (126 slots). The slots are in two sets: one for LOAD/STORE and the other for all the other µ-ops.
15
Pentium IV RAT
[RAT diagram: architectural registers EAX, EBX, ECX, EDX, ESI, EDI, ESP, EBP mapped onto physical registers 0…127]
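The RAT's job — mapping each architectural register to a freshly allocated physical register on every write, so that WAW/WAR dependences disappear — can be sketched as follows (toy model, not Intel's implementation):

```python
# Sketch of register renaming with a RAT: each write to an architectural
# register gets a fresh physical register; readers see the latest mapping.
from itertools import count

class RAT:
    def __init__(self):
        self.free = count()               # simplistic free list: 0, 1, 2, ...
        self.map = {}                     # architectural reg -> physical reg

    def rename_dest(self, arch_reg):
        # A new destination always gets a new physical register,
        # which removes write-after-write and write-after-read hazards.
        phys = next(self.free)
        self.map[arch_reg] = phys
        return phys

    def rename_src(self, arch_reg):
        return self.map.get(arch_reg)

rat = RAT()
p0 = rat.rename_dest("EAX")       # mov eax, 1
p1 = rat.rename_dest("EAX")       # mov eax, 2 -> a different physical reg
assert p0 != p1
assert rat.rename_src("EAX") == p1  # later readers see the newest value
```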
16
Pentium IV Pipeline – Stages 13-16
Dispatcher (13-14): two stages; the (ready) µ-ops are sent to one of the 4 ports (2×2 + 1 + 1 = 6 – the execution units use positive and negative clock edges).
RF (15-16): register file. The required data are retrieved from the sources and loaded into the EU registers.
(SSE = Streaming SIMD Extension)
17
Pentium IV Pipeline – Stages 17-20
Ex – Execute (stage 17)
Flags – execution result flags
BrChk – Branch Check: when mispredicted, 19 stages back!
Drive – rewrite and drive
The 3.5-4 GHz Pentium IV has a 31-stage pipeline!!
18
Instruction sequence – BTB and I-TLB
[Pentium IV block diagram repeated: Fetch/Decode (ROM), Trace Cache, Rename/Allocation, µ-op queues, Schedulers, register files, ALUs/AGUs, L2 Cache and Control, L1 D-Cache and D-TLB]
The instructions are retrieved from the L2 (a 128-byte line = 2 × 64 bytes; 2 × 256-bit transfers = 64 bytes per clock period, because the bus uses both clock edges). The memory address comes either from the BTB (for branches) or from the I-TLB (4096 entries, 4-way associative).
In the trace cache, µ-ops are 118 bits long.
19
Instruction sequence – BTB and I-TLB
[Pentium IV block diagram repeated]
BTB: 512 lines, 8-way set associative – 4-bit Yeh predictor.
Static prediction: backward taken, forward untaken.
20
Instruction sequence – Trace cache
[Pentium IV block diagram repeated]
21
Instruction sequence – Rename/Allocation (Drive)
[Pentium IV block diagram repeated]
Allocation: 3 µ-ops per clock edge. Register allocation, integer or FP (2 × 128).
22
Instruction sequence – Schedulers
[Pentium IV block diagram repeated]
Two schedulers, double FIFO queue: in order within a queue, OOO between the two queues.
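The "in order within a queue, OOO between the queues" policy can be sketched with a toy model (hypothetical code; the queue contents are invented): only the head of each queue is a candidate, so each queue stays in order, but a ready head in one queue may overtake a stalled head in the other.

```python
from collections import deque

# Toy model: two µ-op queues feeding the schedulers. Within one queue
# µ-ops issue strictly in order; across the two queues they may issue
# out of order (whichever queue's head is ready goes first).

def issue(q_mem, q_alu, ready):
    """Issue the head of each queue whose head µ-op is ready."""
    issued = []
    for q in (q_mem, q_alu):
        if q and ready(q[0]):
            issued.append(q.popleft())
    return issued

q_mem = deque(["load A", "store B"])
q_alu = deque(["add", "sub"])
# "load A" is waiting on the cache; "add" is ready -> the ALU queue's
# head overtakes the memory queue (OOO between the queues).
first = issue(q_mem, q_alu, ready=lambda u: u != "load A")
assert first == ["add"]
assert q_mem[0] == "load A"   # still in order within its own queue
```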
23
Instruction sequence – Schedulers and dispatch
[Pentium IV block diagram repeated]
µ-ops are sent to the FUs as these become available; up to 6 µ-ops per clock. µ-ops addressed to the same FU issue in FIFO order.
The schedulers handle integer and FP µ-ops, mispredicted branches and MMX.
24
Instruction sequence – Register files
[Pentium IV block diagram repeated]
Access to the integer and FP register files and insertion into the ROB; access to the data cache.
25
Instruction sequence – L1 D-Cache and D-TLB
[Pentium IV block diagram repeated]
D-TLB: 64 entries, fully associative. L1: 8 KB, 4-way associative, 64-byte lines, write-through, non-blocking (out-of-order loads!).
26
Instruction sequence – Branch check
[Pentium IV block diagram repeated]
Branch check: a stage for branch verification and match against the prediction; it updates the trace cache BTB and the CISC BTB predictions.
27
Complexity vs efficiency
Cost and power consumption increase more than proportionally with the efficiency gains (superscalar execution, branch prediction, OOO execution, caches, clock frequency increases).
A 5-fold increase of the efficiency required a 15-fold increase of the complexity and an 18-fold increase of the power consumption (normalized against 486 performance = 1).
28
Multiprocessor
Modern processors execute threads (which belong either to the same program, or to different programs, or to the OS)…
…and the transistor count keeps increasing. But at pipeline and FU level there can be inefficiencies (for instance a program or a thread does not use a FU or a pipeline stage in a given clock period).
29
Pentium IV
[Diagram: two logical processors (LP) sharing one bus]
Hyperthreading – LP: Logical Processor
Threading is the ability to execute multiple instruction sequences at the same time (see Java), belonging either to the same program or to different ones (similar to a multiprocessor). In previous processors this required multiple physical processors.
Starting from the Pentium IV, the architecture allows the definition of two «logical» processors, each with its own register set and able to interface with the bus.
Each logical processor can execute its own instruction stream, be interrupted, stop (HALT) etc. Each thread is executed OOO.
30
Hyperthreading
Obviously multiple threads can be executed on conventional processors too, with context switches and time-slice policies.
Hyperthreading, on the contrary, allows the execution of two threads without context switches.
31
Hyperthreading: implications
The executed processes behave as if they were executed on different processors.
The processor must maintain a copy of the architectural state of each logical processor, while both use the same hardware resources (in practice instructions and operations must be «tagged», that is, they carry a tag indicating the logical processor they belong to).
The architectural state consists of all the general registers and of the machine registers.
The two logical processors share the caches (physically addressed), the BTB (like the caches), the FUs and the control circuits (but NOT the TLB).
Loosely speaking, hyperthreading can be thought of as pipeline-level multiprogramming.
The die complexity increase is about 5%.
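The «tagging» idea can be sketched in a few lines (toy model; the structures and names are invented for illustration): shared entries carry the logical-processor id, so one physical pipeline keeps the two threads' state apart.

```python
# Sketch of thread tagging in a shared structure (e.g. the ROB): every
# entry carries the logical-processor (LP) tag so the two threads can
# share one physical pipeline without mixing their state.
from dataclasses import dataclass

@dataclass
class Uop:
    op: str
    lp: int          # logical processor (thread) tag: 0 or 1

rob = [Uop("add", 0), Uop("load", 1), Uop("mul", 0)]

def retire_thread(rob, lp):
    """Collect (in order) only the µ-ops belonging to one thread."""
    return [u.op for u in rob if u.lp == lp]

assert retire_thread(rob, 0) == ["add", "mul"]
assert retire_thread(rob, 1) == ["load"]
```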
32
Performance
A 20% efficiency increase against a 5% complexity increase. In general, the efficiency increase is between 16% and 28%.
34
Xeon
Design criteria
• The efficiency when only one thread is executed must be the same as that of the processor without hyperthreading. This implies that the resources must be recombined.
• Strict limit on the die size increase (less than 5%).
• The stall of a «virtual» processor (cache misses, branch prediction errors, hazards etc.) must not block the other processor: there are queues between the stages, and no virtual processor can use 100% of the resources.
XEON pipeline
[Diagram: Thread 1 and Thread 2 sharing the trace cache, the Pentium IV physical registers and the classical pipeline]
35
Xeon
Two distinct sets of registers: GPRs, segment registers, flags, EIP, x87, MMX, SSE, TR, machine registers etc. The RAT registers are obviously replicated.
Distinct instruction pointers; distinct I-TLBs (two different paging systems for different processes).
Trace cache hit: the trace cache, common to the two threads (physical addresses), is equally subdivided per thread (the µ-ops are «tagged», to avoid replacing too many µ-ops of one thread). In the figure, the yellow and red colours refer to the two threads. Common stages, partitioned queues; then on to the allocator.
Trace cache miss: µ-code ROM. The decoding does not alternate per clock but per instruction group.
36
Xeon – Execution Trace Cache
In case of overlapping access requests: arbitration and alternate assignment per clock. If a thread is stalled, the other executes regularly.
The TC contains 12K µ-ops and is 8-way set associative. All entries are «tagged» by thread (replacement per thread).
Precise LRU replacement (implemented as a shift register).
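Exact (precise) LRU for a set can be sketched as an ordered list that behaves like the shift register mentioned above (toy model): a hit shifts the touched way to the most-recent end, and the victim is always the way at the least-recent end.

```python
# Sketch of exact LRU for one cache set, kept as an ordered list
# ("shift register"): most recently used at the tail, victim at the head.

def touch(order, way):
    """On a hit, move the way to the most-recently-used position."""
    order.remove(way)
    order.append(way)

def victim(order):
    """The least recently used way is the replacement victim."""
    return order[0]

order = [0, 1, 2, 3]         # 4 ways, initially in fill order
touch(order, 0)              # hit on way 0
assert victim(order) == 1    # way 1 is now the LRU victim
```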
The L1 (data), L2 and L3 (if any) caches are unified (physical addresses).
L1 (data cache) is 4-way associative, very fast, with 64-byte lines, and is write-through to L2.
There is a common D-TLB for L1, 4-way, with «tagged» entries. Pages are 4 KB or 4 MB.
L2 and L3 are 8-way associative with 128-byte lines.
Arbitration for L2 access is FCFS, with a request queue in which at least one slot per thread is reserved.
The L2 output queues are distinct per thread, each with 64-byte slots.
The BTB is partially shared (physical addresses). The branch history buffers are distinct; the PHT is unique, with tagged entries. The RSBs are obviously distinct (12 slots per thread).
Unified caches can trigger conflicts but also benefits (e.g. instructions prefetched by one thread could be used by the other, or a thread can use data produced by the other).
37
If µ-ops are available from both threads, the service alternates at each clock. If a thread has reached its limit level, it is stalled: the 50% limit is never violated. By so doing «fairness» is granted and deadlocks are avoided.
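The 50% rule can be sketched as a partitioned allocator (toy model; the 48-entry figure is the load-buffer count from these slides, and the function name is invented): a thread that hits its half simply stalls, and the other thread is never starved.

```python
# Sketch of partitioned allocation: each thread may occupy at most half
# of a shared buffer pool, so one stalled or greedy thread can never
# starve the other (fairness, no deadlock).

def try_alloc(used, lp, total=48):
    """Allocate one slot for logical processor lp if its half is free."""
    if used[lp] < total // 2:
        used[lp] += 1
        return True
    return False    # this thread is at its 50% limit: it stalls, not the other

used = {0: 0, 1: 0}
for _ in range(30):
    try_alloc(used, 0)      # thread 0 keeps requesting slots
assert used[0] == 24        # capped at half of the 48 slots
assert try_alloc(used, 1)   # thread 1 still allocates freely
```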
[Diagram: from the trace cache – split-clock access – out-of-order pipeline]
As in the Pentium IV, there is a microcode ROM for complex instructions.
Xeon – Decode
The decode logic alternates between the queues of the two threads and maintains two copies of the necessary information.
Allocator
It assigns the u-ops to the 128 slots of the reorder buffer (tagged – for the retirement in order), the integer and 128 floating registers, the load (48) and store (24) buffers. Buffer are partitioned so that each thread can use only 50%