Structure of Computer Systems

Course 7 – Examples of CPU implementations: Microprocessors

Microprocessors

Definition 1: A VLSI circuit that integrates a central processing unit (CPU).

Definition 2: An integrated circuit that integrates:
• one or more central processing units (CPUs)
  • symmetric multiprocessor architecture
  • asymmetric multiprocessor architecture
• cache memory
• other components:
  • interrupt controller
  • bus management unit
  • memory management unit (MMU)

Microprocessors

• First microprocessor: Intel I4004, 4-bit organization
• First successful microprocessor: Intel I8080, 8-bit processor
• First 16-bit processor: Intel I8086
• First 32-bit processor: Intel I80386
• Superscalar microprocessor architecture: Pentium Pro
• 64-bit processors, multi-core architectures: Pentium IV, dual core, Core Duo

Year    Processor     Structure  Memory space  Main characteristics
1971    I4004         4 bits                   First μP
1972    I8008         8 bits     16 KB         First 8-bit μP
1974    8080          8 bits     64 KB         First successful μP
1978    8086, 8088    16 bits    1 MB          First 16-bit μP, basis for the first PC
1982    80286         16 bits    16 MB         PC-AT
1985    80386         32 bits    4 GB          First 32-bit μP
1989    80486         32 bits    4 GB          Integrated FPU
1993    Pentium       32 bits    4 GB          Pipeline
1995    P. Pro        32 bits    64 GB         P6 super-pipeline architecture
1997    P. II         32 bits    64 GB         MMX technology
1999    P. III        32 bits    70 TB         SSE technology
2002    P. IV         32 bits    70 TB         NetBurst architecture
2004    P. IV         64 bits    70 TB         Hyper-threading technology
2006    Core 2        64 bits    70 TB         Multicore architecture (2 cores/chip)
2007    Dual Core     64 bits    70 TB         2 processors/chip
2008-9  I5, I7        64 bits    70 TB         Nehalem architecture, multicore and hyper-threading; 4 cores/8 threads, 8 MB L3 cache
2011    Sandy Bridge

Components of a microprocessor

Traditional components:
• Control Unit (CU)
• Arithmetic and Logic Unit (ALU)
• General and special registers (GR, SR)

Supplementary components:
• Cache memories (Cache)
  • high-speed, low-capacity memories
  • hierarchical organization on 2-3 levels
• Mathematical co-processor (CoP)
  • for floating-point arithmetic
• Memory Management Unit (MMU)
  • controls the traffic (instructions and data) between the main memory and the cache memory
• Interrupt controller
  • handles internal and external events
  • synchronizes the processor with the I/O interfaces

Signals of a microprocessor – the System Bus

[Figure: the μP, memory modules, and I/O interfaces (with their I/O devices) attached to the system bus, which carries address, data, and command lines]

Structure of a PC (a more realistic view)

[Figure: block diagram of a PC with the μP, the north (N) and south (S) chipsets, memory modules, SVGA on AGP, the PCI bus, network, keyboard, and mouse]

Typical signals for a microprocessor

• Address signals
• Data signals
• Command signals
• Interrupt signals
• Bus arbitration signals
• Clock signal(s)
• Other signals (e.g. status, control)
• Power supply signals

Address signals: A0-An
• Used for specifying memory locations or I/O ports (registers)
• Generated by the microprocessor to address the other components (read or write operations)
• The number of address lines determines the maximum addressing space of a microprocessor
  • Ex: 20 lines => 1 MB
  • 32 lines => 4 GB

Data signals: D0-Dm
• Bidirectional lines used to transfer instruction codes and data between the microprocessor and the other components of the system
• The number of data lines usually matches the internal organization of the processor (there are exceptions, see 8088, Pentium Pro)
• The number of data lines determines the maximum width of a data item transferred on the bus
  • Ex: 8, 16, 32, 64 lines
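The relation between the number of address lines and the addressing space can be checked with a few lines of Python. This is a minimal sketch: the function names `address_space_bytes` and `format_size` are illustrative, and it assumes a byte-addressable memory (one byte per address).

```python
def address_space_bytes(address_lines: int) -> int:
    """Maximum number of addressable bytes, assuming one byte per address."""
    if address_lines < 0:
        raise ValueError("address_lines must be non-negative")
    return 2 ** address_lines


def format_size(num_bytes: int) -> str:
    """Express a byte count with binary prefixes (1 KB = 1024 bytes)."""
    units = ["B", "KB", "MB", "GB", "TB"]
    value = num_bytes
    for unit in units:
        # Stop at the first unit that keeps the value below 1024,
        # or at the largest unit we know about.
        if value < 1024 or unit == units[-1]:
            return f"{value} {unit}"
        value //= 1024


for lines in (16, 20, 24, 32):
    print(lines, "lines =>", format_size(address_space_bytes(lines)))
```

With 20 lines the sketch reports 1 MB and with 32 lines 4 GB, matching the examples on the slide.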

Command and control signals

• Command signals:
  • MRDC\, MWTC\, IORC\, IOWC\, INTA\ (the trailing \ marks an active-low signal)
  • determine the memory and interface read and write cycles
  • very important signals
  • similar signals exist for any microprocessor
• Control signals: ALE (Address Latch Enable), DEN (Data Enable)
  • help control the address and data amplifiers
  • specific to every microprocessor
• Interrupt signals: INTR, NMI
• Clock signals: CLK, PCLK
• Power supply signals: GND, +5 V, 3.3 V

Instruction execution

Steps:
• Instruction fetch
• Operand read
• Operation execution
• Write the result

Seen from outside:
• Instruction fetch cycle: read from the memory (mandatory)
• Operand(s) read (optional)
• Write the result (optional)

Transfer cycle (on the bus)
• A transfer on the bus that involves:
  • the processor and the memory, or
  • the processor and an I/O interface
• A cycle has a fixed number of clock periods (determined by the microprocessor's architecture)
  • it may be extended on request by an integer number of clock periods if a slow module is addressed (e.g. EPROM memory)
• A cycle is a sequence of signal activations on the bus (address, data, and command)
  • a cycle is described by a time diagram
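How a cycle is extended for a slow module can be estimated with a short calculation. The sketch below is a simplification under stated assumptions: the whole base cycle is treated as time available to the addressed module, and the 4-clock base cycle, the 100 ns clock period, and the access times in the example are made-up values, not figures from the course.

```python
import math


def wait_states(t_access_ns: float, t_clk_ns: float, base_clocks: int) -> int:
    """Extra clock periods needed so the cycle is at least as long as the access time."""
    available_ns = base_clocks * t_clk_ns
    if t_access_ns <= available_ns:
        return 0  # the module is fast enough, no extension needed
    # The extension is always an integer number of clock periods.
    return math.ceil((t_access_ns - available_ns) / t_clk_ns)


def cycle_time_ns(t_access_ns: float, t_clk_ns: float, base_clocks: int) -> float:
    """Total duration of the transfer cycle, including the inserted clock periods."""
    extra = wait_states(t_access_ns, t_clk_ns, base_clocks)
    return (base_clocks + extra) * t_clk_ns


# Hypothetical example: 4-clock base cycle, 100 ns clock, 450 ns EPROM.
print(wait_states(450, 100, 4), cycle_time_ns(450, 100, 4))
```

In this hypothetical case a single extra clock period is inserted, so the cycle grows from 400 ns to 500 ns.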

Time diagrams for transfers on a classical bus

[Figure: time diagrams of the read memory cycle and the write memory cycle, showing A0-An (valid address), MRDC, MWTC, and D0-Dm (valid data), with the intervals t_access and t_cycle marked]

Processors of the Intel x86 family

I8086 and I8088

[Figure: internal structure of the I8086 and I8088, divided into EU and BIU. Labels: AX (AH, AL), BX (BH, BL), CX (CH, CL), DX (DH, DL), SI, DI, BP, SP, CS, DS, ES, SS, IP, IR, temporary registers, ALU, state register, control unit, instruction queue (1, 2, 3, 4, ...), external bus control]

I8086, I8088

I8086
• 16-bit processor with 16 data lines and 20 address lines (1 MB addressing space)
• 40-pin integrated circuit
• Supporting circuits:
  • 8087: mathematical co-processor (floating point)
  • 8288: bus controller
  • 8289: bus arbiter
• Structure:
  • EU (Execution Unit): dedicated to instruction execution
    • CU, ALU, general registers, state register
  • BIU (Bus Interface Unit): responsible for the operations (transfer cycles) on the external bus
    • transfers instructions (in advance) and data
    • contains: special registers (segment registers, IP), instruction queue, bus amplifiers

I8088
• identical to the 8086, but with 8 data signals on the external bus
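The 16-bit segment registers and the 20 address lines fit together through real-mode address formation, which the slide does not spell out: the segment value is shifted left by 4 bits and the 16-bit offset is added. A minimal sketch, with `physical_address` as an illustrative name:

```python
def physical_address(segment: int, offset: int) -> int:
    """20-bit physical address from a 16-bit segment and a 16-bit offset (8086 real mode)."""
    if not (0 <= segment <= 0xFFFF and 0 <= offset <= 0xFFFF):
        raise ValueError("segment and offset must fit in 16 bits")
    # segment * 16 + offset, truncated to the 20 address lines A0-A19
    return ((segment << 4) + offset) & 0xFFFFF


print(hex(physical_address(0x1234, 0x0010)))
```

The mask models the fact that only 20 address lines exist, so a sum that exceeds 1 MB wraps around to the bottom of the address space.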

I80286

• 16-bit processor
• 16 data lines, 24 address lines (16 MB addressing space)
• Working modes: real and protected (privileged)

[Figure: internal structure of the I80286 processor: addressing unit, interfacing unit (data amplifiers, address amplifiers, bus control, external bus), execution unit, instruction unit (instruction queue, instruction decode)]

I80386

• 32-bit processor, 32 data lines, 32 address lines (4 GB addressing space)
• General registers extended to 32 bits
• 2 extra segment registers (FS and GS)
• Improved protected mode

[Figure: internal structure of the I80386 processor: segmenting unit, paging unit, execution unit, interface unit, decoding unit, instruction prefetch unit]

I80486

• Integrates: processor + co-processor + MMU
• Enables the use of cache memory
• Improved protected mode

[Figure: internal structure of the I80486: segmenting unit, paging unit, integer execution unit, floating-point execution unit, cache unit, bus interface unit, instruction decoder, instruction prefetch unit]

Pentium

• Two integer pipelines: U (executes any instruction) and V (executes simple instructions), plus a separate floating-point unit
• 64-bit external bus (for a 32-bit processor)
• Versions:
  • Pentium: 2-pipeline architecture
  • Pentium Pro, Pentium II, Pentium III: superscalar P6 architecture
  • Pentium IV: NetBurst architecture
  • I7, I5, I3: multicore and hyper-threading

Pentium Processors

Pentium Pro
• Superscalar P6 architecture (CPI < 1)
• Dynamic instruction execution:
  • data flow analysis
  • branch prediction
  • speculative execution of instructions

Pentium II
• MMX technology:
  • a SIMD execution unit dedicated to multimedia data
  • parallel (SIMD) execution of arithmetic operations
  • 57 new MMX instructions

Pentium III
• SSE technology:
  • parallel (SIMD) execution on floating-point variables
  • good for 2D/3D graphics
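The idea behind SIMD execution can be illustrated in plain Python: one operation is applied at once to several small values packed into a single wide word. The sketch below emulates a packed byte addition on eight 8-bit lanes of a 64-bit word, with each lane wrapping modulo 256. The helper names are hypothetical, and this is an emulation of the concept, not actual MMX code.

```python
LANES = 8                       # eight 8-bit lanes in one 64-bit word
LANE_BITS = 8
LANE_MASK = (1 << LANE_BITS) - 1


def pack(values):
    """Pack eight integers (0-255) into one 64-bit word, lane 0 in the low byte."""
    if len(values) != LANES:
        raise ValueError("expected exactly 8 values")
    word = 0
    for i, v in enumerate(values):
        word |= (v & LANE_MASK) << (i * LANE_BITS)
    return word


def unpack(word):
    """Split a 64-bit word back into its eight 8-bit lanes."""
    return [(word >> (i * LANE_BITS)) & LANE_MASK for i in range(LANES)]


def packed_add(a, b):
    """Add two packed words lane by lane; no carry crosses a lane boundary."""
    result = 0
    for i in range(LANES):
        shift = i * LANE_BITS
        lane_sum = ((a >> shift) + (b >> shift)) & LANE_MASK
        result |= lane_sum << shift
    return result


pixels = pack([1, 2, 3, 4, 250, 6, 7, 8])
delta = pack([10] * 8)
print(unpack(packed_add(pixels, delta)))
```

In hardware all eight lane additions happen in one instruction; the Python loop only models the result. The lane holding 250 wraps around to 4, which shows that lanes are independent.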

P6 superscalar architecture

• 3 autonomous units, 12 pipeline stages
• Speculative execution

[Figure: functional blocks of the P6 architecture: instruction fetch and decode unit, instruction dispatch and execute unit, retirement unit, all connected to the instruction pool]

Detailed view of the P6 architecture

[Figure: system bus, L2 cache, bus interface unit (BIU), L1 ICache, L1 DCache, instruction fetch and decode unit, instruction dispatch and execute unit, retirement unit, instruction pool]

Instruction fetch and decoding unit

• Fetches and decodes instructions in advance
• In-order unit
• 3 instructions decoded per clock
• Branch prediction
• Components:
  • Decoder (3 units)
  • Address generator unit (next_IP)
  • Branch target buffer
  • Micro-operation sequencer
  • Alias register allocator

[Figure: instruction fetch and decoding unit: input from the BIU (Bus Interface Unit), L1 ICache, Next_IP, branch target buffer, instruction decoder (x3), micro-operations sequencer, alias register allocator, output to the instruction pool]

Instruction dispatch and execute unit

• Responsible for instruction execution
• Out-of-order unit
• 7 execution units + reservation station
  • IEU: Integer Execution Unit
  • FEU: Floating-point Execution Unit
  • MMX: Multimedia execution unit
  • AGU: Address Generation Unit
  • JEU: Jump Execution Unit

[Figure: instruction dispatch and execute unit: the reservation station, fed from the instruction pool, issues to Port 0 (MMX, FEU, IEU), Port 1 (MMX, JEU, IEU), Port 2 (AGU read), and Ports 3, 4 (AGU write)]

Retirement Unit

• Re-establishes the normal order of the instructions (of the results)
• In-order unit
• Components:
  • MIU: memory interface unit
  • RRF: retirement register file

[Figure: retirement unit: DCache, reservation station, MIU, RRF, instruction pool]

Solving hazard cases in the P6 architecture

Control hazard:
• complex branch prediction, BTB, next address predictor
• out-of-order instruction execution
• execute both branches of an if

Data hazard:
• alias registers: renaming of registers, using more internal registers (40) than those seen by the programmer
• out-of-order instruction execution
• data dependency tree

Structural hazard:
• multiple execution units (7 ALUs)
• separate instruction and data caches
• reservation stations

In essence it is an implementation of Tomasulo's method.
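Register renaming can be sketched in a few lines. Every write to an architectural register receives a fresh physical (alias) register, so only true read-after-write dependencies remain. The instruction format, the `p0`, `p1`, ... names, and the simple free list are assumptions made for illustration; a real allocator also recycles registers when instructions retire, which this sketch omits.

```python
def rename(instructions, num_physical=40):
    """Rename (dest, src1, src2) triples so every destination gets a fresh physical register."""
    mapping = {}                                   # architectural -> current physical register
    free = [f"p{i}" for i in range(num_physical)]  # free list, consumed in order

    def allocate():
        if not free:
            raise RuntimeError("out of physical registers")
        return free.pop(0)

    renamed = []
    for dest, src1, src2 in instructions:
        # Sources read the physical register that currently holds the value.
        phys_sources = []
        for src in (src1, src2):
            if src not in mapping:
                mapping[src] = allocate()          # first use of a live-in value
            phys_sources.append(mapping[src])
        # The destination always gets a new register, which removes
        # write-after-write and write-after-read conflicts.
        mapping[dest] = allocate()
        renamed.append((mapping[dest], phys_sources[0], phys_sources[1]))
    return renamed


program = [
    ("eax", "ebx", "ecx"),   # eax = ebx op ecx
    ("edx", "eax", "ebx"),   # true dependency on eax (read after write)
    ("eax", "esi", "edi"),   # reuses eax: conflicts with both earlier instructions
]
for line in rename(program):
    print(line)
```

After renaming, the third instruction writes a register different from the one the second instruction reads, so it no longer has to wait for the first two and can execute out of order.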

The P6 Bus

The main elements of the P6 bus:
• the bus works in a synchronous mode; every signal is sampled on clock signal edges
• transfers are made through transactions that may be executed in parallel
• it is a multi-processor bus; several processors can share the same bus
• block transfers are preferred
• there are error detection and correction mechanisms
• there are mechanisms that ensure cache memory consistency
• a new digital technology (different amplifiers) that ensures high-frequency transmission on the bus

Transfer on the P6 bus

• Parallel transactions (pipeline)
• Phases:
  • Arbitration: decides which master gets access to the bus
  • Transfer request: specifies the request (read or write, start address, number of bytes)
  • Snooping: detects and solves cache inconsistencies
  • Error: detects and solves transmission errors (ECC, error correction code, on data; parity on address and command signals)
  • Response: specifies the type of the answer (now, delayed, refused)
  • Transfer: data transfer in accordance with the request
• Technology: GTL (instead of TTL)

Time diagram for the P6 bus

[Figure: concurrent transactions on the P6 bus over BCLK periods 1-16, with overlapping Arbitration, Request, Error, Snooping, Response, and Transfer phases]

Pentium IV – NetBurst Architecture (7th generation)

• A 20-stage pipeline architecture
  • double compared with P6
• Bus frequency increased 4 times
  • 400 MHz, with "quad pump" technology, 3.2 GB/s transfer speed
• Doubled ALU speed
  • 2 arithmetic operations are executed in every clock period; the ALU works with a double-frequency clock
• Very high speed cache memory
  • Advanced Transfer Cache, which at 2 GHz provides a 64 GB/s data transfer rate
• Extension of the MMX technology: SSE2 (Streaming SIMD Extensions 2)
  • 144 new SIMD instructions that extend the data width to 128 bits (16 bytes processed in parallel)
• Branch prediction improved by approximately 30%
  • through the extension of the BTB unit and
  • by increasing the instruction queue to 126 instructions
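The transfer rates quoted for the bus and for the cache follow from clock frequency times interface width. The sketch below reproduces them under stated assumptions: a 100 MHz base bus clock with four transfers per clock over a 64-bit data bus, and a 256-bit cache interface clocked at the core frequency. The slide gives only the resulting figures, so these widths are assumptions of the example, and `bandwidth_gb_per_s` is an illustrative name.

```python
def bandwidth_gb_per_s(clock_hz: float, transfers_per_clock: int, width_bits: int) -> float:
    """Peak transfer rate in GB/s, with 1 GB taken as 10**9 bytes."""
    bytes_per_transfer = width_bits / 8
    return clock_hz * transfers_per_clock * bytes_per_transfer / 1e9


# Assumed: 100 MHz bus clock, quad-pumped (400 MHz effective), 64-bit data bus.
bus = bandwidth_gb_per_s(100e6, 4, 64)

# Assumed: 256-bit cache interface, one transfer per 2 GHz core clock.
cache = bandwidth_gb_per_s(2e9, 1, 256)

print(f"bus: {bus:.1f} GB/s, cache: {cache:.1f} GB/s")
```

Both results are peak values; sustained rates in a real system are lower.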

Pentium IV

[Figure: the NetBurst Pentium IV architecture, in three sections. Interface with the external bus: L2 cache and control. Instruction fetch and decode: BTB, decoder, trace cache, ROM. Instruction scheduling and execution: alias register allocator, instruction queues for micro-operations, schedulers, registers for floats and for integers, ALUs, AGUs, floating-point ALUs (ALU-F), L1 D-Cache]

Pentium IV

New tendencies:
• Hyper-threading technology
  • two threads executed in parallel on the same core
• Multi-core technology
  • several processors on the same chip
• 64-bit architecture

I7, I5, I3: Nehalem architecture, internal view

[Figure: internal view of the Nehalem architecture]

Nehalem architecture: external view

[Figure: external view of the Nehalem architecture]

Nehalem architecture: multiprocessor configuration

[Figure: multiprocessor configurations, comparing communication on the FSB (Front Side Bus) with communication on QPI (QuickPath Interconnect)]

Sandy Bridge architecture

• The north bridge (memory controller, graphics controller, and PCI Express controller) is integrated in the same chip as the rest of the CPU. The first models use a 32 nm manufacturing process
• Ring architecture: 256 bits/cycle
• Two load/store operations per CPU cycle for each memory channel
• New cache for decoded microinstructions (L0 cache, capable of storing 1,536 microinstructions, which corresponds to roughly 6 kB)
• 32 kB L1 instruction cache and 32 kB L1 data cache per CPU core (no change from Nehalem)
• The L2 memory cache was renamed "mid-level cache" (MLC), with 256 kB per CPU core
• The L3 memory cache is now called LLC (Last Level Cache); it is not unified anymore and is shared by the CPU cores and the graphics engine
• Next-generation Turbo Boost technology
• New AVX (Advanced Vector Extensions) instruction set
• Up to 8 physical cores, or 16 logical cores through Hyper-threading

Sandy Bridge architecture

[Figure: Sandy Bridge configurations: 1 processor with 4 cores; 2 processors with 8 cores per processor]

Evolution of Intel processor architectures
