Structure of Computer Systems
Course 7: examples of CPU implementations - Microprocessors
Microprocessors
  Definition 1: a VLSI circuit that integrates a central processing unit (CPU)
  Definition 2: an integrated circuit that integrates:
    • one or more central processing units (CPUs)
        symmetric multiprocessor architecture
        asymmetric multiprocessor architecture
    • cache memory
    • other components: interrupt controller, bus management unit, memory management unit (MMU)
Microprocessors
  First microprocessor: Intel I4004, 4-bit organization
  First successful microprocessor: Intel I8080, 8-bit processor
  First 16-bit processor: Intel I8086
  First 32-bit processor: Intel I80386
  Superscalar microprocessor architecture: Pentium Pro
  64-bit processors, multi-core architectures: Pentium IV, dual core, Core Duo
Year    Processor     Structure  Memory space  Main characteristics
1971    I4004         4 bits                   first μP
1972    I8008         8 bits     16 KB         first 8-bit μP
1974    8080          8 bits     64 KB         first successful μP
1978    8086, 8088    16 bits    1 MB          first 16-bit μP, basis for the first PC
1982    80286         16 bits    16 MB         PC-AT
1985    80386         32 bits    4 GB          first 32-bit μP
1989    80486         32 bits    4 GB          integrated FPU
1993    Pentium       32 bits    4 GB          pipeline
1995    P. Pro        32 bits    64 GB         P6 super-pipeline architecture
1997    P. II         32 bits    64 GB         MMX technology
1999    P. III        32 bits    70 TB         SSE technology
2002    P. IV         32 bits    70 TB         NetBurst architecture
2004    P. IV         64 bits    70 TB         Hyper-threading technology
2006    Core 2        64 bits    70 TB         multicore architecture (2 cores/chip)
2007    Dual Core     64 bits    70 TB         2 processors/chip
2008-9  I5, I7        64 bits    70 TB         Nehalem architecture, multicore and hyper-threading, 4 cores/8 threads, 8 MB L3 cache
2011    Sandy Bridge
Components of a microprocessor
  Traditional components:
    Control Unit (CU)
    Arithmetic and Logic Unit (ALU)
    General and special registers (GR, SR)
  Supplementary components:
    Cache memories (Cache)
      • high-speed, low-capacity memories
      • hierarchical organization on 2-3 levels
    Mathematical co-processor (CoP)
      • for floating-point arithmetic
    Memory Management Unit (MMU)
      • controls the traffic (instructions and data) between the main memory and the cache memory
    Interrupt controller
      • handles internal and external events
      • synchronizes the processor with the I/O interfaces
Signals of a microprocessor: the System Bus
  [Figure: the μP, the memory modules and the I/O interfaces (with their I/O devices) connected through the system bus, which consists of address, data and command lines]
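Structure of a PC (a more realistic view)
  [Figure: the μP connected to the north chipset (Chipset N), which links the memory modules and the SVGA adapter on AGP; the south chipset (Chipset S) links the PCI bus, the network interface, the keyboard and the mouse]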
Typical signals for a microprocessor
  [Figure: signal groups of a microprocessor]
    Address signals
    Data signals
    Command signals
    Interrupt signals
    Bus arbitration signals
    Clock signal(s)
    Other signals (e.g. status, control)
    Power supply signals
Typical signals for a microprocessor (continued)
  Address signals: A0-An
    Used for specifying memory locations or I/O ports (registers)
    Generated by the microprocessor and sent to the other components in order to address them (read or write operations)
    The number of address lines determines the maximum addressing space of a microprocessor
      • Ex: 20 lines => 1 MB
      • 32 lines => 4 GB
  Data signals: D0-Dm
    Bidirectional lines used to transfer instruction codes and data between the microprocessor and the other components of the system
    The number of data lines usually matches the internal organization of the processor (there are exceptions, see 8088, Pentium Pro)
    The number of data lines determines the maximum width of a data item transferred on the bus
      • Ex: 8, 16, 32, 64 lines
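The relation between the number of address lines and the addressing space can be checked with a few lines of Python. This is a minimal sketch that assumes a byte-addressable memory; the function names `address_space_bytes` and `human` are illustrative only.

```python
def address_space_bytes(address_lines: int) -> int:
    """Number of distinct byte addresses selectable with n address lines (2**n)."""
    return 2 ** address_lines


def human(n: int) -> str:
    """Format a power-of-two byte count using binary multiples (1 KB = 1024 B)."""
    units = ["B", "KB", "MB", "GB", "TB"]
    i = 0
    while n >= 1024 and n % 1024 == 0 and i < len(units) - 1:
        n //= 1024
        i += 1
    return f"{n} {units[i]}"


# Line counts taken from the processors discussed in this course.
for lines in (16, 20, 24, 32, 36):
    print(lines, "address lines ->", human(address_space_bytes(lines)))
```

The 20-line and 32-line cases reproduce the 1 MB and 4 GB values given on the slide.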
Typical signals for a microprocessor (continued)
  Command and control signals
    Command signals:
      • MRDC\, MWTC\, IORC\, IOWC\, INTA\
      • determine the memory and interface read and write cycles
      • very important signals
      • similar signals exist on any microprocessor
    Control signals: ALE (Address Latch Enable), DEN (Data Enable)
      • help control the address and data amplifiers
      • specific to every microprocessor
  Interrupt signals: INTR, NMI
  Clock signals: CLK, PCLK
  Power supply signals: GND, +5 V, 3.3 V
Instruction execution
  Steps:
    Instruction fetch
    Operand read
    Operation execution
    Write the result
  Seen from outside:
    Instruction fetch cycle: read from the memory, mandatory
    Operand(s) read: optional
    Write the result: optional
  Transfer cycle (on the bus)
    A transfer on the bus that involves:
      • the processor and the memory, or
      • the processor and an I/O interface
    A cycle has a fixed number of clock periods (determined by the microprocessor's architecture)
      • it may be extended on request by an integer number of clock periods if a slow module is addressed (e.g. EPROM memory)
    A cycle is a sequence of signal activations on the bus (address, data and command)
      • a cycle is described by a time diagram
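The extension of a transfer cycle for a slow module can be sketched as follows. This is a simplified model, not a timing specification: it assumes a base cycle of 4 clock periods (the default `base_clocks=4` is an assumption for illustration) and that the whole cycle is available to the addressed module. The function names are hypothetical.

```python
def wait_states(access_time_ns: int, clock_period_ns: int, base_clocks: int = 4) -> int:
    """Extra clock periods needed so that the cycle is at least as long as the access time."""
    clocks_needed = -(-access_time_ns // clock_period_ns)  # ceiling division on integers
    return max(0, clocks_needed - base_clocks)


def cycle_time_ns(clock_period_ns: int, base_clocks: int = 4, waits: int = 0) -> int:
    """Duration of a transfer cycle extended by an integer number of clock periods."""
    return (base_clocks + waits) * clock_period_ns


# Illustrative values: a 200 ns clock, a fast module (600 ns) and a slow one (950 ns).
for name, access in (("fast module", 600), ("slow module", 950)):
    w = wait_states(access, 200)
    print(name, "-> extra clock periods:", w, "cycle:", cycle_time_ns(200, waits=w), "ns")
```

The extension is always a whole number of clock periods, which is why the ceiling division is used.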
Time diagrams for transfers on a classical bus
  [Figure: Read Memory Cycle and Write Memory Cycle time diagrams. Each shows the signals A0-An (valid address), MRDC, MWTC and D0-Dm (valid data), with the intervals t_cycle and t_access marked]
Processors of the Intel x86 family
  I8086 and I8088
  [Figure: Internal structure of the I8086 and I8088. EU: general registers AX (AH, AL), BX (BH, BL), CX (CH, CL), DX (DH, DL), SI, DI, BP, SP, temporary register, ALU, state register, control unit, IR. BIU: segment registers CS, DS, ES, SS, IP, instruction queue (1, 2, 3, 4, ...), external bus control]
I8086, I8088
  I8086
    16-bit processor with 16 data lines and 20 address lines (1 MB addressing space)
    40-pin integrated circuit
    Supporting circuits:
      • 8087: mathematical co-processor (floating point)
      • 8288: bus controller
      • 8289: bus arbiter
    Structure:
      • EU (Execution Unit): dedicated to instruction execution
          CU, ALU, general registers, state register
      • BIU (Bus Interface Unit): responsible for the operations (transfer cycles) with the external bus
          transfers instructions (in advance) and data
          contains:
            • special registers (segment registers, IP)
            • instruction queue, bus amplifiers
  8088
    identical to the 8086, but with 8 data signals on the external bus
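The practical effect of the narrower external data bus can be estimated by counting transfer cycles. The sketch below assumes that every cycle moves a full bus width of data and ignores alignment effects; `bus_cycles` is an illustrative name.

```python
def bus_cycles(num_bytes: int, data_lines: int) -> int:
    """Transfer cycles needed to move num_bytes over a bus with the given number of data lines."""
    bytes_per_cycle = data_lines // 8
    return -(-num_bytes // bytes_per_cycle)  # ceiling division


# Moving one 16-bit word and a 6-byte block on the 8086 (16 lines) and the 8088 (8 lines).
for name, lines in (("8086", 16), ("8088", 8)):
    print(name, "word:", bus_cycles(2, lines), "cycles; 6 bytes:", bus_cycles(6, lines), "cycles")
```

Under these assumptions the 8088 needs twice as many cycles as the 8086 for the same data, although both execute the same instructions.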
I80286
  16-bit processor
  16 data lines, 24 address lines (16 MB addressing space)
  Working modes: real and protected (privileged)
  [Figure: Internal structure of the I80286 processor. Addressing unit; interfacing unit (data amplifiers, address amplifiers, bus control) connected to the external bus; execution unit; instruction unit (instruction queue, instruction decode)]
I80386
  32-bit processor, 32 data lines, 32 address lines (4 GB addressing space)
  General registers extended to 32 bits
  2 extra segment registers (FS and GS)
  Improved protected mode
  [Figure: Internal structure of the I80386 processor. Segmenting unit, paging unit, execution unit, interface unit, decoding unit, instruction prefetch unit]
I80486
  Integrates: processor + co-processor + MMU
  Enables the use of cache memory
  Improved protected mode
  [Figure: Internal structure of the I80486. Segmenting unit, paging unit, integer execution unit, cache unit, bus interface unit, floating-point execution unit, instruction decoder, instruction prefetch unit]
Pentium
  Two pipelines: U (integers) and V (floats)
  64-bit external bus (for a 32-bit processor)
  Versions:
    Pentium: 2-pipeline architecture
    Pentium Pro, Pentium II, Pentium III: superscalar P6 architecture
    Pentium IV: NetBurst architecture
    I7, I5, I3: multicore and hyper-threading
Pentium Processors
  Pentium Pro
    Superscalar P6 architecture (CPI < 1)
    Dynamic instruction execution:
      • data flow analysis
      • branch prediction
      • speculative execution of instructions
  Pentium II
    MMX technology:
      • a SIMD execution unit dedicated to multimedia data
      • parallel (SIMD) execution of arithmetic operations
      • 57 new MMX instructions
  Pentium III
    SSE technology
      • parallel (SIMD) execution on floating-point variables
      • good for 2D/3D graphics
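The idea of SIMD execution on packed data can be illustrated with a small software simulation. The sketch below mimics the behaviour of a packed byte addition in the style of the MMX `PADDB` instruction: eight independent 8-bit lanes in a 64-bit value, each wrapping around on overflow. It is a Python model written for illustration, not the hardware implementation, and the function name `paddb` is only borrowed from the instruction mnemonic.

```python
def paddb(a: int, b: int) -> int:
    """Add eight unsigned 8-bit lanes packed in two 64-bit values, wrapping each lane."""
    result = 0
    for lane in range(8):
        shift = 8 * lane
        x = (a >> shift) & 0xFF          # extract lane from the first operand
        y = (b >> shift) & 0xFF          # extract lane from the second operand
        result |= ((x + y) & 0xFF) << shift  # a carry never crosses into the next lane
    return result


a = 0x01_02_03_04_05_06_07_FF
b = 0x01_01_01_01_01_01_01_01
print(hex(paddb(a, b)))
```

A single instruction of this kind replaces eight separate byte additions, which is where the gain for multimedia data comes from. Note that the lowest lane (0xFF + 0x01) wraps to 0x00 without disturbing its neighbour.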
P6 superscalar architecture
  3 autonomous units, 12 pipeline stages
  Speculative execution
  [Figure: Functional blocks of the P6 architecture. Instruction fetch and decode unit, instruction dispatch and execute unit, and retirement unit, all connected to the instruction pool]
Detailed view of the P6 architecture
  [Figure: the system bus and the L2 cache connect to the bus interface unit (BIU), which feeds the L1 ICache and the L1 DCache; the instruction fetch and decode unit, the instruction dispatch and execute unit and the retirement unit communicate through the instruction pool]
Instruction fetch and decoding unit
  Fetches and decodes instructions in advance
  In-order unit
  3 instructions decoded per clock
  Branch prediction
  Components:
    Decoder (3 units)
    Address generator unit (next_IP)
    Branch target buffer
    Micro-operation sequencer
    Alias register allocator
  [Figure: Instruction fetch and decoding unit. Instructions arrive from the BIU (Bus Interface Unit) through the L1 ICache; Next_IP and the branch target buffer select the fetch address; the instruction decoder (x3) and the micro-operation sequencer feed the alias register allocator, which sends micro-operations to the instruction pool]
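The slide does not describe the prediction algorithm used by the P6 branch target buffer. As a generic illustration of dynamic branch prediction, the sketch below uses the textbook 2-bit saturating counter per branch address; the class name, the initial counter value and the test pattern are assumptions chosen for the example.

```python
class TwoBitPredictor:
    """One 2-bit saturating counter per branch address (0-1: not taken, 2-3: taken)."""

    def __init__(self) -> None:
        self.counters = {}  # branch address -> counter value in 0..3

    def predict(self, addr: int) -> bool:
        # Unknown branches start as "weakly not taken" (counter = 1), an arbitrary choice.
        return self.counters.get(addr, 1) >= 2

    def update(self, addr: int, taken: bool) -> None:
        c = self.counters.get(addr, 1)
        self.counters[addr] = min(3, c + 1) if taken else max(0, c - 1)


def count_correct(outcomes, addr: int = 0x400) -> int:
    """Run the predictor over a sequence of actual outcomes and count correct predictions."""
    predictor = TwoBitPredictor()
    correct = 0
    for taken in outcomes:
        if predictor.predict(addr) == taken:
            correct += 1
        predictor.update(addr, taken)
    return correct


# A loop branch taken 9 times and then not taken, executed twice.
pattern = ([True] * 9 + [False]) * 2
print(count_correct(pattern), "correct predictions out of", len(pattern))
```

With two bits of history a single loop exit does not flip the prediction, so the second execution of the loop mispredicts only at its exit.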
Instruction dispatch and execute unit
  Responsible for instruction execution
  Out-of-order unit
  7 execution units + reservation station
    IEU: Integer Execution Unit
    FEU: Floating-point Execution Unit
    MMX: Multimedia execution unit
    AGU: Address Generation Unit
    JEU: Jump Execution Unit
  [Figure: Instruction dispatch and execute unit. The reservation station receives micro-operations from the instruction pool and dispatches them through Port 0 (IEU, FEU, MMX), Port 1 (IEU, JEU, MMX), Port 2 (AGU, read) and Ports 3, 4 (AGU, write)]
Retirement Unit
  Re-establishes the normal (program) order of the instructions and of their results
  In-order unit
  Components:
    MIU: Memory Interface Unit
    RRF: Retirement Register File
  [Figure: Retirement unit. The MIU connects to the DCache and to the reservation station; the RRF connects to the instruction pool]
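The role of an in-order retirement stage after out-of-order execution can be sketched in a few lines. The model below is a simplification for illustration (the function name and data layout are hypothetical): instructions are identified by their position in program order, they finish executing in an arbitrary order, and an instruction retires only when all earlier instructions have retired.

```python
def retire_timeline(completion_order):
    """For each completion event, list the instructions (program-order indices) retired at that moment."""
    done = set()          # instructions that finished executing but may not be retired yet
    next_to_retire = 0    # oldest instruction not yet retired
    timeline = []
    for idx in completion_order:
        done.add(idx)
        retired_now = []
        # Retire as many consecutive instructions as possible, oldest first.
        while next_to_retire in done:
            retired_now.append(next_to_retire)
            next_to_retire += 1
        timeline.append(retired_now)
    return timeline


# Instruction 2 finishes first, but it has to wait for instructions 0 and 1.
print(retire_timeline([2, 0, 1, 3]))
```

Results become visible to the programmer in program order even though they were produced out of order.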
Solving hazard cases in the P6 architecture
  Control hazard:
    complex branch prediction, BTB, next address predictor
    out-of-order instruction execution
    execute both branches of an if
  Data hazard:
    alias registers: renaming of registers and more internal registers (40) than those seen by the programmer
    out-of-order instruction execution
    data dependency tree
  Structural hazard:
    multiple execution units (7 ALUs)
    separate instruction and data caches
    reservation stations
  In essence it is an implementation of Tomasulo's method
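Register renaming with alias registers can be illustrated with a small sketch. It is a simplified model, not the P6 allocator: instructions are `(destination, source1, source2)` tuples, every new result gets a fresh physical register named `p0`, `p1`, ..., and the pool size of 40 follows the figure quoted on the slide. The function name `rename` and the register naming are assumptions for the example.

```python
def rename(instructions, num_physical: int = 40):
    """Map architectural registers to physical ones so that only true (read-after-write) dependencies remain."""
    mapping = {}      # architectural register -> physical register holding its latest value
    allocated = 0     # number of physical registers handed out so far

    def fresh() -> str:
        nonlocal allocated
        if allocated >= num_physical:
            raise RuntimeError("no free physical register (a real processor would stall)")
        name = f"p{allocated}"
        allocated += 1
        return name

    renamed = []
    for dest, src1, src2 in instructions:
        # Sources read the current mapping; values defined before the fragment get a register on first use.
        for src in (src1, src2):
            if src not in mapping:
                mapping[src] = fresh()
        s1, s2 = mapping[src1], mapping[src2]
        # Every write gets a new physical register, which removes WAR and WAW conflicts.
        mapping[dest] = fresh()
        renamed.append((mapping[dest], s1, s2))
    return renamed


program = [
    ("eax", "ebx", "ecx"),  # eax <- ebx op ecx
    ("edx", "eax", "ebx"),  # reads eax: true dependency on the first instruction
    ("eax", "esi", "edi"),  # writes eax again: WAW with the first, WAR with the second
]
for before, after in zip(program, rename(program)):
    print(before, "->", after)
```

After renaming, the third instruction shares no register with the first two, so it may execute before them; the second instruction still waits for the first because it reads its result.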
The P6 Bus
  The main elements of the P6 bus:
    the bus works in a synchronous mode; every signal is sampled on clock signal edges
    transfers are made through transactions that may be executed in parallel
    it is a multi-processor bus; several processors may share the same bus
    block transfers are preferred
    there are error detection and correction mechanisms
    there are mechanisms that ensure cache memory consistency
    a new digital technology (different amplifiers) that ensures high-frequency transmission on the bus
Transfer on the P6 bus
  Parallel transactions (pipeline)
  Phases:
    Arbitration: decides which master gets access to the bus
    Transfer request: specifies the request (read or write, start address, number of bytes)
    Snooping: detects and solves cache inconsistencies
    Error: detects and solves transmission errors (ECC, error correction code, on data; parity on address and command signals)
    Response: specifies the type of the answer (now, delayed, refused)
    Transfer: data transfer in accordance with the request
  Technology: GTL (instead of TTL)
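The benefit of overlapping transactions can be estimated with a deliberately simplified model. The sketch below assumes that each of the six phases listed above lasts exactly one time slot and that a new transaction can start in every slot; on the real bus the phases have different lengths, so the numbers are only illustrative. The names `PHASES`, `schedule` and `total_slots` are hypothetical.

```python
PHASES = ["Arbitration", "Request", "Snooping", "Error", "Response", "Transfer"]


def schedule(num_transactions: int):
    """Return {slot: [(transaction, phase), ...]} for fully overlapped transactions."""
    table = {}
    for t in range(num_transactions):
        for p, phase in enumerate(PHASES):
            # Transaction t enters its p-th phase in slot t + p.
            table.setdefault(t + p, []).append((t, phase))
    return table


def total_slots(num_transactions: int, pipelined: bool = True) -> int:
    """Slots needed to finish all transactions, with or without overlap."""
    if pipelined:
        return len(PHASES) + num_transactions - 1
    return len(PHASES) * num_transactions


for slot, active in sorted(schedule(3).items()):
    print(slot, active)
print("pipelined:", total_slots(3), "slots; sequential:", total_slots(3, pipelined=False), "slots")
```

In this model, while one transaction is in its snooping phase the next one is already issuing its request and a third is in arbitration.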
Time diagram for the P6 bus
  [Figure: Concurrent transactions on the P6 bus. Over clock periods 1-16 of BCLK, the rows Arbitration, Request, Error, Snooping, Response and Transfer show the phases of several overlapping transactions]
Pentium IV: NetBurst Architecture (7th generation)
  a 20-stage pipeline architecture
    double compared with P6
  bus frequency is increased 4 times
    400 MHz, with "quad pump" technology, 3.2 Gbytes/s transfer speed
  doubles the speed of the ALU
    2 arithmetical operations are executed in every clock period; the ALU works with a double-frequency clock
  the use of very high speed cache memory
    Advanced Transfer Cache, which provides 64 Gbytes/s data transfer at 2 GHz
  extension of the MMX technology
    SSE2 (Streaming SIMD Extensions 2)
    144 new SIMD instructions that extend the data width to 128 bits (16 bytes processed in parallel)
  improvement of branch prediction by approx. 30%
    through the extension of the BTB unit and
    by increasing the instruction queue to 126 instructions
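The two transfer rates quoted above can be reproduced with simple arithmetic. The sketch assumes a 100 MHz base bus clock with 4 transfers per clock ("quad pump") on a 64-bit data bus, and a 256-bit cache interface delivering one transfer per core clock; these widths and the base clock are assumptions not stated on the slide, and the function name is illustrative.

```python
def bandwidth_bytes_per_s(clock_hz: int, transfers_per_clock: int, width_bits: int) -> int:
    """Peak transfer rate = clock frequency x transfers per clock x bytes per transfer."""
    return clock_hz * transfers_per_clock * width_bits // 8


# Front-side bus: 100 MHz x 4 transfers/clock = 400 MT/s on 64 data lines (assumed values).
fsb = bandwidth_bytes_per_s(100_000_000, 4, 64)

# Advanced Transfer Cache at a 2 GHz core clock with an assumed 256-bit interface.
cache = bandwidth_bytes_per_s(2_000_000_000, 1, 256)

print("bus:  ", fsb / 1e9, "Gbytes/s")
print("cache:", cache / 1e9, "Gbytes/s")
```

These are peak figures; sustained rates depend on how often the bus or the cache can actually deliver a transfer in every slot.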
Pentium IV
  [Figure: The NetBurst Pentium IV architecture.
    Interface with the external bus: L2 cache and control
    Instruction fetch and decode: BTB, decoder, alias register allocator, trace cache, ROM
    Instruction scheduling and execution: instruction queues for micro-operations, schedulers, registers for floats, registers for integers, 4 ALUs, 2 AGUs, 2 floating-point ALUs (ALU-F), L1 D-Cache]
Pentium IV
  New tendencies:
    Hyper-threading technology
      • two threads executed in parallel on the same core
    Multi-core technology
      • several processors on the same chip
    64-bit architecture
I7, I5, I3
  [Figure: Nehalem architecture, internal view]
Nehalem architecture: external view
  [Figure: Nehalem architecture, external view]
Nehalem architecture: multiprocessor configuration
  [Figure: two multiprocessor configurations, one communicating over the FSB (Front Side Bus) and one over QPI (QuickPath Interconnect)]
Sandy Bridge architecture
  The north bridge (memory controller, graphics controller and PCI Express controller) is integrated in the same chip as the rest of the CPU. The first models will use a 32-nm manufacturing process
  Ring architecture: 256 bits/cycle
  Two load/store operations per CPU cycle for each memory channel
  New decoded microinstruction cache (L0 cache, capable of storing 1,536 microinstructions, which corresponds to roughly 6 kB)
  32 kB L1 instruction cache and 32 kB L1 data cache per CPU core (no change from Nehalem)
  The L2 memory cache was renamed "mid-level cache" (MLC), with 256 kB per CPU core
  The L3 memory cache is now called LLC (Last Level Cache); it is not unified anymore and is shared by the CPU cores and the graphics engine
  Next-generation Turbo Boost technology
  New AVX (Advanced Vector Extensions) instruction set
  Up to 8 physical cores, or 16 logical cores through Hyper-threading
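Two of the figures above follow from simple arithmetic. The sketch below assumes about 4 bytes per stored microinstruction (the value implied by the "roughly 6 kB" estimate, not an official figure) and 2 hardware threads per core for Hyper-threading; the constant and function names are illustrative.

```python
UOP_CACHE_ENTRIES = 1536   # from the slide
BYTES_PER_UOP = 4          # assumption implied by the "roughly 6 kB" estimate


def uop_cache_bytes(entries: int = UOP_CACHE_ENTRIES, bytes_per_uop: int = BYTES_PER_UOP) -> int:
    """Approximate storage needed for the decoded microinstruction cache."""
    return entries * bytes_per_uop


def logical_cores(physical_cores: int, threads_per_core: int = 2) -> int:
    """Logical cores seen by the operating system when each core runs several threads."""
    return physical_cores * threads_per_core


print(uop_cache_bytes() // 1024, "kB for the microinstruction cache")
print(logical_cores(8), "logical cores on an 8-core chip")
```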
Sandy Bridge architecture
  [Figure: two configurations, one with 1 processor and 4 cores, and one with 2 processors and 8 cores per processor]
Evolution of Intel processor architectures