[email protected] 2017 1

The CPU

Computer Architecture

1DT016 distance, Fall 2017

http://xyx.se/1DT016/index.php

Per Foyer

Mail: [email protected]

Where in the machine now?

Level 5: Problem-oriented language level
    int addmul( int t ){ return (t + 2) * 2; }
  Translation (compiler)

Level 4: Assembly language level
    addmul: addi $r1, $zero, 2
            mul $r1, $r1, 2
            jr $ra
  Translation (assembler)

Level 3: Operating system machine level
    li $v0, 4
    syscall
  Partial interpretation (OS)

Level 2: Conventional machine level
    0x24020004 0x0000000c 0x03E00008
  Interpretation (microprogram)

Level 1: Microprogramming level
    110110101111010000010110000100010011111010100001
  Executed by hardware

Level 0: Digital Logic Level
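The step from assembly language to the conventional machine level can be made concrete by pulling a machine word apart into its fields. The following is a minimal sketch, assuming the standard MIPS I-type layout (6-bit opcode, two 5-bit register numbers, 16-bit immediate); the names `itype_fields`, `decode_itype` and `check_slide_word` are illustrative and not part of the course material.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of a 32-bit MIPS I-type instruction word. */
typedef struct {
    unsigned opcode; /* bits 31..26 */
    unsigned rs;     /* bits 25..21 */
    unsigned rt;     /* bits 20..16 */
    int      imm;    /* bits 15..0, sign-extended */
} itype_fields;

/* Split a machine word into its I-type fields. */
itype_fields decode_itype(uint32_t word)
{
    itype_fields f;
    f.opcode = (word >> 26) & 0x3Fu;
    f.rs     = (word >> 21) & 0x1Fu;
    f.rt     = (word >> 16) & 0x1Fu;
    f.imm    = (int)(word & 0xFFFFu);
    if (f.imm & 0x8000) {
        f.imm -= 0x10000; /* sign-extend the 16-bit immediate */
    }
    return f;
}

/* The first word shown at the machine level, 0x24020004, is
 * "li $v0, 4", which assemblers emit as addiu $v0, $zero, 4. */
void check_slide_word(void)
{
    itype_fields f = decode_itype(0x24020004u);
    assert(f.opcode == 0x09u); /* addiu */
    assert(f.rs == 0u);        /* $zero */
    assert(f.rt == 2u);        /* $v0 */
    assert(f.imm == 4);
}
```

The other two words (0x0000000c and 0x03E00008) have opcode 0 and use the R-type layout, which this sketch does not decode.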

Intel 4004

MCS-4 (chipset):

i4001: ROM (256 bytes)

i4002: RAM (40 bytes)

i4003: Shift register (10 bits)

i4004: CPU (4-bit)

4-bit

Designed by Federico Faggin

Intel 4040

Advanced features at the time

4-bit

Intel 8008

8-bit

In principle a stretched i4004

Intel 8080

8-bit

Zilog Z80

8-bit

Zilog Z80 (2)

8-bit

CISC – The early days

Complex Instruction Set Computer

•Primary memory was slow and expensive

•Reduce memory access: the more that could be done inside the CPU, the better

•µ-code can be (quite) easily changed: enhance or reduce the ISA, fix machine-level bugs

•The more machine instructions available, the easier it is to write high-level compilers producing ”tight code”

•Less primary memory (PM) is needed to store machine instructions

Result: large and writable µ-stores, and in some cases even nano-code

Example: VAX 11/750 had 303 µ-coded assembler instructions

Example: IBM 4341 mainframe

1. IPL1 (hardware): Read CPU instruction set from removable media (5 ¼” floppy)

2. IPL2: Read boot firmware from removable media (5 ¼” floppy)

3. IPL3: Use firmware to boot OS loader from disk drive 0.

4. IPL4: Load OS from disk drive x.

CISC Galore!

• CPU machine instructions can be added or removed

• Bugs in implementations of CPU machine instructions can be corrected

IPL = Initial Program Loader

µ-coded CISC Trivia

• It’s almost impossible to design a CPU without bugs on the hardware level (including µ-code)

• Intel had problems with the infamous FDIV (floating point divide) FPU instruction in the Pentium family.

• The fault was in a lookup table used by the hardware divider, so it was fixed between CPU steppings (hardware revisions). Later Intel processors have field-updatable µ-code, which lets many errata be corrected without new silicon.

• CPU bugs are so common that vendors release several errata documents for the same type of CPU during its life span (but none after EOL)

Intel i7 bugs

This errata list goes on for 14 pages...

Intel 8088 / 8086

16-bit

CISC: MC68000

16-bit

Asynchronous bus traffic!

Intel 80286

16-bit

Intel 80386 (386DX/SX)

32-bit

MMU

Intel 80486 (486DX/SX)

32-bit

Intel Pentium (”586”)

32-bit

Flynn’s [1] taxonomy

Type   Instructions   Datum [2]   Examples
SISD   1              1           Classic vN / Harvard
SIMD   1              Multi       Vector processors
MISD   Multi          1           Fault tolerant systems
MIMD   Multi          Multi       Multiprocessors

[1] Michael J. Flynn, Stanford university, 1966

[2] Datum may refer to a part of a data set, e.g. in shared memory

Intel Pentium MMX

MMX – SIMD

3DNOW!

SSE

APIC – Advanced Programmable Interrupt Controller

32-bit
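The SIMD idea behind MMX (one instruction, several data items) can be imitated in portable C by packing four 8-bit lanes into one 32-bit word. This is only a sketch of the principle, not real MMX code; `packed_add_u8`, `scalar_add_u8` and `check_lanes` are hypothetical names. The packed version adds all four lanes with a handful of word-wide operations, while the scalar version handles one lane per loop iteration in SISD fashion.

```c
#include <assert.h>
#include <stdint.h>

/* SIMD-style: add four unsigned 8-bit lanes at once.
 * Each lane wraps around on its own; no carry leaks into its neighbour. */
uint32_t packed_add_u8(uint32_t a, uint32_t b)
{
    const uint32_t high = 0x80808080u;            /* top bit of every lane */
    uint32_t low_sum = (a & ~high) + (b & ~high); /* add the low 7 bits per lane */
    return low_sum ^ ((a ^ b) & high);            /* add the top bits without carry-out */
}

/* SISD-style reference: one lane per loop iteration. */
uint32_t scalar_add_u8(uint32_t a, uint32_t b)
{
    uint32_t result = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint8_t x = (uint8_t)(a >> (8 * lane));
        uint8_t y = (uint8_t)(b >> (8 * lane));
        uint8_t sum = (uint8_t)(x + y);
        result |= (uint32_t)sum << (8 * lane);
    }
    return result;
}

/* Both versions must agree, including when a lane overflows. */
void check_lanes(void)
{
    assert(packed_add_u8(0x01020304u, 0x10203040u) == 0x11223344u);
    assert(packed_add_u8(0xFF000001u, 0x01000001u) ==
           scalar_add_u8(0xFF000001u, 0x01000001u));
}
```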

A µ-Coded CPU in no time

What do we need?

•A Register bank (or a few discrete registers)

•PC and SP registers

•One or two ALUs with a set of ALUops

•A flag register: Z, P, N, C, …

•A unidirectional MAR (Memory Address Register)

•A bidirectional MDR (Memory Data Register)

•Outgoing control signals (MREQ, IORQ, RD, WR, …)

•An internal databus and a control bus

•A µ-coded Control Unit (CU)

…and, of course, an …
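As an illustration of how the ALU and the flag register work together, here is a minimal sketch of an 8-bit add that updates Z, P, N and C. The 8-bit width, the names `flags` and `alu_add8`, and the convention that P is set on even parity (as on the 8080 and Z80) are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Flag register, one field per flag (each is 0 or 1). */
typedef struct {
    unsigned z; /* Zero: result is 0 */
    unsigned p; /* Parity: result has an even number of 1 bits (assumed convention) */
    unsigned n; /* Negative: bit 7 of the result */
    unsigned c; /* Carry: carry out of bit 7 */
} flags;

/* ALUopADD on 8-bit operands; returns the result and updates the flags. */
uint8_t alu_add8(uint8_t a, uint8_t b, flags *f)
{
    assert(f != NULL);
    unsigned wide = (unsigned)a + (unsigned)b; /* 9-bit sum */
    uint8_t result = (uint8_t)wide;
    unsigned ones = 0;
    for (unsigned bit = 0; bit < 8; bit++) {
        ones += (result >> bit) & 1u;
    }
    f->z = (result == 0);
    f->p = (ones % 2u == 0u);
    f->n = (result >> 7) & 1u;
    f->c = (wide >> 8) & 1u;
    return result;
}
```

For example, adding 0xFF and 0x01 gives 0x00 with Z and C set, while 0x7F plus 0x01 gives 0x80 with only N set.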

µControl Unit: Horizontal µ-code

• Maximum parallelism, given the number of bits from the µROM

• Many control lines make it easier to modify the ISA

• µROM acts as a sequencer of arbitrary control signals

• Easily expanded by adding parallel ROMs

• Large ROMs cost expensive silicon die space

• Sometimes called Wide µ-code

µControl Unit: Vertical µ-code

• Must carefully group signals together so partial parallelism can be guaranteed

• Uses less high speed ROM which frees up space on silicon die

• Less flexibility to enhance the ISA

• Sometimes called narrow µ-code
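The difference between the two µ-code styles can be sketched in C. In a horizontal µ-word every control line has its own bit, so any combination can be asserted in one step. In a vertical µ-word, mutually exclusive signals share an encoded field that a decoder expands again. The eight control lines and the grouping into one "source" field and one "sink" field below are hypothetical, chosen only to show the principle.

```c
#include <assert.h>
#include <stdint.h>

/* Horizontal µ-word: one bit per (hypothetical) control line. */
enum {
    PC_OUT  = 1u << 0, MDR_OUT = 1u << 1, ALU_OUT = 1u << 2, REG_OUT = 1u << 3,
    MAR_IN  = 1u << 4, IR_IN   = 1u << 5, ALU1_IN = 1u << 6, REG_IN  = 1u << 7
};

/* Vertical µ-word: two encoded fields instead of eight separate bits.
 * src: 0 = nothing drives the bus, 1..4 = PC, MDR, ALU, REG
 * dst: 0 = nothing latches the bus, 1..4 = MAR, IR, ALU1, REG
 * The decoder turns each field back into a single control line. */
uint8_t expand_vertical(unsigned src, unsigned dst)
{
    assert(src <= 4u && dst <= 4u);
    uint8_t lines = 0;
    if (src != 0u) {
        lines |= (uint8_t)(1u << (src - 1u));      /* sources live in bits 0..3 */
    }
    if (dst != 0u) {
        lines |= (uint8_t)(1u << (dst - 1u + 4u)); /* sinks live in bits 4..7 */
    }
    return lines;
}
```

With these assumptions `expand_vertical(1, 1)` yields `PC_OUT | MAR_IN`. A horizontal word such as `MDR_OUT | IR_IN | REG_IN` latches the bus into two sinks in the same step, which the encoded form cannot express; that is the parallelism given up in exchange for a narrower ROM.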

µControl Unit with Lookup table

• The machine instruction in IR is the index into the Lookup ROM

• The value in the Lookup ROM is the start address (set in µPC) for the µ-code corresponding to the machine code in IR
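A lookup ROM of this kind is essentially an array indexed by the opcode. The sketch below assumes a MIPS-style 6-bit opcode in IR[31:26]; the µ-code start addresses and the names `lookup_rom` and `upc_start` are invented for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define OPCODE_COUNT 64u /* a 6-bit opcode gives 64 ROM entries */

/* Lookup ROM: opcode -> start address of its µ-routine.
 * The addresses are hypothetical; entries not listed default to 0,
 * which could hold an "illegal instruction" routine. */
static const uint16_t lookup_rom[OPCODE_COUNT] = {
    [0x00] = 0x010, /* R-type instructions */
    [0x23] = 0x020, /* lw */
    [0x2B] = 0x030  /* sw */
};

/* Value loaded into µPC for the instruction currently held in IR. */
uint16_t upc_start(uint32_t ir)
{
    unsigned opcode = (ir >> 26) & 0x3Fu; /* IR[31:26] */
    assert(opcode < OPCODE_COUNT);
    return lookup_rom[opcode];
}
```

Adding a machine instruction then amounts to writing a new µ-routine and filling in one more ROM entry.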

µ-code: Fetch, Decode and Execute

Example: lw $t1, 100($t2)

Fetch

1. PCout, MARin, MARlatch, PCincr

2. MREQout, memRD, memWAIT

3. MDRin, MDRlatch, MDRout, IRin, IRlatch

Decode

1. µJTable[ IR[31:26]out ], µPCin

Execute

1. IR[25:21]out (base register $t2), REGBANKin, REG-RD, REGBANKout, ALU1in, IR[15:0]out, ALU2in, ALUopADD, µPCincr

2. ALUout, MARin, MARlatch, µPCincr

3. MREQout, memRD, memWAIT

4. MDRin, MDRlatch, MDRout, IR[20:16]out (destination register $t1), REGBANKin, REG-WR, GotoFETCH
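The same fetch, decode and execute sequence can be traced in software. This is a rough sketch under several assumptions: a tiny word-organised memory, only the lw instruction implemented, and the standard MIPS I-type encoding in which bits 25..21 name the base register and bits 20..16 the destination. The names `cpu`, `mem_read` and `step` are illustrative.

```c
#include <assert.h>
#include <stdint.h>

#define MEM_WORDS 64u

typedef struct {
    uint32_t pc, ir, mar, mdr;
    uint32_t reg[32];
    uint32_t mem[MEM_WORDS]; /* 256 bytes, stored as 32-bit words */
} cpu;

/* MREQout, memRD, memWAIT: read the word at a byte address. */
static uint32_t mem_read(const cpu *c, uint32_t byte_addr)
{
    assert(byte_addr % 4u == 0u && byte_addr / 4u < MEM_WORDS);
    return c->mem[byte_addr / 4u];
}

/* One instruction cycle. Returns 0 on success, -1 if the opcode is not lw. */
int step(cpu *c)
{
    /* Fetch */
    c->mar = c->pc;               /* PCout, MARin, MARlatch */
    c->pc += 4u;                  /* PCincr */
    c->mdr = mem_read(c, c->mar); /* memory read, MDRin, MDRlatch */
    c->ir = c->mdr;               /* MDRout, IRin, IRlatch */

    /* Decode: IR[31:26] selects the µ-routine; only lw (0x23) exists here */
    unsigned opcode = (c->ir >> 26) & 0x3Fu;
    if (opcode != 0x23u) {
        return -1;
    }

    /* Execute lw rt, offset(rs) */
    unsigned rs = (c->ir >> 21) & 0x1Fu; /* base register */
    unsigned rt = (c->ir >> 16) & 0x1Fu; /* destination register */
    uint32_t imm = c->ir & 0xFFFFu;
    uint32_t offset = (imm & 0x8000u) ? (imm | 0xFFFF0000u) : imm; /* sign-extend */
    c->mar = c->reg[rs] + offset; /* ALUopADD, ALUout, MARin, MARlatch */
    c->mdr = mem_read(c, c->mar); /* memory read, MDRin, MDRlatch */
    if (rt != 0u) {
        c->reg[rt] = c->mdr;      /* MDRout, REGBANKin, REG-WR; $zero stays 0 */
    }
    return 0;
}
```

With 0x8D490064 (the encoding of lw $t1, 100($t2)) at address 0, $t2 set to 0 and the value 42 stored at byte address 100, one call to `step` leaves 42 in register 9 ($t1) and advances PC to 4.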

The birth of RISC

RISC – The early days

Reduced Instruction Set Computer

Motives:

•CISC ISAs are often overly complex

•Many CISC instructions are very seldom used

•Analyzing an arbitrary program reveals that it is mostly written with just a small number of basic constructs:

• Simple variable or memory assignments
• If … then … else (conditional jumps/branches)
• Loops
• Subroutine / function calls

•How many instructions in the Intel i7 are used less than 0.25% of the time? Is it really worth having them on silicon? Probably not, but there is another story to this: the need to be compatible with every earlier x86 processor ever made…

RISC – The ideas

• Create a set of a few very carefully chosen machine instructions of a single fixed size

• Only Load and Store instructions refer to memory

• Create optimizing compilers that make full use of these few machine instructions

• Replace µ-code with hardwired control logic

• Reg-to-Reg operations in one clock cycle

• Complex math instructions are moved to a co-processor

• Fewer instructions means freed space on silicon that can be used for pipelines, larger register files and caches.

RISC – The ideas (2)

Main goals:

•Make the datapath turnaround time as short as possible.

•When no more instructions can be removed, the specification of the RISC ISA is finalized.

MIPS R3000

32-bit

Microprocessor without Interlocking Pipe Stages

MIPS single cycle datapath

Pipelining Analogy

Pipelined laundry: Overlapping execution

Parallelism improves performance

Pipelining

Design the CPU with the overall goal to start a new instruction every clock cycle

Use one pipeline stage for each step of the instruction cycle:

1. Instruction fetch [IF]
   • Get instruction from program memory

2. Instruction decode [ID]
   • Translate opcode to control signals and read registers

3. Execute [EX]
   • Perform ALU operation, calculate branch targets

4. Memory [MEM] (data)
   • Access memory if needed (Load/Store)

5. Write back [WB]
   • Update register file

Single-cycle vs. Multicycle vs. Pipelined

MIPS pipelined datapath

Time graphs

Clock cycle

Latency: 5 cycles

Throughput: 1 inst. / cycle

Concurrency: 5
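These figures lead to a simple cycle count: the first instruction needs one cycle per stage to get through the pipeline, and every following instruction completes one cycle later unless a stall intervenes. The sketch below is a back-of-the-envelope model; the function names are hypothetical, and the multicycle comparison assumes that every instruction passes through all stages.

```c
#include <assert.h>

/* Cycles needed by a pipeline with `stages` stages to complete
 * `n` instructions, given `stalls` extra stall cycles. */
unsigned long pipelined_cycles(unsigned stages, unsigned long n, unsigned long stalls)
{
    assert(stages > 0u);
    if (n == 0ul) {
        return 0ul;
    }
    return stages + (n - 1ul) + stalls;
}

/* Cycles without overlap: each instruction uses every stage in turn. */
unsigned long multicycle_cycles(unsigned stages, unsigned long n)
{
    assert(stages > 0u);
    return (unsigned long)stages * n;
}
```

For a 5-stage pipeline, seven instructions take 5 + 6 = 11 cycles without stalls, compared with 35 cycles when the stages are not overlapped.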

Pipelining: Hazards

Situations that prevent starting the next instruction in the next cycle (creating pipeline stalls):

•Structure hazards

• A required resource is busy

•Data hazards

• Need to wait for previous instruction to complete its read/write

add $s0, $t0, $t1
sub $t2, $s0, $t3

•Control hazard

• Deciding on control action depends on previous instruction

Pipelining: Data Hazards

Dependencies backward in time cause hazards

Example: Instruction flow – 5 stage pipeline:

lw $1, 4($2)
sub $4, $1, $5 # $1 is still in pipeline
and $6, $1, $7
or $8, $1, $9 # $1 available in stage 4
xor $4, $1, $5

”Load-use” data hazard

•May be ”fixed” with a pipeline stall

•…or by inserting NOPs

•…or reordering instructions
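Whichever fix is chosen, the hardware or the compiler first has to recognise the hazard. A minimal sketch of that check, using a simplified instruction record invented for this example (`instr` and `load_use_hazard` are hypothetical names):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified view of an instruction: which registers it reads and writes. */
typedef struct {
    bool is_load; /* true for lw */
    int  dest;    /* register written, -1 if none */
    int  src1;    /* first register read, -1 if unused */
    int  src2;    /* second register read, -1 if unused */
} instr;

/* True when `next` reads the register that the load `prev`,
 * issued immediately before it, is still fetching from memory. */
bool load_use_hazard(const instr *prev, const instr *next)
{
    assert(prev != NULL && next != NULL);
    if (!prev->is_load || prev->dest <= 0) {
        return false; /* not a load, or the target is $0, which is never written */
    }
    return next->src1 == prev->dest || next->src2 == prev->dest;
}
```

For the sequence above, `lw $1, 4($2)` followed by `sub $4, $1, $5` is reported as a hazard, whereas `sub` followed by `and` is not, since `sub` is not a load.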

Pipelining: Structure hazards

• Conflict for use of a resource

• In MIPS pipeline with single memory:

• Load/store requires data access
• Instruction fetch would have to stall for that cycle

This would cause a pipeline ”bubble”

• Hence, pipelined datapaths require separate instruction/data memories

• …or separate instruction/data caches

Pipelining: Control Hazards

• When the flow of instruction addresses is not sequential (i.e. not PC = PC + 4), due to change of instruction flow

• Unconditional branches (j, jal, jr)
• Conditional branches (beq, bne,…)
• Exceptions (internal or external interrupts)

• Possible approaches

• Stall (impacts CPI – Clocks Per Instruction)
• Move decision point as early in the pipeline as possible, thereby reducing the number of stall cycles
• Delay decision (requires compiler support)

• Control hazards occur less frequently than data hazards

• Jumps are very infrequent – only 3% of the instructions in a normal program
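The effect of stalling on CPI can be estimated with a one-line formula: effective CPI = base CPI + (fraction of instructions that stall) × (stall cycles per such instruction). The sketch below only evaluates that formula; the function name and the 2-cycle penalty used in the example are assumptions, not figures from the course.

```c
#include <assert.h>

/* Effective clocks per instruction when a given fraction of the
 * instructions each cost `penalty_cycles` extra stall cycles. */
double effective_cpi(double base_cpi, double stall_fraction, double penalty_cycles)
{
    assert(stall_fraction >= 0.0 && stall_fraction <= 1.0);
    assert(penalty_cycles >= 0.0);
    return base_cpi + stall_fraction * penalty_cycles;
}
```

With a base CPI of 1, jumps making up 3% of the instructions and an assumed penalty of 2 cycles each, `effective_cpi(1.0, 0.03, 2.0)` comes to about 1.06.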

Code reorder (”afterburner”)

C / C++, …
→ gcc -S …
→ Assembly code
→ Reorganizer
→ Reordered assembly code
→ gas -o … (Assembler)
→ Object code
→ ld -o …
→ (binary executable)

Code scheduling to avoid stalls

• Reorder code to avoid use of load result in the next instruction

• Example: a = b + e; c = b + f;

Original code (13 cycles):

lw $t1, 0($t0) # b
lw $t2, 4($t0) # e
(Stall)
add $t3, $t1, $t2 # b + e
sw $t3, 12($t0) # a
lw $t4, 8($t0) # f
(Stall)
add $t5, $t1, $t4 # b + f
sw $t5, 16($t0) # c

Reordered code (11 cycles):

lw $t1, 0($t0) # b
lw $t2, 4($t0) # e
lw $t4, 8($t0) # f
add $t3, $t1, $t2 # b + e
sw $t3, 12($t0) # a
add $t5, $t1, $t4 # b + f
sw $t5, 16($t0) # c
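The two cycle counts can be reproduced with a small model. It assumes a 5-stage pipeline in which using a loaded value in the very next instruction costs exactly one stall cycle and no other hazards occur. The `instr` record, the register constants and the function names are hypothetical, made up for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { NONE = -1, T0 = 8, T1, T2, T3, T4, T5 }; /* MIPS register numbers */

typedef struct {
    bool is_load;
    int  dest, src1, src2; /* NONE where unused */
} instr;

/* a = b + e; c = b + f; in the original order */
const instr original[7] = {
    { true,  T1,   T0, NONE }, /* lw  $t1, 0($t0)   */
    { true,  T2,   T0, NONE }, /* lw  $t2, 4($t0)   */
    { false, T3,   T1, T2   }, /* add $t3, $t1, $t2 */
    { false, NONE, T3, T0   }, /* sw  $t3, 12($t0)  */
    { true,  T4,   T0, NONE }, /* lw  $t4, 8($t0)   */
    { false, T5,   T1, T4   }, /* add $t5, $t1, $t4 */
    { false, NONE, T5, T0   }  /* sw  $t5, 16($t0)  */
};

/* The same work with the third load moved up */
const instr reordered[7] = {
    { true,  T1,   T0, NONE }, /* lw  $t1, 0($t0)   */
    { true,  T2,   T0, NONE }, /* lw  $t2, 4($t0)   */
    { true,  T4,   T0, NONE }, /* lw  $t4, 8($t0)   */
    { false, T3,   T1, T2   }, /* add $t3, $t1, $t2 */
    { false, NONE, T3, T0   }, /* sw  $t3, 12($t0)  */
    { false, T5,   T1, T4   }, /* add $t5, $t1, $t4 */
    { false, NONE, T5, T0   }  /* sw  $t5, 16($t0)  */
};

/* One stall for every instruction that reads the result of the load before it. */
size_t count_load_use_stalls(const instr *code, size_t n)
{
    size_t stalls = 0;
    for (size_t i = 1; i < n; i++) {
        const instr *prev = &code[i - 1];
        const instr *cur = &code[i];
        if (prev->is_load && (cur->src1 == prev->dest || cur->src2 == prev->dest)) {
            stalls++;
        }
    }
    return stalls;
}

/* 5 cycles for the first instruction, 1 for each further one, plus stalls. */
size_t total_cycles(const instr *code, size_t n)
{
    assert(code != NULL || n == 0);
    if (n == 0) {
        return 0;
    }
    return 5 + (n - 1) + count_load_use_stalls(code, n);
}
```

Under these assumptions `total_cycles(original, 7)` is 13 and `total_cycles(reordered, 7)` is 11.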

ARM v7 core

32-bit

3-stage pipeline (F, D, E)

ARM Cortex CPU Core

32 / 64-bit

MCU with multiple ARM cores

Intel Core2 microarchitecture

64-bit

AMD Bulldozer Core

64-bit

PIC controller MCU

8-bit

von Neumann or Harvard?

AVR tiny85 MCU

von Neumann or Harvard?

8-bit

2-stage pipeline

PIC32MX795F512L MCU

32-bit

Nvidia Tegra CPU/GPU SoC

!!!

IBM Power8

Heavy usage of caches

64-bit

PowerPC G4 Altivec

64-bit

Cray-1 Vector processor

64-bit instr.

512 bit vectors

Co-processors

Used to take load off main processors

•Floating Point Units (FPU)

•I/O-processors

•Crypto co-processors

•Graphics Processing Units (GPU)

Examples from the 8086 era:

•8087 FPU

•8089 I/O processor

•8288 Bus controller

Cryptographic Co-processor
