Computer Architecture 1DT016-HT2017: The CPUxyx.se/1DT016/sessions/Session3-The-CPU.pdf · CISC – The early days Complex Instruction Set Computer •Primary memory was slow and

[email protected] 2017 1

The CPU

Computer Architecture

1DT016 distanceFall 2017

http://xyx.se/1DT016/index.php

Per FoyerMail: [email protected]

1


Where in the machine now?

2

Level 0

Level 1

Level 2

Level 3

Level 4

Level 5

Digital Logic Level

Microprogramminglevel

Conventionalmachine level

Problem-orientedlanguage level

Operating systemmachine level

Assembly languagelevel

addmul: addi $r1, $zero, 2 mul $r1, $r1, 2 jr $ra

int addmul( int t ){ return (t + 2) * 2;}

li $v0, 4syscall

0x24020004 0x0000000c0x03E00008

110110101111010000010110000100010011111010100001

Translation (compiler)

Translation (assembler)

Partial interpretation (OS)

Interpretation(microprogram)

Executed byhardware

Intel 4004


MCS-4 (chipset):i4001: ROM (256 bytes)i4002: RAM (40 bytes)i4003: Shift register (10 bits)i4004: CPU (4-bit)

4-bit

Designed by Federico Faggin

Intel 4040


Advanced features at the time

4-bit

Intel 8008


8-bit

I principle a stretched i4004

Intel 8080


8-bit

Zilog Z80


8-bit

Zilog Z80 (2)


8-bit

CISC – The early days

Complex Instruction Set Computer

•Primary memory was slow and expensive

•Reduce memory access The more that could be done inside the CPU, the better

•µ-code can be (quite) easily changed Enhance or reduce the ISA, fix machine level bugs

•The more machine instructions avaliable, the easier to write high-level compilers producing ”tight code”.

•Less amount of PM needed to store machine instructions

Gave: Large and writeable µ-stores, and in some cases even nano-code

Example: VAX 11/750 had 303 µ-coded assembler instructions


Example: IBM 4341 mainframe


1. IPL1 (hardware): Read CPU instruction set from removable media (5 ¼” floppy)

2. IPL2: Read boot firmware from removable media (5 ¼” floppy)

3. IPL3: Use firmware to boot OS loader from disk drive 0.

4. IPL4: Load OS from disk drive x.

CISC Galore!

• CPU machine instructions can be added or removed• Bugs in implementations of CPU machine instructions can be corrected

IPL = Initial Program Loader

µ-coded CISC Trivia

• It’s almost impossible to design a CPU without bugs on the hardware level (including µ-code)

• Intel had problems with the infamous FDIV (floating point divide) FPU instruction in the Pentium family.

• The affected part was defined in µ-code so the problem was fixed between CPU steppings (hardware revisions) without any hardware redesign.

• It’s so common with CPU bugs that vendors release several erratas for the same type of CPU during it’s life span (but no erratas when EOL)


Intel i7 bugs


This errata goes on for 14 pages...

Intel 8088 / 8086


16-bit

CISC: MC68000


16-bit

Asynchronous bus traffic!

Intel 80286


16-bit

Intel 80386 (386DX/SX)


32-bit

MMU

Intel 80486 (486DX/SX)


32-bit

Intel Pentium (”586”)


32-bit

Flynn’s [1] taxonomy


Type Instructions Datum [2] Examples

SISD 1 1 Classic vN / Harvard

SIMD 1 Multi Vector processors

MISD Multi 1 Fault tolerant systems

MIMD Multi Multi Multiprocessors

[1] Michael J. Flynn, Stanford university, 1966

[2] Datum may refer to a part of a data set, e.g. in shared memory

Intel Pentium MMX


MMX – SIMD3DNOW! SSE

APIC – Advanced ProgrammableInterrupt Controller

32-bit

A µ-Coded CPU in no timeWhat do we need?

•A Register bank (or a few discrete registers)

•PC and SP registers

•One or two ALUs with a set of ALUops

•A flag register: Z, P, N, C, …

•An Unidirectional MAR (Memory Address Register)

•A Bidirectional MDR (Memory Data Register)

•Outgoing control signals (MREQ, IORQ, RD, WR, …)

•An internal databus and a control bus

•A µ-coded Control Unit (CU)

…and, of course an [email protected] 2017 21

µControl Unit: Horizontal µ-code


• Maximum parallelism, given the number of bits from the µROM

• Many control lines make it easier to modify the ISA

• µROM acts as a sequencer of arbitrary control signals

• Easily expanded by adding parallel ROMs

• Large ROMs costs expensive silicon die space

• Sometimes called Wide µ-code

µControl Unit: Vertical µ-code


• Must carefully group signals together so partial parallelismcan be guaranteed

• Uses less high speed ROM which frees upp space on silicon die

• Less flexibility to enhance the ISA

• Sometimes called narrow µ-code

µControl Unit with Lookup table


• The machine instruction in IR is index to the Lookup ROM• The value in the Lookup ROM is the start address (set in µPC) for

the µ-code corresponding to the machine code in IR

µ-code: Fetch, Decode and Execute


1. PCout, MARin, MARlatch, PCincr

2. MREQout, memRD, memWAIT

3. MDRin, MDRlatch, MDRout,IRin, IRlatch

1. µJTable[ IR[31:26]out ], µPCin

1. IR[20:16]out, REGBANKin, REG-RD,REGBANKout, ALU1in, IR[15:0]out, ALUin2, ALUopADD, µPCincr

2. ALUout, MARin, MARlatch, µPCincr

3. MREQout, memRD, memWAIT

4. MDRin, MDRlatch, MDRout, IR[25:21], REGBANKin, REG-WR, GotoFETCH

lw $t1, 100($t2)

Fetch

Decode

Execute

The birth of RISC


RISC – The early daysReduced Instruction Set Computer

Motives:

•CISC ISAs often overly complex

•Many CISC instructions are very seldom used

•Analyzing an arbitrary program reveals that it most often is written with just a few number of basic constructs:

• Simple variable or memory assignments• If … then … else (conditional jumps/branches)• Loops• Subroutine / function calls

•How many instructions in the Intel i7 are used less than 0.25%? Is it really worth having them on silicon? - Probably no, but there is another story to this: The need to be compatible with every earlier x86 processor ever made…


RISC – The ideas

• Create a set of a few very carefully choosen machine instructions of a single fixed size

• Only Load and Store instructions refer to memory

• Create optimizing compilers that take full use of these few machine instructions

• Replace µ-code with hardwired control logic

• Reg-to-Reg-operations in one clock cycle

• Complex math instructions co-processor

• Fewer instructions means freed space on silicon that can be used for pipelines, larger register files and caches.


RISC – The ideas (2)

Main goals:

•Make the datapath turnaroud time as short as possible.

•When no more instructions can be removed, the specification of the RISC ISA is finalized.


MIPS R3000


32-bit

Microprocessor without Interlocking Pipe Stages

MIPS single cycle datapath


Pipelining Analogy


Pipelined laundry: Overlapping execution Parallelism improves performance

PipeliningDesign the CPU with the overall goal to start a new instruction every clock cycle

Use pipelines for each step of the instruction cycle:

1. Instruction fetch [IF]• Get instruction from program memory

2. Instruction decode [ID]• Translate opcode to control signals and read registers

3. Execute [EX]• Perform ALU operation, calculate branch tagets

4. Memory [MEM] (data)• Access memory if needed (Load/Store)

5. Write back [WB]• Update register file


Single-cycle vs. Multicycle vs. Pipelined


MIPS pipelined datapath


Time graphs


Clock cycle

Latency: 5 cycles Throughput: 1 inst. / cycle Concurrency: 5

Pipelining: Hazards

Situations that prevent starting the next instruction in the next cycle (creating pipeline stalls):

•Structure hazards

• A required resource is busy

•Data hazard

• Need to wait for previous instruction to complete its read/writeadd $s0, $t0, $t1sub $t2, $s0, $t3

•Control hazard

• Deciding on control action depends on previous instruction


Pipelining: Data Hazards

Dependencies backward in time cause hazards

Example: Instruction flow – 5 stage pipeline:

lw $1, 4($2)sub $4, $1, $5 # $1 is still in pipelineand $6, $1, $7or $8, $1, $9 # $1 available in stage 4xor $4, $1, $5

”Load-use” data hazard

•May be ”fixed” with a pipeline stall

•…or by inserting NOPs

•…or reordering instructions


Pipelining: Structure hazards

• Conflict for use of a resource

• In MIPS pipeline with single memory:

• Load/store requires data access• Instruction fetch would have to stall for that cycle

Would cause a pipeline ”bubble”

• Hence, pipelined datapaths require separate instruction/data memories

• …or separate instruction/data caches


Pipelining: Control Hazards

• When the flow of instruction addresses is not sequential (i.e. not PC = PC + 4), due to change of instruction flow

• Unconditional branches (j, jal, jr)• Conditional branches (beq, bne,…)• Exceptions (internal or external interrupts)

• Possible approaches

• Stall (impacts CPI – Clocks Per Instruction)• Move decision point as early in the pipeline as possible

thereby reducing the number of stall cycles• Delay decision (requires compiler support)

• Control hazards occur less frequently than data hazards

• Jumps are very infrequent – only 3% of the instructions ina normal program


(binary executable)

Code reorder (”afterburner”)


C / C++, …

gcc –S …

Assemblycode

Reorganizer

ReorderedAssembly code

gas –o …(Assembler)

Object code

ld –o …

lw $t1, 0($t0) # blw $t2, 4($t0) # elw $t4, 8($t0) # fadd $t3, $t1, $t2 # b + esw $t3, 12($t0) # a add $t5, $t1, $t4 # b + fsw $t5, 16($t0) # c

lw $t1, 0($t0) # blw $t2, 4($t0) # eadd $t3, $t1, $t2 # b + esw $t3, 12($t0) # a lw $t4, 8($t0) # fadd $t5, $t1, $t4 # b + fsw $t5, 16($t0) # c

Code scheduling to avoid stalls

• Reorder code to avoid use of load result in the next instruction

• Example: a = b + e; c = b + f;


13 cycles 11 cycles

Stall

Stall

(reordered code)

ARM v7 core


32-bit

3-stage pipeline (F, D, E)

ARM Cortex CPU Core


32 / 64-bit

MCU with multiple ARM cores


Intel Core2 microarchitecture


64-bit

AMD Bulldozer Core


64-bit

PIC controller MCU


8-bit

von Neumannor Harvard?


AVR tiny85 MCU

von Neumannor Harvard?

8-bit

2-stage pipeline

PIC32MX795F512L MCU


32-bit

Nvidia Tegra CPU/GPU SoC


!!!

IBM Power8


Heavy usage of caches

64-bit

PowerPC G4 Altivec


64-bit

Cray-1 Vector processor


64-bit instr.512 bit vectors

Co-processors

Used to take load off main processors

•Floating Point Units (FPU)

•I/O-processors

•Crypto co-processors

•Graphical Processing Units (GPU)

Examples from the 8086 era:

•8087 FPU

•8089 I/O processor

•8288 Bus controller


Cryptographic Co-processor


Documents

Computer Architecture 1DT016-HT2017: The CPUxyx.se/1DT016/sessions/Session3-The-CPU.pdf · CISC – The early days Complex Instruction Set Computer •Primary memory was slow and