EE 4504 Section 8

CPU organization

Recall the functions performed by the CPU:
- Fetch instructions
- Fetch data
- Process data
- Write data

Organizational requirements derived from these functions:
- ALU
- Control logic
- Temporary storage
- Means to move data and instructions in and around the CPU


    Figure 11.1 External view of the CPU


    Figure 11.2 Internal structure of the CPU


Register Organization

Registers form the highest level of the memory hierarchy:
- A small set of high-speed storage locations
- Temporary storage for data and control information

Two types of registers:
- User-visible: may be referenced by assembly-level instructions and are thus visible to the user
- Control and status registers: used to control the operation of the CPU; most are not visible to the user


User-visible Registers

General categories based on function:

General purpose
- Can be assigned a variety of functions
- Ideally, they are defined orthogonally to the operations within the instructions

Data
- These registers only hold data

Address
- These registers only hold address information
- Examples: general-purpose address registers, segment pointers, stack pointers, index registers

Condition codes
- Visible to the user, but values are set by the CPU as the result of performing operations
- Example code bits: zero, positive, overflow
- Bit values are used as the basis for conditional jump instructions (a small sketch of how such bits are set follows)
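A minimal Python sketch of how an ALU might set the flags named above after an add (the 8-bit width and flag names are our assumptions for illustration):

    def add8_flags(a, b):
        r = (a + b) & 0xFF                        # 8-bit result
        zero     = (r == 0)                       # Z: result is all zeros
        positive = (r & 0x80) == 0 and r != 0     # P: sign bit clear, nonzero
        # V: signed overflow -- operands share a sign the result doesn't have
        overflow = bool(~(a ^ b) & (a ^ r) & 0x80)
        return r, {'Z': zero, 'P': positive, 'V': overflow}

    # 0x7F + 0x01 = 0x80: positive + positive yields a negative bit pattern,
    # so V is set -- a subsequent "jump on overflow" would be taken.
    print(add8_flags(0x7F, 0x01))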


Design trade-off between general-purpose and specialized registers:
- General-purpose registers maximize flexibility in instruction design
- Special-purpose registers permit implicit register specification in instructions, reducing the register field size in an instruction
- No clear best design approach

How many registers are enough?
- More registers permit more operands to be held within the CPU, reducing memory bandwidth requirements to some extent
- More registers cause an increase in the field sizes needed to specify registers in an instruction word (see the sizing sketch after this list)
- Locality of reference may not support too many registers
- Most machines use 8-32 registers (this does not include RISC machines with register windowing -- we will get to that later!)
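A quick sizing sketch of the field-size point (our numbers, not the slide's): with 32 registers, each register specifier needs

    log2(32) = 5 bits

so a three-register instruction spends 3 x 5 = 15 bits of its fixed-width instruction word on register fields alone; doubling the file to 64 registers raises that to 3 x 6 = 18 bits.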


How big (wide)?

Address registers should be wide enough to hold the longest address!

Data registers should be wide enough to hold most data types:
- Would not want to use 64-bit registers if the vast majority of data operations used 16- and 32-bit operands
- Related to the width of the memory data bus

Concatenate registers together to store longer formats (a pairing sketch follows):
- B-C registers in the 8085
- AccA-AccB registers in the 68HC11
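A minimal Python sketch of such register pairing, assuming the common convention that the first register of the pair holds the high-order byte (true of the 8085's B-C pair; the helper names are ours):

    def pair(high8, low8):
        # Two 8-bit registers viewed as one 16-bit register
        return ((high8 & 0xFF) << 8) | (low8 & 0xFF)

    def unpair(wide16):
        # Split the 16-bit value back into its two 8-bit halves
        return (wide16 >> 8) & 0xFF, wide16 & 0xFF

    assert pair(0x12, 0x34) == 0x1234
    assert unpair(0x1234) == (0x12, 0x34)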


Control and status registers

These registers are used during the fetching, decoding, and execution of instructions:
- Many are not visible to the user/programmer
- Some are visible but cannot be (easily) modified

Typical registers:
- Program counter: points to the next instruction to be executed
- Instruction register: contains the instruction being executed
- Memory address register
- Memory data/buffer register
- Program status word(s)
  - Superset of the condition code register
  - Interrupt masks, supervisory modes, etc.
  - Status information
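A minimal Python sketch of how these registers cooperate during an instruction fetch (the classic textbook micro-sequence; modeling memory as a list of words is our simplification):

    def fetch(cpu, memory):
        cpu['MAR'] = cpu['PC']            # address goes out on the address bus
        cpu['MBR'] = memory[cpu['MAR']]   # memory read lands in the buffer register
        cpu['PC'] += 1                    # PC now points to the next instruction
        cpu['IR'] = cpu['MBR']            # instruction is ready for decoding

    cpu = {'PC': 0, 'MAR': 0, 'MBR': 0, 'IR': 0}
    fetch(cpu, [0x1234, 0x5678])
    assert cpu['IR'] == 0x1234 and cpu['PC'] == 1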


    Figure 11.3 Example register organizations


Figure 11.4 Extensions to 32-bit microprocessors


Instruction Cycle

Recall the instruction cycle from Chapter 3:
- Fetch the instruction
- Decode it
- Fetch operands
- Perform the operation
- Store results
- Recognize pending interrupts

Based on the addressing techniques from Chapter 9, we can modify the state diagram for the cycle to explicitly show indirection in addressing.

The flow of data and information between registers during the instruction cycle varies from processor to processor.


    Figure 11.7 More complete instruction cycle state diagram


Instruction pipelining

The instruction cycle state diagram clearly shows the sequence of operations that take place in order to execute a single instruction.

A good design goal of any system is to have all of its components performing useful work all of the time -- high efficiency:
- Following the instruction cycle in a sequential fashion does not permit this level of efficiency

Compare the instruction cycle to an automobile assembly line:
- Perform all tasks concurrently, but on different (sequential) instructions
- The result is temporal parallelism: the instruction pipeline


An ideal pipeline divides a task into k independent sequential subtasks:
- Each subtask requires 1 time unit to complete
- The task itself then requires k time units to complete

For n iterations of the task, the execution times will be:
- With no pipelining: nk time units
- With pipelining: k + (n-1) time units

The speedup of a k-stage pipeline is thus

    S = nk / [k + (n-1)]  -->  k   (for large n)
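A short Python check of these formulas (the function names are ours); the k = 6, n = 9 case reproduces the "14 time units vs. 54" of Figure 11.12 below:

    def sequential_time(k, n):
        return n * k                  # nk: each task runs start to finish

    def pipelined_time(k, n):
        return k + (n - 1)            # fill the pipe once, then 1 result per unit

    def speedup(k, n):
        return sequential_time(k, n) / pipelined_time(k, n)

    assert sequential_time(6, 9) == 54 and pipelined_time(6, 9) == 14
    print(speedup(6, 9))       # ~3.86
    print(speedup(6, 10**6))   # -> 6, i.e. approaches k for large n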


First step: instruction (pre)fetch

Divide the instruction cycle into two (equal??) parts:
- I-fetch
- Everything else (the execution phase)

While one instruction is in execution, overlap the prefetching of the next instruction:
- Assumes the memory bus will be idle at some point during the execution phase
- Reduces the time to fetch an instruction to zero (the ideal situation)

Problems:
- The two parts are not equal in size
- Branching can negate the prefetching: as a result of the branch instruction, you have prefetched the wrong instruction


Alternative approaches

Finer division of the instruction cycle: use a 6-stage pipeline:
- Instruction fetch
- Decode opcode
- Calculate operand address(es)
- Fetch operands
- Perform execution
- Write (store) result

Use multiple execution functional units to parallelize the actual execution phase of several instructions.

Use branching strategies to minimize the branch impact.


Figure 11.12 Pipelined execution of 9 instructions in 14 time units vs. 54


Figure 11.13 Impact of a branch after instruction 3 (to instruction 15)


Pipeline Limitations

Pipeline depth
- If the speedup is based on the number of stages, why not build lots of stages?
- Each stage uses latches at its input (output) to buffer the next set of inputs
  - If the stage granularity is reduced too much, the latches and their control become a significant hardware overhead
  - There is also a time overhead in the propagation time through the latches, which limits the rate at which data can be clocked through the pipeline (a rough model follows after this list)
- Logic to handle memory and register use and to control the overall pipeline increases significantly with increasing pipeline depth
- Data dependencies also factor into the effective length of pipelines
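One rough way to quantify the latch overhead (our notation; the slide gives no formula): if the total logic delay of the un-pipelined task is T and each latch adds a delay d, a k-stage pipeline has cycle time

    t = T/k + d

and the n-task speedup over the un-pipelined machine becomes

    S = nT / [(k + n - 1)(T/k + d)]  -->  k / (1 + kd/T)   (for large n)

which is capped at T/d rather than growing without bound: once T/k shrinks toward d, adding stages buys almost nothing.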


Data dependencies

Pipelining, as a form of parallelism, must ensure that computed results are the same as if the computation were performed in strict sequential order.

With multiple stages, two instructions in execution in the pipeline may have data dependencies -- the pipeline must be designed to prevent this:
- Data dependencies limit when an instruction can be input to the pipeline

Data dependency examples (classified in the sketch below):

    A = B + C
    D = E + A

    C = G x H
    A = D / H
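A hedged reading of the four statements above, using the standard hazard names (the labels are ours; the slide does not name them):

    B, E, G, H = 2, 3, 4, 5   # initial values, just so the snippet runs
    A = B + C                 # (C assumed defined earlier)
    D = E + A    # read-after-write (true dependence): needs the new A, so it
                 #   cannot enter the pipeline until A has been computed
    C = G * H    # write-after-read (antidependence): must not overwrite C
                 #   before "A = B + C" has read it
    A = D / H    # write-after-write (output dependence) against "A = B + C",
                 #   plus read-after-write on D from "D = E + A"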


Branching

For the pipeline to have the desired operational speedup, we must feed it with long strings of instructions.

However, 15-20% of the instructions in an assembly-level stream are (conditional) branches; of these, 60-70% take the branch to a target address.

The impact of branches is that the pipeline never really operates at its full capacity, limiting the performance improvement that is derived from the pipeline.

The average time to complete a pipelined instruction becomes

    Tave = (1 - pb)(1) + pb[pt(1 + b) + (1 - pt)(1)]

where pb is the probability that an instruction is a branch, pt the probability that a branch is taken, and b the branch penalty in time units (a worked example follows below).

A number of techniques can be used to minimize the impact of the branch instruction (the branch penalty).


Loss of performance resulting from conditional branches [Lil88]:

    pe = pb * pt
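A worked instance of the Tave formula using the slide's own statistics and an assumed penalty (pb = 0.2 and pt = 0.65 sit inside the 15-20% and 60-70% ranges above; the penalty b = 4 time units is our assumption):

    Tave = (1 - 0.2)(1) + 0.2[0.65(1 + 4) + 0.35(1)]
         = 0.8 + 0.2(3.25 + 0.35)
         = 1.52 time units

so on this machine, branches alone stretch the average instruction time by about 52%.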


Multiple streams
- Replicate the initial portions of the pipeline and fetch both possible next instructions
- Increases the chance of memory contention
- Must support multiple streams for each branch instruction in the pipeline

Prefetch branch target
- When the branch instruction is decoded, begin to fetch the branch target instruction and place it in a second prefetch buffer
- If the branch is not taken, the sequential instructions are already in the pipe -- no loss of performance
- If the branch is taken, the next instruction has been prefetched, resulting in minimal branch penalty (we don't have to incur a memory read operation at the end of the branch to fetch the instruction)


Look-ahead, look-behind buffer (loop buffer)
- Many conditional branch operations are used for loop control
- Expand the prefetch buffer so as to buffer the last few instructions executed in addition to the ones that are waiting to be executed
- If the buffer is big enough, the entire loop can be held in it, reducing the branch penalty

[Diagram: loop buffer, with the PC pointing between the previous (already executed) instructions and the pending instructions]


Branch prediction

Make a good guess as to which instruction will be executed next and start that one down the pipeline:
- If the guess turns out to be right, there is no loss of performance in the pipeline
- If the guess was wrong, empty the pipeline and restart with the correct instruction, suffering the full branch penalty

Static guesses: make the guess without considering the runtime history of the program
- Branch never taken
- Branch always taken
- Predict based on the opcode

Dynamic guesses: track the history of conditional branches in the program
- Taken / not taken switch
- History table


    Figure 11.16 Branch prediction using 2 history bits
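A minimal Python sketch of the two-history-bit scheme of Figure 11.16, expressed as a saturating counter per branch (the 0-3 encoding and the class interface are our conventions, not necessarily the figure's exact state labels):

    class TwoBitPredictor:
        """States 0-1 predict 'not taken'; states 2-3 predict 'taken'.
        One surprise only nudges the counter; it takes two in a row to
        flip the prediction -- the point of keeping two history bits."""
        def __init__(self):
            self.state = {}                         # branch address -> 0..3

        def predict(self, addr):
            return self.state.get(addr, 0) >= 2     # True means "taken"

        def update(self, addr, taken):
            c = self.state.get(addr, 0)
            self.state[addr] = min(c + 1, 3) if taken else max(c - 1, 0)

    # A loop branch taken 9 times, then falling through on exit:
    p = TwoBitPredictor()
    correct = 0
    for taken in [True] * 9 + [False]:
        correct += (p.predict(0x40) == taken)
        p.update(0x40, taken)
    print(correct, "of 10 correct")   # 7: two cold-start misses + the loop exit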


Superscalar

Implement the CPU such that more than one instruction can be performed (completed) at a time.

Involves replication of some or all parts of the CPU/ALU. Examples:
- Fetch multiple instructions at the same time
- Decode multiple instructions at the same time
- Perform an add and a multiply at the same time
- Perform loads/stores while performing an ALU operation

The degree of parallelism, and hence the speedup of the machine, goes up as more instructions are executed in parallel.


Figure 13.1 Comparison of superscalar and superpipeline operation to regular pipelines


Superscalar design limitations

Data dependencies: must ensure that computed results are the same as would be computed on a strictly sequential machine.
- Two instructions cannot be executed in parallel if the (data) output of one is the input of the other, or if they both write to the same output location
- Consider (annotated in the sketch after this list):

    S1: A = B + C
    S2: D = A + 1
    S3: B = E + F
    S4: A = E + 3

Resource dependencies
- In the above sequence of instructions, the adder unit gets a real workout!
- Parallelism is limited by the number of adders in the ALU
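Annotating S1-S4 above with the standard dependence names (the labels are ours; the slide does not name them):

    S1: A = B + C
    S2: D = A + 1    -- read-after-write on A from S1 (true dependence):
                        S2 cannot issue alongside S1
    S3: B = E + F    -- write-after-read on B against S1 (antidependence):
                        S3 must not write B before S1 has read it
    S4: A = E + 3    -- write-after-write on A against S1 (output dependence):
                        S1 and S4 must complete in program order or A is left
                        holding the older value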


Instruction issue policy: in what order are instructions issued to the execution unit, and in what order do they finish?

In-order issue, in-order completion
- Simplest method, but severely limits performance
- Strict ordering of instructions: data and procedural dependencies or resource conflicts delay all subsequent instructions
- Slow execution of some instructions delays all subsequent instructions

In-order issue, out-of-order completion
- Any number of instructions can be in execution at a time
- Instruction issue is still limited by resource conflicts or data and procedural dependencies
- Output dependencies resulting from out-of-order completion must be resolved
- Instruction interrupts can be tricky


Out-of-order issue, out-of-order completion
- Decode and execute stages are decoupled via an instruction buffer window
- Decoded instructions are stored in the window awaiting execution
- Functional units take instructions from the window in an attempt to stay busy; this can result in out-of-order execution:

    S1: A = B + C
    S2: D = E + 1
    S3: G = E + F
    S4: H = E * 3

- The antidependence class of data dependencies must be dealt with


Register renaming

Output dependencies and antidependencies are eliminated by the use of a register pool, as follows:
- For each instruction that writes to a register X, a new register X is instantiated
- Multiple register Xs can co-exist

Consider

    S1: R3 = R3 + R5
    S2: R4 = R3 + 1
    S3: R3 = R5 + 1
    S4: R7 = R3 + R4

which becomes (a Python sketch of this walk follows)

    S1: R3b = R3a + R5a
    S2: R4b = R3b + 1
    S3: R3c = R5a + 1
    S4: R7b = R3c + R4b
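A minimal Python sketch of the renaming walk above (the suffixes a, b, c stand in for physical registers drawn from the pool; the helper names are ours):

    def rename(instrs):
        # instrs: (dest, op, src1, src2); operands that don't start with 'R'
        # (literals such as '1') pass through unchanged.
        suffix = {}                              # architectural reg -> version
        def read(x):
            return x + suffix.get(x, 'a') if x.startswith('R') else x
        renamed = []
        for dest, op, s1, s2 in instrs:
            srcs = (read(s1), read(s2))          # read the current versions first
            suffix[dest] = chr(ord(suffix.get(dest, 'a')) + 1)  # fresh copy for the write
            renamed.append(f"{dest}{suffix[dest]} = {srcs[0]} {op} {srcs[1]}")
        return renamed

    prog = [("R3", "+", "R3", "R5"), ("R4", "+", "R3", "1"),
            ("R3", "+", "R5", "1"), ("R7", "+", "R3", "R4")]
    print(*rename(prog), sep="\n")
    # R3b = R3a + R5a
    # R4b = R3b + 1
    # R3c = R5a + 1
    # R7b = R3c + R4b

With R3 split into R3b and R3c, S3's output dependence on S1 and its antidependence on S2 both disappear; only the true (read-after-write) dependences remain.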


Impact on machine parallelism
- Adding (ALU) functional units without register renaming support may not be cost-effective: performance is limited by data dependencies
- Out-of-order issue benefits from large instruction buffer windows: it is easier for a functional unit to find a pending instruction


Summary

In this section, we have focused on the operation of the CPU:
- Registers and their use
- Instruction execution

We also investigated the implementation of modern CPUs:
- Pipelining
  - Basic concepts
  - Limitations to performance
- Superpipelining
- Superscalar