8/14/2019 This Section Investigates How a Typical CPU
EE 4504 Section 8 3
CPU organization
Recall the functions performed by the
CPU:
Fetch instructions
Fetch data
Process data
Write data
Organizational requirements that are derived from these functions:
ALU
Control logic
Temporary storage
Means to move data and instructions in and
around the CPU
Figure 11.1 External view of the CPU
Figure 11.2 Internal structure of the CPU
Register Organization
Registers form the highest level of the
memory hierarchy
Small set of high speed storage locations
Temporary storage for data and control
information
Two types of registers:
User-visible
May be referenced by assembly-level
instructions and are thus visible to the
user
Control and status registers
Used to control the operation of the CPU
Most are not visible to the user
User-visible Registers
General categories based on function
General purpose
Can be assigned a variety of functions
Ideally, they are defined orthogonally to the
operations within the instructions
Data
These registers only hold data
Address
These registers only hold address
information
Examples: general purpose address
registers, segment pointers, stack pointers,
index registers
Condition codes
Visible to the user but values set by the
CPU as the result of performing operations
Example code bits: zero, positive, overflow
Bit values are used as the basis for
conditional jump instructions
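As an illustration of how condition code bits get set, here is a minimal sketch of zero, negative, and overflow flags after an 8-bit two's-complement add. The flag names and width are illustrative, not any particular processor's flag logic:

```python
# Sketch: setting Z (zero), N (negative), and V (signed overflow)
# condition bits after an 8-bit two's-complement addition.
# Illustrative only -- real flag logic varies by processor.

def add_with_flags(a, b, width=8):
    mask = (1 << width) - 1
    sign = 1 << (width - 1)
    result = (a + b) & mask
    zero = result == 0
    negative = bool(result & sign)
    # Signed overflow: both operands share a sign bit that the result lacks.
    overflow = bool(~(a ^ b) & (a ^ result) & sign)
    return result, {"Z": zero, "N": negative, "V": overflow}

r, flags = add_with_flags(0x7F, 0x01)   # 127 + 1 overflows signed 8-bit
```

A conditional jump instruction would then test one of these stored bits rather than recomputing the comparison.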
Design trade off between general purpose
and specialized registers
General purpose registers maximize flexibility
in instruction design
Special purpose registers permit implicit
register specification in instructions -- reduces
register field size in an instruction
No clear best design approach
How many registers are enough?
More registers permit more operands to be held
within the CPU -- reducing memory bandwidth
requirements to some extent
More registers cause an increase in the field
sizes needed to specify registers in an
instruction word
Locality of reference may not support too many registers
Most machines use 8-32 registers (does not include RISC machines with register windowing -- will get to that later!)
How big (wide)?
Address registers should be wide enough to hold the longest address!
Data registers should be wide enough to hold
most data types
Would not want to use 64-bit registers if the
vast majority of data operations used 16 and
32-bit operands
Related to width of memory data bus
Concatenate registers together to store
longer formats
B-C registers in the 8085
AccA-AccB registers in the 68HC11
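The pairing idea can be sketched as simple shift-and-mask arithmetic; the 8085 forms a 16-bit value from its B and C registers in exactly this high/low arrangement (the values below are illustrative):

```python
# Sketch: concatenating two 8-bit registers into one 16-bit value,
# as with the 8085's B-C register pair.

def pair(high, low):
    """Combine two 8-bit register values into one 16-bit value."""
    return ((high & 0xFF) << 8) | (low & 0xFF)

def unpair(value):
    """Split a 16-bit value back into (high, low) 8-bit halves."""
    return (value >> 8) & 0xFF, value & 0xFF

bc = pair(0x12, 0x34)       # B = 0x12, C = 0x34 -> BC = 0x1234
```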
Control and status registers
These registers are used during the
fetching, decoding and execution of
instructions
Many are not visible to the user/programmer
Some are visible but can not be (easily)
modified
Typical registers
Program counter
Points to the next instruction to be executed
Instruction register
Contains the instruction being executed
Memory address register
Memory data/buffer register
Program status word(s)
Superset of condition code register
Interrupt masks, supervisory modes, etc.
Status information
Figure 11.3 Example register organizations
Figure 11.4 Extensions to 32-bit microprocessors
Instruction Cycle
Recall the instruction cycle from Chapter
3:
Fetch the instruction
Decode it
Fetch operands
Perform the operation
Store results
Recognize pending interrupts
Based on the addressing techniques from
Chapter 9, we can modify the state
diagram for the cycle to explicitly show
indirection in addressing
Flow of data and information between
registers during the instruction cycle varies
from processor to processor
Figure 11.7 More complete instruction cycle state diagram
Instruction pipelining
The instruction cycle state diagram clearly
shows the sequence of operations that take
place in order to execute a single
instruction
A good design goal of any system is to
have all of its components performing useful work all of the time -- high efficiency
Following the instruction cycle in a
sequential fashion does not permit this
level of efficiency
Compare the instruction cycle to an
automobile assembly line
Perform all tasks concurrently, but on different
(sequential) instructions
The result is temporal parallelism
Result is the instruction pipeline
An ideal pipeline divides a task into k
independent sequential subtasks
Each subtask requires 1 time unit to complete
The task itself then requires k time units to
complete
For n iterations of the task, the execution times will be:
With no pipelining: nk time units
With pipelining: k + (n-1) time units
Speedup of a k-stage pipeline is thus
S = nk / [k+(n-1)] ==> k (for large n)
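The speedup formula can be checked numerically; the 9-instruction, 6-stage case below matches Figure 11.12's 54 vs. 14 time units:

```python
# Numerical check of the pipeline speedup formula S = nk / (k + n - 1).

def pipeline_speedup(n, k):
    """Speedup of n task iterations through a k-stage pipeline
    (1 time unit per stage)."""
    return (n * k) / (k + n - 1)

s_small = pipeline_speedup(9, 6)           # 54 / 14, cf. Figure 11.12
s_large = pipeline_speedup(1_000_000, 6)   # approaches k = 6 for large n
```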
First step: instruction (pre)fetch
Divide the instruction cycle into two (equal??)
parts
I-fetch
Everything else (execution phase)
While one instruction is in execution, overlap
the prefetching of the next instruction
Assumes the memory bus will be idle at
some point during the execution phase
Reduces the time to fetch an instruction to
zero (ideal situation)
Problems
The two parts are not equal in size
Branching can negate the prefetching
As a result of the branch instruction, you
have prefetched the wrong
instruction
Alternative approaches
Finer division of the instruction cycle: use
a 6-stage pipeline
Instruction fetch
Decode opcode
Calculate operand address(es)
Fetch operands
Perform execution
Write (store) result
Use multiple execution functional units to
parallelize the actual execution phase of
several instructions
Use branching strategies to minimize
branch impact
Figure 11.12 Pipelined execution of 9 instructions
in 14 time units vs. 54
Figure 11.13 Impact of a branch after instruction 3
(to instruction 15)
Pipeline Limitations
Pipeline depth
If the speedup is based on the number of stages,
why not build lots of stages?
Each stage uses latches at its input (output) to
buffer the next set of inputs
If the stage granularity is reduced too much, the latches and their control become a significant hardware overhead
Also suffer a time overhead in the
propagation time through the latches
Limits the rate at which data can be
clocked through the pipeline
Logic to handle memory and register use and to
control the overall pipeline increases
significantly with increasing pipeline depth
Data dependencies also factor into the effective
length of pipelines
Data dependencies
Pipelining, as a form of parallelism, must ensure
that computed results are the same as if
computation was performed in strict sequential
order
With multiple stages, two instructions in execution in the pipeline may have data dependencies -- must design the pipeline to prevent this
Data dependencies limit when an
instruction can be input to the pipeline
Data dependency examples
A = B + C
D = E + A
C = G x H
A = D / H
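The first pair above (A = B + C followed by D = E + A) shows how a dependency delays pipeline entry. A minimal sketch, assuming the 6-stage pipeline from the earlier slide (fetch, decode, calculate operand address, fetch operands, execute, write result), one time unit per stage, and no operand forwarding -- all illustrative assumptions:

```python
# Sketch: a RAW (read-after-write) dependency delaying pipeline entry.

def operand_fetch_time(entry):
    return entry + 3        # "fetch operands" is the 4th stage

def result_write_time(entry):
    return entry + 5        # "write result" is the 6th stage

# A = B + C enters at t = 0. D = E + A reads A, so it cannot fetch
# its operands until the first instruction has written A back.
entry2 = 1                  # earliest entry slot with no dependency
while operand_fetch_time(entry2) <= result_write_time(0):
    entry2 += 1             # stall one time unit
# entry2 ends up at 3: two stall cycles versus the dependency-free case.
```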
Branching
For the pipeline to have the desired operational
speedup, we must feed it with long strings of
instructions
However, 15-20% of instructions in an assembly-level stream are (conditional) branches
Of these, 60-70% take the branch to a target address
Impact of the branch is that pipeline never
really operates at its full capacity -- limiting
the performance improvement that is
derived from the pipeline
The average time to complete a pipelined instruction becomes
Tave = (1 - pb)(1) + pb[pt(1 + b) + (1 - pt)(1)]
where pb is the probability an instruction is a branch, pt the probability a branch is taken, and b the branch penalty in time units
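Plugging in figures drawn from the percentages above (pb = 0.2, pt = 0.65) and an assumed penalty of b = 5 time units shows how much the branch costs on average:

```python
# Average pipelined instruction time from the Tave formula above.
# pb: probability an instruction is a branch; pt: probability a
# branch is taken; b: branch penalty in time units (b = 5 is an
# assumption for illustration).

def avg_instruction_time(pb, pt, b):
    return (1 - pb) * 1 + pb * (pt * (1 + b) + (1 - pt) * 1)

t = avg_instruction_time(0.2, 0.65, 5)
# Algebraically the formula reduces to 1 + pb*pt*b: the penalty b is
# paid with effective probability pe = pb*pt.
```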
A number of techniques can be used to
minimize the impact of the branch instruction
(the branch penalty)
Loss of performance resulting from conditional branches [Lil88]
pe = pbpt
Multiple streams
Replicate the initial portions of the pipeline
and fetch both possible next instructions
Increases chance of memory contention
Must support multiple streams for each
instruction in the pipeline
Prefetch branch target
When the branch instruction is decoded,
begin to fetch the branch target instruction
and place in a second prefetch buffer
If the branch is not taken, the sequential
instructions are already in the pipe -- no
loss of performance
If the branch is taken, the next instruction
has been prefetched and results in minimal
branch penalty (don't have to incur a
memory read operation at the end of the
branch to fetch the instruction)
Look ahead, look behind buffer (loop buffer)
Many conditional branch operations are used for loop control
Expand prefetch buffer so as to buffer the
last few instructions executed in addition to
the ones that are waiting to be executed
If the buffer is big enough, the entire loop can be held in it -- reducing the branch penalty
[Diagram: loop buffer holding previously executed instructions behind the PC and pending instructions ahead of it]
Branch prediction
Make a good guess as to which instruction
will be executed next and start that one
down the pipeline
If the guess turns out to be right, no loss of
performance in the pipeline
If the guess was wrong, empty the pipeline and restart with the correct instruction -- suffering the full branch penalty
Static guesses: make the guess without
considering the runtime history of the
program
Branch never taken
Branch always taken
Predict based on the opcode
Dynamic guesses: track the history of
conditional branches in the program
Taken / not taken switch
History table
Figure 11.16 Branch prediction using 2 history bits
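One common realization of 2 history bits is a saturating counter; the exact state machine in Figure 11.16 may differ in detail, but the key behavior -- two consecutive mispredictions are needed to flip the prediction -- is the same. A minimal sketch:

```python
# Sketch: 2-bit branch history kept as a saturating counter (0-3).
# States 2 and 3 predict "taken", 0 and 1 predict "not taken".

class TwoBitPredictor:
    def __init__(self, state=2):       # start in "weakly taken"
        self.state = state

    def predict(self):
        return self.state >= 2         # True means "predict taken"

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

# A loop branch taken on most iterations: the single "not taken"
# outcome causes one misprediction but does not flip the predictor.
p = TwoBitPredictor()
correct = 0
for taken in [True, True, False, True]:
    if p.predict() == taken:
        correct += 1
    p.update(taken)
```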
Superscalar
Implement the CPU such that more than one
instruction can be performed (completed) at a
time
Involves replication of some or all parts of the
CPU/ALU
Examples:
Fetch multiple instructions at the same time
Decode multiple instructions at the same
time
Perform add and multiply at the same time
Perform load/stores while performing ALU
operation
Degree of parallelism and hence the speedup of
the machine goes up as more instructions are
executed in parallel
Figure 13.1 Comparison of superscalar and superpipeline
operation to regular pipelines
Superscalar design limitations
Data dependencies: must ensure computed
results are the same as would be computed
on a strictly sequential machine
Two instructions can not be executed in parallel
if the (data) output of one is the input of the
other or if they both write to the same output
location
Consider:
S1: A = B + C
S2: D = A + 1
S3: B = E + F
S4: A = E + 3
Resource dependencies
In the above sequence of instructions, the adder
unit gets a real workout!
Parallelism is limited by the number of adders in the ALU
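The S1-S4 conflicts can be enumerated mechanically. A sketch, with each instruction reduced to a destination and a set of source registers (names mirror the slide; the dependency terminology is the standard RAW/WAR/WAW classification):

```python
# Sketch: classifying dependencies that prevent two instructions
# (i before j in program order) from executing in parallel.

def conflict(i, j):
    (d1, s1), (d2, s2) = i, j
    kinds = []
    if d1 in s2:
        kinds.append("true (RAW)")     # j reads what i writes
    if d2 in s1:
        kinds.append("anti (WAR)")     # j overwrites what i reads
    if d1 == d2:
        kinds.append("output (WAW)")   # both write the same location
    return kinds

S1 = ("A", {"B", "C"})   # A = B + C
S2 = ("D", {"A"})        # D = A + 1
S3 = ("B", {"E", "F"})   # B = E + F
S4 = ("A", {"E"})        # A = E + 3
# S2 truly depends on S1; S3 antidepends on S1; S4 output-depends on S1.
```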
Instruction issue policy: in what order are
instructions issued to the execution unit
and in what order do they finish?
In-order issue, in-order completion
Simplest method, but severely limits
performance
Strict ordering of instructions: data and
procedural dependencies or resource
conflicts delay all subsequent instructions
Slow execution of some instructions
delay all subsequent instructions
In-order issue, out-of-order completion
Any number of instructions can be executed
at a time
Instruction issue is still limited by resource conflicts or data and procedural dependencies
Output dependencies resulting from out-of-order completion must be resolved
Instruction interrupts can be tricky
Out-of-order issue, out-of-order completion
Decode and execute stages are decoupled
via an instruction buffer window
Decoded instructions are stored in the
window awaiting execution
Functional units will take instructions from
the window in an attempt to stay busy
This can result in out-of-order execution
S1: A = B + C
S2: D = E + 1
S3: G = E + F
S4: H = E * 3
Antidependence class of data
dependencies must be dealt with
Register renaming
Output dependencies and antidependencies are
eliminated by the use of a register pool as
follows
For each instruction that writes to a register X, a new instance of register X is allocated
Multiple instances of register X can co-exist
Consider
S1: R3 = R3 + R5
S2: R4 = R3 + 1
S3: R3 = R5 + 1
S4: R7 = R3 + R4
becomes
S1: R3b = R3a + R5a
S2: R4b = R3b + 1
S3: R3c = R5a + 1
S4: R7b = R3c + R4b
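The renaming above can be reproduced with a small sketch: each write to a register allocates the next instance letter, each read uses the latest instance (the suffix convention simply mirrors the example, with 'a' denoting the value that existed before the sequence):

```python
# Sketch: register renaming with a fresh instance per write.

import string

def rename(program):
    version = {}                                # reg -> current suffix index
    def read(reg):
        version.setdefault(reg, 0)              # pre-existing value: 'a'
        return reg + string.ascii_lowercase[version[reg]]
    def write(reg):
        version[reg] = version.get(reg, 0) + 1  # allocate a new instance
        return reg + string.ascii_lowercase[version[reg]]
    renamed = []
    for dest, sources in program:
        new_sources = [read(r) for r in sources]   # read before write!
        renamed.append((write(dest), new_sources))
    return renamed

prog = [("R3", ["R3", "R5"]),   # S1: R3 = R3 + R5
        ("R4", ["R3"]),         # S2: R4 = R3 + 1
        ("R3", ["R5"]),         # S3: R3 = R5 + 1
        ("R7", ["R3", "R4"])]   # S4: R7 = R3 + R4
```

After renaming, S3 no longer antidepends on S2 and no longer output-depends on S1, so it can execute as soon as R5a is available.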
Impact on machine parallelism
Adding (ALU) functional units without register
renaming support may not be cost-effective
Performance is limited by data
dependencies
Out-of-order issue benefits from large
instruction buffer windows
Easier for a functional unit to find a
pending instruction
Summary
In this section, we have focused on the
operation of the CPU
Registers and their use
Instruction execution
Investigated the implementation of modern CPUs
Pipelining
Basic concepts
Limitations to performance
Superpipelining
Superscalar