EE 4504 Section 8

CPU organization

Recall the functions performed by the CPU:
- Fetch instructions
- Fetch data
- Process data
- Write data

Organizational requirements derived from these functions:
- ALU
- Control logic
- Temporary storage
- Means to move data and instructions in and around the CPU


    Figure 11.1 External view of the CPU


    Figure 11.2 Internal structure of the CPU


Register Organization

Registers form the highest level of the memory hierarchy:
- A small set of high-speed storage locations
- Temporary storage for data and control information

Two types of registers:
- User-visible: may be referenced by assembly-level instructions and are thus visible to the user
- Control and status registers: used to control the operation of the CPU; most are not visible to the user


User-visible Registers

General categories based on function:

General purpose
- Can be assigned a variety of functions
- Ideally, they are defined orthogonally to the operations within the instructions

Data
- These registers only hold data

Address
- These registers only hold address information
- Examples: general-purpose address registers, segment pointers, stack pointers, index registers

Condition codes
- Visible to the user, but values are set by the CPU as the result of performing operations
- Example code bits: zero, positive, overflow
- Bit values are used as the basis for conditional jump instructions (a small sketch of how such bits are set follows)
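A minimal Python sketch of how an ALU might set the flags named above after an add (the 8-bit width and flag names are our assumptions for illustration):

    def add8_flags(a, b):
        r = (a + b) & 0xFF                        # 8-bit result
        zero     = (r == 0)                       # Z: result is all zeros
        positive = (r & 0x80) == 0 and r != 0     # P: sign bit clear, nonzero
        # V: signed overflow -- operands share a sign the result doesn't have
        overflow = bool(~(a ^ b) & (a ^ r) & 0x80)
        return r, {'Z': zero, 'P': positive, 'V': overflow}

    # 0x7F + 0x01 = 0x80: positive + positive yields a negative bit pattern,
    # so V is set -- a subsequent "jump on overflow" would be taken.
    print(add8_flags(0x7F, 0x01))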


Design trade-off between general-purpose and specialized registers:
- General-purpose registers maximize flexibility in instruction design
- Special-purpose registers permit implicit register specification in instructions, reducing the register field size in an instruction
- No clear best design approach

How many registers are enough?
- More registers permit more operands to be held within the CPU, reducing memory bandwidth requirements to some extent
- More registers cause an increase in the field sizes needed to specify registers in an instruction word (see the sizing sketch after this list)
- Locality of reference may not support too many registers
- Most machines use 8-32 registers (this does not include RISC machines with register windowing -- we will get to that later!)
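A quick sizing sketch of the field-size point (our numbers, not the slide's): with 32 registers, each register specifier needs

    log2(32) = 5 bits

so a three-register instruction spends 3 x 5 = 15 bits of its fixed-width instruction word on register fields alone; doubling the file to 64 registers raises that to 3 x 6 = 18 bits.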


How big (wide)?

Address registers should be wide enough to hold the longest address!

Data registers should be wide enough to hold most data types:
- Would not want to use 64-bit registers if the vast majority of data operations used 16- and 32-bit operands
- Related to the width of the memory data bus

Concatenate registers together to store longer formats (a pairing sketch follows):
- B-C registers in the 8085
- AccA-AccB registers in the 68HC11
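A minimal Python sketch of such register pairing, assuming the common convention that the first register of the pair holds the high-order byte (true of the 8085's B-C pair; the helper names are ours):

    def pair(high8, low8):
        # Two 8-bit registers viewed as one 16-bit register
        return ((high8 & 0xFF) << 8) | (low8 & 0xFF)

    def unpair(wide16):
        # Split the 16-bit value back into its two 8-bit halves
        return (wide16 >> 8) & 0xFF, wide16 & 0xFF

    assert pair(0x12, 0x34) == 0x1234
    assert unpair(0x1234) == (0x12, 0x34)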


Control and status registers

These registers are used during the fetching, decoding, and execution of instructions:
- Many are not visible to the user/programmer
- Some are visible but cannot be (easily) modified

Typical registers:
- Program counter: points to the next instruction to be executed
- Instruction register: contains the instruction being executed
- Memory address register
- Memory data/buffer register
- Program status word(s)
  - Superset of the condition code register
  - Interrupt masks, supervisory modes, etc.
  - Status information
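A minimal Python sketch of how these registers cooperate during an instruction fetch (the classic textbook micro-sequence; modeling memory as a list of words is our simplification):

    def fetch(cpu, memory):
        cpu['MAR'] = cpu['PC']            # address goes out on the address bus
        cpu['MBR'] = memory[cpu['MAR']]   # memory read lands in the buffer register
        cpu['PC'] += 1                    # PC now points to the next instruction
        cpu['IR'] = cpu['MBR']            # instruction is ready for decoding

    cpu = {'PC': 0, 'MAR': 0, 'MBR': 0, 'IR': 0}
    fetch(cpu, [0x1234, 0x5678])
    assert cpu['IR'] == 0x1234 and cpu['PC'] == 1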


    Figure 11.3 Example register organizations


Figure 11.4 Extensions to 32-bit microprocessors


Instruction Cycle

Recall the instruction cycle from Chapter 3:
- Fetch the instruction
- Decode it
- Fetch operands
- Perform the operation
- Store results
- Recognize pending interrupts

Based on the addressing techniques from Chapter 9, we can modify the state diagram for the cycle to explicitly show indirection in addressing.

The flow of data and information between registers during the instruction cycle varies from processor to processor.


    Figure 11.7 More complete instruction cycle state diagram


Instruction pipelining

The instruction cycle state diagram clearly shows the sequence of operations that take place in order to execute a single instruction.

A good design goal of any system is to have all of its components performing useful work all of the time -- high efficiency:
- Following the instruction cycle in a sequential fashion does not permit this level of efficiency

Compare the instruction cycle to an automobile assembly line:
- Perform all tasks concurrently, but on different (sequential) instructions
- The result is temporal parallelism: the instruction pipeline


An ideal pipeline divides a task into k independent sequential subtasks:
- Each subtask requires 1 time unit to complete
- The task itself then requires k time units to complete

For n iterations of the task, the execution times will be:
- With no pipelining: nk time units
- With pipelining: k + (n-1) time units

The speedup of a k-stage pipeline is thus

    S = nk / [k + (n-1)]  -->  k   (for large n)
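A short Python check of these formulas (the function names are ours); the k = 6, n = 9 case reproduces the "14 time units vs. 54" of Figure 11.12 below:

    def sequential_time(k, n):
        return n * k                  # nk: each task runs start to finish

    def pipelined_time(k, n):
        return k + (n - 1)            # fill the pipe once, then 1 result per unit

    def speedup(k, n):
        return sequential_time(k, n) / pipelined_time(k, n)

    assert sequential_time(6, 9) == 54 and pipelined_time(6, 9) == 14
    print(speedup(6, 9))       # ~3.86
    print(speedup(6, 10**6))   # -> 6, i.e. approaches k for large n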


First step: instruction (pre)fetch

Divide the instruction cycle into two (equal??) parts:
- I-fetch
- Everything else (the execution phase)

While one instruction is in execution, overlap the prefetching of the next instruction:
- Assumes the memory bus will be idle at some point during the execution phase
- Reduces the time to fetch an instruction to zero (the ideal situation)

Problems:
- The two parts are not equal in size
- Branching can negate the prefetching: as a result of the branch instruction, you have prefetched the wrong instruction


Alternative approaches

Finer division of the instruction cycle: use a 6-stage pipeline:
- Instruction fetch
- Decode opcode
- Calculate operand address(es)
- Fetch operands
- Perform execution
- Write (store) result

Use multiple execution functional units to parallelize the actual execution phase of several instructions.

Use branching strategies to minimize the branch impact.


Figure 11.12 Pipelined execution of 9 instructions in 14 time units vs. 54


Figure 11.13 Impact of a branch after instruction 3 (to instruction 15)


Pipeline Limitations

Pipeline depth
- If the speedup is based on the number of stages, why not build lots of stages?
- Each stage uses latches at its input (output) to buffer the next set of inputs
  - If the stage granularity is reduced too much, the latches and their control become a significant hardware overhead
  - There is also a time overhead in the propagation time through the latches, which limits the rate at which data can be clocked through the pipeline (a rough model follows after this list)
- Logic to handle memory and register use and to control the overall pipeline increases significantly with increasing pipeline depth
- Data dependencies also factor into the effective length of pipelines
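One rough way to quantify the latch overhead (our notation; the slide gives no formula): if the total logic delay of the un-pipelined task is T and each latch adds a delay d, a k-stage pipeline has cycle time

    t = T/k + d

and the n-task speedup over the un-pipelined machine becomes

    S = nT / [(k + n - 1)(T/k + d)]  -->  k / (1 + kd/T)   (for large n)

which is capped at T/d rather than growing without bound: once T/k shrinks toward d, adding stages buys almost nothing.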


Data dependencies

Pipelining, as a form of parallelism, must ensure that computed results are the same as if the computation were performed in strict sequential order.

With multiple stages, two instructions in execution in the pipeline may have data dependencies -- the pipeline must be designed to prevent this:
- Data dependencies limit when an instruction can be input to the pipeline

Data dependency examples (classified in the sketch below):

    A = B + C
    D = E + A

    C = G x H
    A = D / H
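A hedged reading of the four statements above, using the standard hazard names (the labels are ours; the slide does not name them):

    B, E, G, H = 2, 3, 4, 5   # initial values, just so the snippet runs
    A = B + C                 # (C assumed defined earlier)
    D = E + A    # read-after-write (true dependence): needs the new A, so it
                 #   cannot enter the pipeline until A has been computed
    C = G * H    # write-after-read (antidependence): must not overwrite C
                 #   before "A = B + C" has read it
    A = D / H    # write-after-write (output dependence) against "A = B + C",
                 #   plus read-after-write on D from "D = E + A"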


Branching

For the pipeline to have the desired operational speedup, we must feed it with long strings of instructions.

However, 15-20% of the instructions in an assembly-level stream are (conditional) branches; of these, 60-70% take the branch to a target address.

The impact of branches is that the pipeline never really operates at its full capacity, limiting the performance improvement that is derived from the pipeline.

The average time to complete a pipelined instruction becomes

    Tave = (1 - pb)(1) + pb[pt(1 + b) + (1 - pt)(1)]

where pb is the probability that an instruction is a branch, pt the probability that a branch is taken, and b the branch penalty in time units (a worked example follows below).

A number of techniques can be used to minimize the impact of the branch instruction (the branch penalty).


Loss of performance resulting from conditional branches [Lil88]:

    pe = pb * pt
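A worked instance of the Tave formula using the slide's own statistics and an assumed penalty (pb = 0.2 and pt = 0.65 sit inside the 15-20% and 60-70% ranges above; the penalty b = 4 time units is our assumption):

    Tave = (1 - 0.2)(1) + 0.2[0.65(1 + 4) + 0.35(1)]
         = 0.8 + 0.2(3.25 + 0.35)
         = 1.52 time units

so on this machine, branches alone stretch the average instruction time by about 52%.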


Multiple streams
- Replicate the initial portions of the pipeline and fetch both possible next instructions
- Increases the chance of memory contention
- Must support multiple streams for each branch instruction in the pipeline

Prefetch branch target
- When the branch instruction is decoded, begin to fetch the branch target instruction and place it in a second prefetch buffer
- If the branch is not taken, the sequential instructions are already in the pipe -- no loss of performance
- If the branch is taken, the next instruction has been prefetched, resulting in minimal branch penalty (we don't have to incur a memory read operation at the end of the branch to fetch the instruction)


Look-ahead, look-behind buffer (loop buffer)
- Many conditional branch operations are used for loop control
- Expand the prefetch buffer so as to buffer the last few instructions executed in addition to the ones that are waiting to be executed
- If the buffer is big enough, the entire loop can be held in it, reducing the branch penalty

[Diagram: loop buffer, with the PC pointing between the previous (already executed) instructions and the pending instructions]


Branch prediction

Make a good guess as to which instruction will be executed next and start that one down the pipeline:
- If the guess turns out to be right, there is no loss of performance in the pipeline
- If the guess was wrong, empty the pipeline and restart with the correct instruction, suffering the full branch penalty

Static guesses: make the guess without considering the runtime history of the program
- Branch never taken
- Branch always taken
- Predict based on the opcode

Dynamic guesses: track the history of conditional branches in the program
- Taken / not taken switch
- History table


    Figure 11.16 Branch prediction using 2 history bits
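A minimal Python sketch of the two-history-bit scheme of Figure 11.16, expressed as a saturating counter per branch (the 0-3 encoding and the class interface are our conventions, not necessarily the figure's exact state labels):

    class TwoBitPredictor:
        """States 0-1 predict 'not taken'; states 2-3 predict 'taken'.
        One surprise only nudges the counter; it takes two in a row to
        flip the prediction -- the point of keeping two history bits."""
        def __init__(self):
            self.state = {}                         # branch address -> 0..3

        def predict(self, addr):
            return self.state.get(addr, 0) >= 2     # True means "taken"

        def update(self, addr, taken):
            c = self.state.get(addr, 0)
            self.state[addr] = min(c + 1, 3) if taken else max(c - 1, 0)

    # A loop branch taken 9 times, then falling through on exit:
    p = TwoBitPredictor()
    correct = 0
    for taken in [True] * 9 + [False]:
        correct += (p.predict(0x40) == taken)
        p.update(0x40, taken)
    print(correct, "of 10 correct")   # 7: two cold-start misses + the loop exit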


Superscalar

Implement the CPU such that more than one instruction can be performed (completed) at a time.

Involves replication of some or all parts of the CPU/ALU. Examples:
- Fetch multiple instructions at the same time
- Decode multiple instructions at the same time
- Perform an add and a multiply at the same time
- Perform loads/stores while performing an ALU operation

The degree of parallelism, and hence the speedup of the machine, goes up as more instructions are executed in parallel.


Figure 13.1 Comparison of superscalar and superpipeline operation to regular pipelines


Superscalar design limitations

Data dependencies: must ensure that computed results are the same as would be computed on a strictly sequential machine.
- Two instructions cannot be executed in parallel if the (data) output of one is the input of the other, or if they both write to the same output location
- Consider (annotated in the sketch after this list):

    S1: A = B + C
    S2: D = A + 1
    S3: B = E + F
    S4: A = E + 3

Resource dependencies
- In the above sequence of instructions, the adder unit gets a real workout!
- Parallelism is limited by the number of adders in the ALU
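Annotating S1-S4 above with the standard dependence names (the labels are ours; the slide does not name them):

    S1: A = B + C
    S2: D = A + 1    -- read-after-write on A from S1 (true dependence):
                        S2 cannot issue alongside S1
    S3: B = E + F    -- write-after-read on B against S1 (antidependence):
                        S3 must not write B before S1 has read it
    S4: A = E + 3    -- write-after-write on A against S1 (output dependence):
                        S1 and S4 must complete in program order or A is left
                        holding the older value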


Instruction issue policy: in what order are instructions issued to the execution unit, and in what order do they finish?

In-order issue, in-order completion
- Simplest method, but severely limits performance
- Strict ordering of instructions: data and procedural dependencies or resource conflicts delay all subsequent instructions
- Slow execution of some instructions delays all subsequent instructions

In-order issue, out-of-order completion
- Any number of instructions can be in execution at a time
- Instruction issue is still limited by resource conflicts or data and procedural dependencies
- Output dependencies resulting from out-of-order completion must be resolved
- Instruction interrupts can be tricky


Out-of-order issue, out-of-order completion
- Decode and execute stages are decoupled via an instruction buffer window
- Decoded instructions are stored in the window awaiting execution
- Functional units take instructions from the window in an attempt to stay busy; this can result in out-of-order execution:

    S1: A = B + C
    S2: D = E + 1
    S3: G = E + F
    S4: H = E * 3

- The antidependence class of data dependencies must be dealt with


Register renaming

Output dependencies and antidependencies are eliminated by the use of a register pool, as follows:
- For each instruction that writes to a register X, a new register X is instantiated
- Multiple register Xs can co-exist

Consider

    S1: R3 = R3 + R5
    S2: R4 = R3 + 1
    S3: R3 = R5 + 1
    S4: R7 = R3 + R4

which becomes (a Python sketch of this walk follows)

    S1: R3b = R3a + R5a
    S2: R4b = R3b + 1
    S3: R3c = R5a + 1
    S4: R7b = R3c + R4b
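A minimal Python sketch of the renaming walk above (the suffixes a, b, c stand in for physical registers drawn from the pool; the helper names are ours):

    def rename(instrs):
        # instrs: (dest, op, src1, src2); operands that don't start with 'R'
        # (literals such as '1') pass through unchanged.
        suffix = {}                              # architectural reg -> version
        def read(x):
            return x + suffix.get(x, 'a') if x.startswith('R') else x
        renamed = []
        for dest, op, s1, s2 in instrs:
            srcs = (read(s1), read(s2))          # read the current versions first
            suffix[dest] = chr(ord(suffix.get(dest, 'a')) + 1)  # fresh copy for the write
            renamed.append(f"{dest}{suffix[dest]} = {srcs[0]} {op} {srcs[1]}")
        return renamed

    prog = [("R3", "+", "R3", "R5"), ("R4", "+", "R3", "1"),
            ("R3", "+", "R5", "1"), ("R7", "+", "R3", "R4")]
    print(*rename(prog), sep="\n")
    # R3b = R3a + R5a
    # R4b = R3b + 1
    # R3c = R5a + 1
    # R7b = R3c + R4b

With R3 split into R3b and R3c, S3's output dependence on S1 and its antidependence on S2 both disappear; only the true (read-after-write) dependences remain.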


Impact on machine parallelism
- Adding (ALU) functional units without register renaming support may not be cost-effective: performance is limited by data dependencies
- Out-of-order issue benefits from large instruction buffer windows: it is easier for a functional unit to find a pending instruction


Summary

In this section, we have focused on the operation of the CPU:
- Registers and their use
- Instruction execution

We also investigated the implementation of modern CPUs:
- Pipelining
  - Basic concepts
  - Limitations to performance
- Superpipelining
- Superscalar